This document discusses building data products at LinkedIn using Hadoop. It describes how LinkedIn builds recommendation products such as "People You May Know" by processing member connection data with Hadoop tools. The workflow uses Kafka to transfer data to HDFS, Pig and MapReduce to process the data, Azkaban to manage Hadoop jobs, and Voldemort to store results and serve recommendations to members. Triangle closing algorithms in Pig find common connections between members and predict potential new connections. The results are pushed to production systems to power features like "People You May Know".
9. Data Products: Key Ideas
Recommendations
People You May Know, Viewers of this profile ...
Analytics and Insight
Profile Stats: Who Viewed My Profile, Skills
Visualization
InMaps
10. Data Products: Challenges
LinkedIn: 2nd largest social network
120 million members on LinkedIn
Billions of connections
Billions of pageviews
Terabytes of data to process
11. Outline
What do I mean by Data Products?
Systems and Tools we use
Let's build People You May Know
Managing workflow
Serving data in production
Data Quality
Performance
12. Systems and Tools
Kafka (LinkedIn)
Hadoop (Apache)
Azkaban (LinkedIn)
Voldemort (LinkedIn)
13. Systems and Tools
Kafka
publish-subscribe messaging system
transfer data from production to HDFS
Hadoop
Azkaban
Voldemort
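The publish-subscribe idea behind Kafka can be illustrated with a small in-memory sketch. This is not Kafka's API: the `Topic` class, its methods, the subscriber names, and the event fields are all hypothetical and exist only to show that publishers append to a log while each subscriber reads from its own position.

```python
from collections import defaultdict


class Topic:
    """Toy append-only log; NOT the Kafka API, only the pub-sub idea."""

    def __init__(self):
        self.log = []                    # published messages, in order
        self.offsets = defaultdict(int)  # next unread position per subscriber

    def publish(self, message):
        self.log.append(message)

    def poll(self, subscriber):
        """Return the messages this subscriber has not yet seen."""
        start = self.offsets[subscriber]
        self.offsets[subscriber] = len(self.log)
        return self.log[start:]


topic = Topic()
topic.publish({"event": "connection", "source_id": 1, "dest_id": 2})
topic.publish({"event": "connection", "source_id": 2, "dest_id": 1})

# A hypothetical loader that would copy events to HDFS reads at its own pace.
first_batch = topic.poll("hdfs_loader")
topic.publish({"event": "connection", "source_id": 1, "dest_id": 3})
second_batch = topic.poll("hdfs_loader")

# A second, independent subscriber still sees every message.
monitoring_batch = topic.poll("monitoring")
```

Because each subscriber keeps its own offset, a slow batch loader does not hold back other consumers of the same data.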
18. People You May Know
How do people know each other?
[Diagram: Alice is connected to Bob and Carol]
21. People You May Know
How do people know each other?
[Diagram: Alice is connected to Bob and Carol]
Triangle closing
Prob(Bob knows Carol) ~ the # of common connections
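As a rough illustration of the triangle closing idea, the number of common connections can be computed as a set intersection and used to rank candidates. This is a minimal sketch in plain Python, not the production method: the function names and the toy graph (including Dave and Erin) are hypothetical, and the count is used as a raw score rather than a calibrated probability.

```python
def common_connection_count(graph, a, b):
    """Number of members connected to both a and b."""
    return len(graph[a] & graph[b])


def rank_candidates(graph, member):
    """Members not yet connected to `member`, ordered by shared connections."""
    scores = {}
    for other in graph:
        if other == member or other in graph[member]:
            continue
        shared = common_connection_count(graph, member, other)
        if shared > 0:
            scores[other] = shared
    # Highest count first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy graph: member -> set of direct connections (symmetric).
graph = {
    "Alice": {"Bob", "Carol"},
    "Dave": {"Bob", "Carol"},
    "Bob": {"Alice", "Dave"},
    "Carol": {"Alice", "Dave", "Erin"},
    "Erin": {"Carol"},
}

suggestions = rank_candidates(graph, "Alice")
print(suggestions)
```

In this toy graph Dave shares two connections with Alice and Erin shares one, so Dave ranks first.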
22. Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) as (id1, id2);
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    flatten(group) as (source_id, dest_id),
    COUNT(pairs) as common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
23. Pig Overview
Load: load data, specify format
Store: store data, specify format
Foreach, Generate: Projections, similar to select
Group by: group by column(s)
Join, Filter, Limit, Order, ...
User Defined Functions (UDFs)
29. Triangle Closing Example
[Diagram: Alice is connected to Bob and Carol]
Step 1 statement:
connections = LOAD 'connections' USING PigStorage();
1. (A,B),(B,A),(A,C),(C,A)
2. (A,{B,C}),(B,{A}),(C,{A})
3. (A,{B,C}),(A,{C,B})
4. (B,C,1), (C,B,1)
30. Triangle Closing Example
[Diagram: Alice is connected to Bob and Carol]
Step 2 statement:
group_conn = GROUP connections BY source_id;
1. (A,B),(B,A),(A,C),(C,A)
2. (A,{B,C}),(B,{A}),(C,{A})
3. (A,{B,C}),(A,{C,B})
4. (B,C,1), (C,B,1)
31. Triangle Closing Example
[Diagram: Alice is connected to Bob and Carol]
Step 3 statement:
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) as (id1, id2);
1. (A,B),(B,A),(A,C),(C,A)
2. (A,{B,C}),(B,{A}),(C,{A})
3. (A,{B,C}),(A,{C,B})
4. (B,C,1), (C,B,1)
32. Triangle Closing Example
[Diagram: Alice is connected to Bob and Carol]
Step 4 statements:
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    flatten(group) as (source_id, dest_id),
    COUNT(pairs) as common_connections;
1. (A,B),(B,A),(A,C),(C,A)
2. (A,{B,C}),(B,{A}),(C,{A})
3. (A,{B,C}),(A,{C,B})
4. (B,C,1), (C,B,1)
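The four numbered steps of this example can be reproduced locally with a short plain-Python sketch. It is a stand-in for the Pig job, not a translation of it: `generate_pairs` is a hypothetical substitute for the `generatePair` UDF (whose source is not shown in the slides), assumed here to emit every ordered pair of a member's connections.

```python
from collections import Counter, defaultdict
from itertools import permutations


def generate_pairs(dest_ids):
    """Hypothetical stand-in for the generatePair UDF: ordered pairs of distinct ids."""
    return list(permutations(sorted(dest_ids), 2))


def triangle_closing(connections):
    # Step 2: group destination ids by source id.
    grouped = defaultdict(set)
    for source_id, dest_id in connections:
        grouped[source_id].add(dest_id)
    # Step 3: every pair of one member's connections shares that member.
    pairs = []
    for dest_ids in grouped.values():
        pairs.extend(generate_pairs(dest_ids))
    # Step 4: count how many times each pair occurs.
    return Counter(pairs)


# Step 1: connections stored in both directions (A = Alice, B = Bob, C = Carol).
connections = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
common_conn = triangle_closing(connections)
print(sorted(common_conn.items()))
```

Bob and Carol each have only Alice as a connection, so they contribute no pairs; the single shared connection comes from Alice's group.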
50. Production Storage
Requirements
Large amount of data/Scalable
Quick lookup/low latency
Versioning and Rollback
Fault tolerance
Offline index building
51. Voldemort Storage
Large amount of data/Scalable
Quick lookup/low latency
Versioning and Rollback
Fault tolerance through replication
Read only
Offline index building
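The combination of read-only serving, versioning, and rollback can be sketched with a toy in-memory store. This is an illustration of the properties listed above under simplifying assumptions, not Voldemort's API: the `ReadOnlyStore` class and its `swap`, `rollback`, and `get` methods are hypothetical names, and a real deployment would build the snapshot offline on Hadoop and replicate it across nodes.

```python
class ReadOnlyStore:
    """Toy versioned read-only store; names are illustrative, not Voldemort's API."""

    def __init__(self):
        self.versions = []  # snapshots built offline, oldest first
        self.current = -1   # index of the snapshot being served

    def swap(self, snapshot):
        """Start serving a newly built snapshot; older ones are kept."""
        self.versions.append(dict(snapshot))
        self.current = len(self.versions) - 1

    def rollback(self):
        """Return to the previous snapshot, for example after a bad push."""
        if self.current <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self.current -= 1

    def get(self, key, default=None):
        """Look up a key in the snapshot currently being served."""
        if self.current < 0:
            return default
        return self.versions[self.current].get(key, default)


store = ReadOnlyStore()
store.swap({"Bob": ["Carol"]})  # first offline build
store.swap({"Bob": []})         # a later build that turns out to be bad
before_rollback = store.get("Bob")
store.rollback()
after_rollback = store.get("Bob")
```

Since clients never write to the store, a rollback only changes which snapshot is served and needs no data repair.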
61. Performance
Symmetry
If Bob knows Carol, then Carol knows Bob
Limit
Ignore members with > k connections
Sampling
Sample k connections
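One way these three optimizations could be applied is sketched below in plain Python. The interpretation is an assumption: symmetry is exploited by emitting each pair only once with id1 < id2, the limit skips members above `max_degree`, and sampling keeps at most `sample_size` connections per member. The function name, parameters, and toy graph are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations


def common_connections(adjacency, max_degree=None, sample_size=None, seed=0):
    """Count common connections per unordered pair (id1 < id2)."""
    rng = random.Random(seed)
    counts = Counter()
    for member, neighbors in adjacency.items():
        neighbors = sorted(neighbors)
        # Limit: skip members with more than max_degree connections.
        if max_degree is not None and len(neighbors) > max_degree:
            continue
        # Sampling: keep at most sample_size connections of each member.
        if sample_size is not None and len(neighbors) > sample_size:
            neighbors = sorted(rng.sample(neighbors, sample_size))
        # Symmetry: emit each pair once, with id1 < id2.
        counts.update(combinations(neighbors, 2))
    return counts


# Hypothetical toy graph in which H is a highly connected member.
adjacency = {
    "A": {"B", "C", "H"},
    "B": {"A", "H"},
    "C": {"A", "H"},
    "D": {"H"},
    "H": {"A", "B", "C", "D"},
}

full = common_connections(adjacency)
limited = common_connections(adjacency, max_degree=3)
sampled = common_connections(adjacency, sample_size=2)
```

A member with n connections produces n(n-1)/2 unordered pairs, so capping or sampling the connections of highly connected members bounds the quadratic growth of the intermediate data.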
62. Things Covered
What do I mean by Data Products?
Systems and Tools we use
Let's build People You May Know
Managing workflow
Serving data in production
Data Quality
Performance
63. SNA Team
Thanks to SNA Team at LinkedIn
http://sna-projects.com
We are hiring!