Building Data Products
using Hadoop at LinkedIn
Mitul Tiwari
Search, Network, and Analytics (SNA)
LinkedIn

Who am I?

What do I mean by Data Products?

People You May Know

Profile Stats: Who Viewed My Profile (WVMP)

Viewers of this profile also ...

Skills

InMaps

Data Products: Key Ideas

Recommendations
 People You May Know, Viewers of this profile ...

Analytics and Insight
 Profile Stats: Who Viewed My Profile, Skills

Visualization
 InMaps

Data Products: Challenges

 LinkedIn: 2nd largest social network

 120 million members on LinkedIn

 Billions of connections

 Billions of pageviews

 Terabytes of data to process

Outline
What do I mean by Data Products?

Systems and Tools we use

Let's build People You May Know

Managing workflow

Serving data in production

Data Quality

Performance

Systems and Tools

Kafka (LinkedIn)
 publish-subscribe messaging system
 transfer data from production to HDFS

Hadoop (Apache)
 Java MapReduce and Pig
 process data

Azkaban (LinkedIn)
 Hadoop workflow management tool
 to manage hundreds of Hadoop jobs

Voldemort (LinkedIn)
 Key-value store
 store output of Hadoop jobs and serve in production

People You May Know
 How do people know each other?

 [Figure: Alice connected to Bob and Carol]

 Triangle closing
 Prob(Bob knows Carol) ~ the # of common connections

Triangle Closing in Pig
-- connections in (source_id, dest_id) format in both directions
connections = LOAD 'connections' USING PigStorage() AS (source_id, dest_id);
group_conn = GROUP connections BY source_id;
-- generatePair is a UDF that emits pairs of a member's connections
pairs = FOREACH group_conn GENERATE
        generatePair(connections.dest_id) AS (id1, id2);

-- count the members that each (id1, id2) pair has in common
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
              FLATTEN(group) AS (source_id, dest_id),
              COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();

Pig Overview
Load: load data, specify format

Store: store data, specify format

Foreach, Generate: Projections, similar to select

Group by: group by column(s)

Join, Filter, Limit, Order, ...

User Defined Functions (UDFs)

Triangle Closing Example

[Figure: Alice connected to Bob and Carol]

1. connections = LOAD 'connections' USING PigStorage() AS (source_id, dest_id);
   (A,B),(B,A),(A,C),(C,A)
2. group_conn = GROUP connections BY source_id;
   (A,{B,C}),(B,{A}),(C,{A})
3. pairs = FOREACH group_conn GENERATE generatePair(connections.dest_id) AS (id1, id2);
   (A,{B,C}),(A,{C,B})
4. common_conn = GROUP pairs BY (id1, id2);
   common_conn = FOREACH common_conn GENERATE FLATTEN(group) AS (source_id, dest_id),
                 COUNT(pairs) AS common_connections;
   (B,C,1), (C,B,1)

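The four steps of the example can also be traced outside Pig. The following is a minimal Python sketch, not the production job: the function name `common_connection_counts` is hypothetical, and the pair generation is only an assumption about what the `generatePair` UDF does (emit every ordered pair of a member's connections).

```python
from collections import Counter, defaultdict
from itertools import permutations


def common_connection_counts(connections):
    """Count common connections for every pair of members.

    `connections` holds (source_id, dest_id) tuples, present in both
    directions, as in the Pig script's input.
    """
    # Step 2: GROUP connections BY source_id
    neighbours = defaultdict(set)
    for source_id, dest_id in connections:
        neighbours[source_id].add(dest_id)

    # Step 3 (assumed behaviour of generatePair): every ordered pair
    # of one member's connections shares that member.
    # Step 4: GROUP pairs BY (id1, id2) and COUNT
    counts = Counter()
    for dests in neighbours.values():
        for id1, id2 in permutations(sorted(dests), 2):
            counts[(id1, id2)] += 1
    return counts


# Step 1: the slide's input, Alice (A) connected to Bob (B) and Carol (C)
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "A")]
print(dict(common_connection_counts(edges)))
# {('B', 'C'): 1, ('C', 'B'): 1}
```

Unlike a Pig bag, the `set` used here silently drops duplicate input edges, which keeps the sketch short but is a simplification.
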
Our Workflow

 triangle-closing -> top-n -> push-to-prod

Our Workflow

 triangle-closing -> remove connections -> top-n -> push-to-qa, push-to-prod

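The two middle stages of this workflow can be illustrated with a short Python sketch. It assumes that "remove connections" means dropping candidate pairs that are already connected, and that "top-n" keeps the n highest-scoring candidates per member. The function name `top_n_candidates` and the sample data are hypothetical.

```python
from collections import defaultdict


def top_n_candidates(common_counts, connections, n):
    """Filter out existing connections, then keep the best n per member.

    common_counts: {(source_id, dest_id): number of common connections}
    connections:   iterable of (source_id, dest_id) existing edges
    """
    existing = set(connections)

    # remove connections: a member should not be recommended someone
    # they are already connected to (assumed meaning of the stage)
    candidates = defaultdict(list)
    for (source_id, dest_id), count in common_counts.items():
        if (source_id, dest_id) not in existing:
            candidates[source_id].append((dest_id, count))

    # top-n: highest count first, ties broken by id for determinism
    return {
        source_id: sorted(cands, key=lambda c: (-c[1], c[0]))[:n]
        for source_id, cands in candidates.items()
    }


common_counts = {("B", "C"): 2, ("B", "D"): 3, ("B", "A"): 5, ("C", "B"): 2}
connections = [("A", "B"), ("B", "A")]
print(top_n_candidates(common_counts, connections, n=1))
# {'B': [('D', 3)], 'C': [('B', 2)]}
```
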
PYMK Workflow

Workflow Requirements
Dependency management
Regular Scheduling
Monitoring
Diverse jobs: Java, Pig, Clojure
Configuration/Parameters
Resource control/locking
Restart/Stop/Retry
Visualization
History
Logs

Azkaban

Sample Azkaban Job Spec
type=pig
pig.script=top-n.pig
dependencies=remove-connections
top.n.size=100

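To show how `dependencies=` lines like the one above determine run order, here is a hedged Python sketch that parses key=value job specs and sorts the jobs topologically. It is not Azkaban's implementation. Only the `top-n` spec comes from the slide; the other three specs, their script names, and the helper names `parse_job_spec` and `execution_order` are assumptions for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def parse_job_spec(text):
    """Parse 'key=value' lines into a dict, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def execution_order(job_specs):
    """Return job names so that each job follows its dependencies."""
    graph = {}
    for name, text in job_specs.items():
        deps = parse_job_spec(text).get("dependencies", "")
        graph[name] = {d.strip() for d in deps.split(",") if d.strip()}
    return list(TopologicalSorter(graph).static_order())


job_specs = {
    # hypothetical specs for the neighbouring workflow stages
    "triangle-closing": "type=pig\npig.script=triangle-closing.pig",
    "remove-connections": "type=pig\npig.script=remove-connections.pig\n"
                          "dependencies=triangle-closing",
    # the spec shown on the slide
    "top-n": "type=pig\npig.script=top-n.pig\n"
             "dependencies=remove-connections\ntop.n.size=100",
    "push-to-prod": "dependencies=top-n",
}
print(execution_order(job_specs))
# ['triangle-closing', 'remove-connections', 'top-n', 'push-to-prod']
```
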
Azkaban Workflow

Production Storage

Requirements
 Large amount of data/Scalable

 Quick lookup/low latency

 Versioning and Rollback

 Fault tolerance

 Offline index building

Voldemort Storage

Large amount of data/Scalable

Quick lookup/low latency

Versioning and Rollback

Fault tolerance through replication

Read only

Offline index building

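The combination of read-only serving, offline building, and versioning with rollback can be pictured with a toy in-memory model. This is only a conceptual Python sketch under simplifying assumptions (one process, whole snapshots held in memory). It is not Voldemort's API, and the class and method names are hypothetical.

```python
class VersionedReadOnlyStore:
    """Toy model: serve lookups from the current snapshot only."""

    def __init__(self):
        self._versions = []   # snapshots built offline, oldest first
        self._current = -1    # index of the snapshot being served

    def swap(self, snapshot):
        """Start serving a newly built snapshot as the latest version."""
        # discard any versions newer than the one currently served
        del self._versions[self._current + 1:]
        self._versions.append(dict(snapshot))
        self._current = len(self._versions) - 1

    def rollback(self):
        """Go back to the previous snapshot, e.g. after a bad push."""
        if self._current <= 0:
            raise RuntimeError("no earlier version to roll back to")
        self._current -= 1

    def get(self, key, default=None):
        """Read-only lookup against the snapshot being served."""
        if self._current < 0:
            return default
        return self._versions[self._current].get(key, default)


store = VersionedReadOnlyStore()
store.swap({"bob": ["carol"]})    # first push of offline-built data
store.swap({"bob": ["dave"]})     # next push replaces it
print(store.get("bob"))           # ['dave']
store.rollback()
print(store.get("bob"))           # ['carol']
```

There is no `put` method: in this model, data changes only by swapping in a complete snapshot, which is what makes rollback a cheap pointer move.
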
Data Cycle

Voldemort RO Store

Data Quality

Verification

QA store with viewer

Explain

Versioning/Rollback

Unit tests

Performance

Symmetry
 If Bob knows Carol, then Carol knows Bob

Limit
 Ignore members with > k connections

Sampling
 Sample k connections

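These three ideas all reduce the number of pairs generated: a member with d connections yields d(d-1) ordered pairs, so the cost grows quadratically with d. The Python sketch below shows one way they might be applied. The function name `pruned_common_counts` and the `mode` and `seed` parameters are assumptions for illustration, not the production code.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def pruned_common_counts(connections, k, mode="limit", seed=0):
    """Count common connections with symmetry plus limit or sampling.

    mode="limit":  ignore members with more than k connections
    mode="sample": keep a random sample of k connections for them
    """
    rng = random.Random(seed)
    neighbours = defaultdict(set)
    for source_id, dest_id in connections:
        neighbours[source_id].add(dest_id)

    counts = Counter()
    for dests in neighbours.values():
        dests = sorted(dests)
        if len(dests) > k:
            if mode == "limit":
                continue                          # Limit
            dests = sorted(rng.sample(dests, k))  # Sampling
        # Symmetry: emit each unordered pair once, with id1 < id2
        for id1, id2 in combinations(dests, 2):
            counts[(id1, id2)] += 1
    return counts


# A is connected to B, C, and D (edges stored in both directions)
edges = [(a, b) for a, b in [("A", "B"), ("A", "C"), ("A", "D")]]
edges += [(b, a) for a, b in edges]
print(dict(pruned_common_counts(edges, k=3)))
# {('B', 'C'): 1, ('B', 'D'): 1, ('C', 'D'): 1}
```

With symmetry, the count for (Carol, Bob) is read from the (Bob, Carol) entry, which halves the intermediate data.
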
Things Covered
What do I mean by Data Products?

Systems and Tools we use

Let's build People You May Know

Managing workflow

Serving data in production

Data Quality

Performance

SNA Team

Thanks to the SNA Team at LinkedIn

http://sna-projects.com

We are hiring!

Questions?
