Vikram Oberoi presented lessons learned from using Hadoop in production at Meebo. He discussed how Meebo transitioned to using Hadoop for ETL and analytics due to the large volume of log data they process daily. He emphasized the importance of using a workflow manager like Azkaban to automate jobs and dependencies rather than scripts, and of using a backwards-compatible data serialization format like Protocol Buffers to avoid issues when data schemas change over time.
Hadoop at Meebo: Lessons in the Real World
1. Hadoop at Meebo: Lessons learned in the real world
Vikram Oberoi
August 2010
Hadoop Day, Seattle
2. About me
SDE Intern at Amazon, '07: R&D on item-to-item similarities
Data Engineer Intern at Meebo, '08: built an A/B testing system
CS at Stanford, '09: senior project on Ext3 and XFS under Hadoop MapReduce workloads
Data Engineer at Meebo, '09 to present: data infrastructure, analytics
3. About Meebo
Products: browser-based IM client (www.meebo.com), mobile chat clients, social widgets (the Meebo Bar)
Company: founded 2005; over 100 employees, 30 engineers
Engineering: strong engineering culture; contributions to CouchDB, Lounge, Hadoop components
4. The Problem
Hadoop is powerful technology that meets today's demand for big data.
But it's still a young platform, with evolving components and best practices,
and many challenges in real-world usage: day-to-day operational headaches, missing ecosystem features (e.g. recurring jobs), and lots of re-inventing the wheel to solve these.
5. Purpose of this talk
Discuss some real problems we've seen.
Explain our solutions.
Propose best practices so you can avoid these problems.
6. What will I talk about?
Background: Meebo's data processing needs; Meebo's pre- and post-Hadoop data pipelines.
Lessons: better workflow management (scheduling, reporting, monitoring, etc.) with a look at Azkaban; getting wiser about data serialization with Protocol Buffers (or Avro, or Thrift).
8. What do we use Hadoop for?
ETL, analytics, behavioral targeting, ad hoc data analysis and research.
The data produced helps power internal/external dashboards and our ad server.
9. What kind of data do we have?
Log data from all our products: the Meebo Bar, Meebo Messenger (www.meebo.com), Android/iPhone/mobile web clients, Rooms, Meebo Me, Meebo Notifier, the Firefox extension.
10. How much data?
150MM uniques/month from the Meebo Bar.
Around 200 GB of uncompressed daily logs.
We process a subset of our logs.
12. A data pipeline in general
1. Data collection
2. Data processing
3. Data storage
4. Workflow management
13. Our data pipeline, pre-Hadoop
Collection: Python/shell scripts pull log data from servers.
Processing: Python/shell scripts process the data.
Storage: MySQL, CouchDB, flat files.
Workflow management: cron and wrapper shell scripts glue everything together.
14. Our data pipeline, post-Hadoop
Collection: servers push logs to HDFS.
Processing: Pig scripts process the data.
Storage: MySQL, CouchDB, flat files.
Workflow management: Azkaban, a workflow management system, glues everything together.
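The deck doesn't show the push step itself; a minimal sketch of what an hourly log push might look like (paths hypothetical, assuming gzipped logs land on the servers first):

  hadoop fs -mkdir /logs/meebo_bar/2010-08-01/14
  hadoop fs -put /var/log/meebo/bar/2010-08-01-14*.gz /logs/meebo_bar/2010-08-01/14/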
15. Our transition to using Hadoop
Deployed early '09. Motivation: processing data took ages! Catalyst: Hadoop Summit.
The transition was turbulent and time-consuming: new tools, new paradigms, pitfalls.
Totally worth it: processing a day's logs went from 24 hours to under an hour, a leap in our ability to analyze our data, and the basis for new core product features.
18. What is workflow management?
It's the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc.
Most people use scripts and cron, but end up spending too much time managing their pipelines.
We need a better way.
22. Monolithic scripts are no fun
Clean up data from log A → process data from log B → join data, train a classifier → post-processing → archive output → export to a DB somewhere.
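A caricature of that kind of wrapper, as a hypothetical Python sketch (script names made up): everything runs in one fixed sequence, and a failure at any step means babysitting and re-running the whole thing.

  import subprocess

  def run(cmd):
      # each step shells out; if one fails, everything after it is skipped
      subprocess.check_call(cmd, shell=True)

  run("pig -f clean_log_a.pig")
  run("pig -f process_log_b.pig")
  run("pig -f join_and_train.pig")
  run("pig -f post_process.pig")
  run("hadoop fs -get /output/$(date +%F) /archive/")
  run("python export_to_db.py")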
23-27. Workflow management consists of:
Executing jobs with arbitrarily complex dependency chains.
Scheduling recurring jobs to run at a given time.
Monitoring job progress.
Reporting when jobs fail and how long they take.
Logging job execution and exposing logs so that engineers can deal with failures swiftly.
Providing resource management capabilities.
28. (Diagram: five jobs all exporting to the same DB at once.) Don't DoS yourself.
29. (Diagram: the same export jobs now acquire permits from a Permit Manager before writing to the DB, limiting how many exports run concurrently.)
30. Don't roll your own scheduler!
Building a good scheduling framework is hard: a myriad of small requirements and precise bookkeeping with many edge cases.
Many roll their own; it's usually inadequate. So much repeated effort!
Mold an existing framework to your requirements and contribute.
31. Two emerging frameworks
Oozie: built at Yahoo; open-sourced at Hadoop Summit '10; used in production for [don't know]; packaged by Cloudera.
Azkaban: built at LinkedIn; open-sourced in March '10; used in production for over nine months as of March '10; now in use at Meebo.
51. What is type=command?
Azkaban supports a few ways to execute jobs:
command: a Unix command in a separate process
javaprocess: wrapper to kick off Java programs
java: wrapper to kick off Runnable Java classes; can hook into Azkaban in useful ways
pig: wrapper to run Pig scripts through Grunt
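As a rough sketch of how this looks in practice (job and script names hypothetical): each job is a small properties file, and dependencies= replaces the monolithic wrapper's hard-coded ordering.

  # clean_log_a.job
  type=command
  command=pig -f clean_log_a.pig

  # process_log_b.job
  type=command
  command=pig -f process_log_b.pig

  # join_and_train.job: runs only after both upstream jobs succeed
  type=command
  command=pig -f join_and_train.pig
  dependencies=clean_log_a,process_log_b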
56. (Timeline diagram: an hourly job FOO; the 3:00 PM run fails and is restarted at 4:25 PM, overlapping the 4:00 PM and 5:00 PM scheduled runs.)
57-58. What's missing?
Scheduling and executing multiple jobs at the same time (AZK-49, AZK-47). Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
Passing arguments between jobs: write a library used by your jobs and put your arguments anywhere you want.
59. What did we get out of it?
No more monolithic wrapper scripts.
Massively reduced job setup time: it's configuration, not code!
More code reuse, less hair pulling.
Still porting over jobs; it's time consuming.
61. What's the problem?
Serializing data in simple formats (CSV, XML, etc.) is convenient.
Problems arise when data changes: you need backwards compatibility.
Does this really matter? Let's discuss.
63. Click-a-Button Analytics PRD
We want to know the number of unique users who clicked on the button:
over an arbitrary range of time,
broken down by whether they're logged in or not,
with hour granularity.
64. I KNOW!
Every hour, process logs and dump lines that look like this to HDFS with Pig:
unique_id,logged_in,clicked
65. I KNOW!
-- clicked and logged_in are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (unique_id:chararray, logged_in:int, clicked:int);
-- Munge data according to the PRD
67. Click-a-Button Analytics PRD
Break users down by which button they clicked, too.
68. I KNOW!
Every hour, process logs and dump lines that look like this to HDFS with Pig:
unique_id,logged_in,red_click,green_click
69. I KNOW!
-- red_clicked, green_clicked, and logged_in are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (unique_id:chararray, logged_in:int, red_clicked:int, green_clicked:int);
-- Munge data according to the PRD
72. Bad Solution 1
Remove red_click:
unique_id,logged_in,red_click,green_click → unique_id,logged_in,green_click
73. Why it's bad
Your existing script now thinks green clicks are red clicks:
LOAD '$IN' USING PigStorage(',') AS (unique_id:chararray, logged_in:int, red_clicked:int, green_clicked:int);
-- Munge data according to the PRD
74. Why it's bad
And if you update the script, it won't work for all the data you've collected so far:
LOAD '$IN' USING PigStorage(',') AS (unique_id:chararray, logged_in:int, green_clicked:int);
-- Munge data according to the PRD
76. My data has three fields. Which one do I use?
LOAD '$IN' USING PigStorage(',') AS (unique_id:chararray, logged_in:int, green_clicked:int);
or
LOAD '$IN' USING PigStorage(',') AS (unique_id:chararray, logged_in:int, orange_clicked:int);
77. Bad Solution 2
Assign a sentinel to red_click when it should be ignored, e.g. -1:
unique_id,logged_in,red_click,green_click
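A hypothetical Pig fragment (not from the deck) showing why the sentinel leaks everywhere: every script that reads the data now has to know about and filter out -1.

  users = LOAD '$IN' USING PigStorage(',') AS (unique_id:chararray, logged_in:int, red_clicked:int, green_clicked:int);
  -- every consumer must remember to drop the sentinel rows before counting
  real_red = FILTER users BY red_clicked != -1;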
81. Protocol Buffers
A serialization system (like Avro or Thrift).
Compiles interface definitions to language modules that let you:
construct a data structure,
access it (in a backwards-compatible way),
ser/deser the data structure in a standard, compact, binary format.
91. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto:
message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
}
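A minimal sketch (hypothetical, in Python) of the workflow described above: compile the interface with protoc, then construct, serialize, and deserialize UniqueUser records with the generated module.

  # protoc --python_out=. uniqueuser.proto   (generates uniqueuser_pb2.py)
  from uniqueuser_pb2 import UniqueUser

  u = UniqueUser(id="a1b2c3", logged_in=1, red_clicked=0)  # construct
  blob = u.SerializeToString()                             # compact binary encoding

  parsed = UniqueUser()
  parsed.ParseFromString(blob)                             # deserialize on the other side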
92. -- red_clicked and logged_in are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (unique_id:chararray, logged_in:int, red_clicked:int);
-- Munge data according to the PRD
94. Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto:
message UniqueUser {
  optional string id = 1;
  optional int32 logged_in = 2;
  optional int32 red_clicked = 3;
  optional int32 green_clicked = 4;
}
95. -- red_clicked, green_clicked, and logged_in are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader() AS (unique_id:chararray, logged_in:int, red_clicked:int, green_clicked:int);
-- Munge data according to the PRD
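To make the backwards compatibility concrete, a hypothetical Python check (not from the deck): bytes written before green_clicked existed parse fine under the new definition, and the missing optional field simply reads as its default.

  from uniqueuser_pb2 import UniqueUser

  # a record serialized with the old three-field schema has no field 4 on the wire
  old_blob = UniqueUser(id="a1b2c3", logged_in=1, red_clicked=1).SerializeToString()

  u = UniqueUser()                     # reader compiled from the new four-field .proto
  u.ParseFromString(old_blob)
  print(u.green_clicked)               # 0, the default for an unset optional int32
  print(u.HasField("green_clicked"))   # False, so you can tell it was never set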
99. Conclusion
Workflow management: use Azkaban, Oozie, or another framework. Don't use shell scripts and cron. Do this from day one! Transitioning later is expensive.
Data serialization: use Protocol Buffers, Avro, Thrift, or something else! Do this from day one, before it bites you.