Hadoop at Meebo
Lessons learned in the real world
Vikram Oberoi
August 2010
Hadoop Day, Seattle
About me
- SDE Intern at Amazon, '07: R&D on item-to-item similarities
- Data Engineer Intern at Meebo, '08: built an A/B testing system
- CS at Stanford, '09: senior project on Ext3 and XFS under Hadoop MapReduce workloads
- Data Engineer at Meebo, '09-present: data infrastructure, analytics
About Meebo
- Products: browser-based IM client (www.meebo.com), mobile chat clients, social widgets (the Meebo Bar)
- Company: founded 2005; over 100 employees, 30 engineers
- Engineering: strong engineering culture; contributions to CouchDB, Lounge, and Hadoop components
The Problem
- Hadoop is powerful technology that meets today's demand for big data.
- But it's still a young platform, with evolving components and best practices.
- Real-world usage brings many challenges: day-to-day operational headaches, missing ecosystem features (e.g. recurring jobs), and lots of reinventing the wheel to solve these.
Purpose of this talk
- Discuss some real problems we've seen
- Explain our solutions
- Propose best practices so you can avoid these problems
What will I talk about?
Background:
- Meebo's data processing needs
- Meebo's pre- and post-Hadoop data pipelines
Lessons:
- Better workflow management: scheduling, reporting, monitoring, etc.; a look at Azkaban
- Get wiser about data serialization: Protocol Buffers (or Avro, or Thrift)
Meebo's Data Processing Needs
What do we use Hadoop for?
- ETL
- Analytics
- Behavioral targeting
- Ad hoc data analysis, research
The data produced helps power internal/external dashboards and our ad server.
What kind of data do we have?
Log data from all our products:
- The Meebo Bar
- Meebo Messenger (www.meebo.com)
- Android/iPhone/mobile web clients
- Rooms
- Meebo Me
- Meebo Notifier
- Firefox extension
How much data?
- 150MM uniques/month from the Meebo Bar
- Around 200 GB of uncompressed daily logs
- We process a subset of our logs
Meebo's Data Pipeline: Pre- and Post-Hadoop
A data pipeline in general
1. Data Collection
2. Data Processing
3. Data Storage
4. Workflow Management
Our data pipeline, pre-Hadoop
- Servers: Python/shell scripts pull log data
- Python/shell scripts process data
- Storage: MySQL, CouchDB, flat files
- Cron and wrapper shell scripts glue everything together
Our data pipeline, post-Hadoop
- Servers push logs to HDFS
- Pig scripts process data
- Storage: MySQL, CouchDB, flat files
- Azkaban, a workflow management system, glues everything together
Our transition to using Hadoop
- Deployed early '09
- Motivation: processing data took aaaages! Catalyst: Hadoop Summit
- Turbulent, time consuming: new tools, new paradigms, pitfalls
- Totally worth it: 24 hours to process a day's logs → under an hour
- Leap in ability to analyze our data; basis for new core product features
Workflow Management
What is workflow management?
What is workflow management?
- It's the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc.
- Most people use scripts and cron, but end up spending too much time managing them.
- We need a better way.
Workflow management consists of:
1. Executing jobs with arbitrarily complex dependency chains
Split up your jobs into discrete chunks with dependencies:
- Minimize impact when chunks fail
- Allow engineers to work on chunks separately
Monolithic scripts are no fun. An example pipeline, broken into chunks:
- Clean up data from log A
- Process data from log B
- Join data, train a classifier
- Post-processing
- Archive output
- Export to DB somewhere
Workflow management consists of:
1. Executing jobs with arbitrarily complex dependency chains
2. Scheduling recurring jobs to run at a given time
3. Monitoring job progress
4. Reporting when jobs fail and how long jobs take
5. Logging job execution and exposing logs so that engineers can deal with failures swiftly
6. Providing resource management capabilities
Resource management, illustrated: five jobs all running "Export to DB somewhere" against the same database at once. Don't DoS yourself.
The same five export jobs, now gated by a permit manager that limits how many can hit the DB at once.
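Purely to illustrate the permit idea, here is a small Python sketch using a process-local semaphore. The limit, function names, and threading setup are hypothetical; real export jobs run as separate processes, so they would rely on Azkaban's resource management or an external lock service rather than this in-process version.

# Illustration only: the "permit" idea, sketched with a process-local semaphore.
import threading
import time

MAX_CONCURRENT_EXPORTS = 2                     # hypothetical permit count
_permits = threading.BoundedSemaphore(MAX_CONCURRENT_EXPORTS)

def export_to_db(table, rows):
    """Acquire a permit before touching the shared DB, release it when done."""
    with _permits:                             # blocks until a permit is free
        print("exporting %d rows to %s" % (len(rows), table))
        time.sleep(1)                          # stand-in for the actual DB write

# Five export jobs contending for two permits: at most two run at once.
threads = [threading.Thread(target=export_to_db, args=("table_%d" % i, [1, 2, 3]))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()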
Don't roll your own scheduler!
- Building a good scheduling framework is hard: a myriad of small requirements, precise bookkeeping with many edge cases.
- Many roll their own. It's usually inadequate. So much repeated effort!
- Mold an existing framework to your requirements and contribute.
Two emerging frameworks
Oozie:
- Built at Yahoo
- Open-sourced at Hadoop Summit '10
- Used in production for [don't know]
- Packaged by Cloudera
Azkaban:
- Built at LinkedIn
- Open-sourced in March '10
- Used in production for over nine months as of March '10
- Now in use at Meebo
Azkaban
Azkaban jobs are bundles of configuration and code
Configuring a job

process_log_data.job:
type=command
command=python process_logs.py
failure.emails=datateam@whereiwork.com

process_logs.py:
import os
import sys
# Do useful things
Deploying a job
Step 1: Shove your config and code into a zip archive (process_log_data.zip, containing the .job and .py files).
Deploying a job
Step 2: Upload process_log_data.zip to Azkaban.
Scheduling a job
The Azkaban front-end:
What about dependencies?
get_users_widgets
A four-job flow: process_widgets.job and process_users.job feed join_users_widgets.job, which feeds export_to_db.job.
get_users_widgets

process_widgets.job:
type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_users.job:
type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com
get_users_widgets

join_users_widgets.job:
type=command
command=python join_users_widgets.py
failure.emails=datateam@whereiwork.com
dependencies=process_widgets,process_users

export_to_db.job:
type=command
command=python export_to_db.py
failure.emails=datateam@whereiwork.com
dependencies=join_users_widgets
get_users_widgets
get_users_widgets.zip contains the four .job files and their four .py scripts.
You deploy and schedule a job flow as you would a single job.
Hierarchical configuration

process_widgets.job:
type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_users.job:
type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com

This is silly. Can't I specify failure.emails globally?
azkaban-job-dir/
    system.properties
    get_users_widgets/
        process_widgets.job
        process_users.job
        join_users_widgets.job
        export_to_db.job
    some-other-job/
Hierarchical configuration

system.properties:
failure.emails=datateam@whereiwork.com
db.url=foo.whereiwork.com
archive.dir=/var/whereiwork/archive
What is type=command?
Azkaban supports a few ways to execute jobs:
- command: a Unix command in a separate process
- javaprocess: wrapper to kick off Java programs
- java: wrapper to kick off Runnable Java classes; can hook into Azkaban in useful ways
- pig: wrapper to run Pig scripts through Grunt
What's missing?
Scheduling and executing multiple instances of the same job at the same time.
Example: FOO runs hourly.
- The 3:00 PM run takes longer than expected, so it is still going when the 4:00 PM run starts: two instances of FOO at once.
- Or the 3:00 PM run fails and is restarted at 4:25 PM, overlapping the 4:00 PM run before the 5:00 PM run even starts.
What's missing?
- Scheduling and executing multiple instances of the same job at the same time: AZK-49, AZK-47. Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
- Passing arguments between jobs: write a library used by your jobs and put your arguments anywhere you want (a minimal sketch follows).
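The slide only names the approach; here is one minimal Python illustration of what such a shared library could look like. The module name, scratch directory, and function names are all made up for this example and are not part of Azkaban.

# job_args.py -- a tiny shared library for passing arguments between jobs.
import json
import os

ARGS_DIR = "/var/whereiwork/job-args"   # hypothetical shared scratch directory

def _path(flow, key):
    return os.path.join(ARGS_DIR, flow, key + ".json")

def put(flow, key, value):
    """Called by an upstream job to publish a value for downstream jobs."""
    path = _path(flow, key)
    if not os.path.isdir(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
    with open(path, "w") as f:
        json.dump(value, f)

def get(flow, key, default=None):
    """Called by a downstream job to read what an upstream job published."""
    try:
        with open(_path(flow, key)) as f:
            return json.load(f)
    except IOError:
        return default

For instance, join_users_widgets.py could call put("get_users_widgets", "output_path", ...) and export_to_db.py could read it back with get().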
What did we get out of it?
- No more monolithic wrapper scripts
- Massively reduced job setup time: it's configuration, not code!
- More code reuse, less hair pulling
- Still porting over jobs; it's time consuming
Data Serialization
What's the problem?
- Serializing data in simple formats (CSV, XML, etc.) is convenient.
- Problems arise when data changes: you need backwards compatibility.
- Does this really matter? Let's discuss.
v1: clickabutton.com (Username: / Password: / Go!)
Click a Button Analytics PRD
We want to know the number of unique users who clicked on the button:
- over an arbitrary range of time
- broken down by whether they're logged in or not
- with hour granularity
I KNOW!
Every hour, process logs and dump lines that look like this to HDFS with Pig:
unique_id,logged_in,clicked
I KNOW!
-- clicked and logged_in are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    clicked:int
);
-- Munge data according to the PRD
v2: clickabutton.com, now with two buttons (red and green).
Click a Button Analytics PRD
Break users down by which button they clicked, too.
I KNOW!
Every hour, process logs and dump lines that look like this to HDFS with Pig:
unique_id,logged_in,red_click,green_click
I KNOW!
-- red_clicked, green_clicked and logged_in are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD
v3: clickabutton.com, the red button has been removed.
Hmm.
Bad Solution 1
Remove red_click from the log line:
unique_id,logged_in,red_click,green_click
becomes
unique_id,logged_in,green_click
Why it's bad
Your script thinks green clicks are red clicks:
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD
Why it's bad
Now your script won't work for all the data you've collected so far:
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);
-- Munge data according to the PRD
I'll keep multiple scripts lying around
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);

LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    orange_clicked:int
);

My data has three fields. Which one do I use?
Bad Solution 2
Assign a sentinel value to red_click when it should be ignored, i.e. -1:
unique_id,logged_in,red_click,green_click
Why it's bad
- It's a waste of space.
- Sticking logic in your data is iffy.
The Preferable Solution
Serialize your data using backwards-compatible data structures!
Protocol Buffers and Elephant Bird
Protocol Buffers
- A serialization system (alternatives: Avro, Thrift)
- Compiles interface definitions to language modules that let you:
  - construct a data structure
  - access it (in a backwards-compatible way)
  - ser/deser the data structure in a standard, compact, binary format
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}
Compiled into .h/.cc, .java, and .py modules.
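For concreteness, here is roughly what using the generated Python module looks like, assuming protoc --python_out produced uniqueuser_pb2.py from the .proto above (the field values are made up).

# Assumes `protoc --python_out=. uniqueuser.proto` generated uniqueuser_pb2.py
from uniqueuser_pb2 import UniqueUser

# Construct and populate the message.
user = UniqueUser()
user.id = "bak49jsn"
user.logged_in = 1
user.red_clicked = 0

# Serialize to protobuf's compact binary format...
blob = user.SerializeToString()

# ...and deserialize it elsewhere.
same_user = UniqueUser()
same_user.ParseFromString(blob)
assert same_user.id == "bak49jsn"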
Elephant Bird
- Generates protobuf-based Pig load/store functions + lots more
- Developed at Twitter
- Blog post: http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
- Available at: http://www.github.com/kevinweil/elephant-bird
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}
Elephant Bird generates:
*.pig.load.UniqueUserLzoProtobufB64LinePigLoader
*.pig.store.UniqueUserLzoProtobufB64LinePigStorage
LzoProtobufB64?
LzoProtobufB64: serialization
(bak49jsn, 0, 1)
→ protobuf binary blob
→ Base64-encoded protobuf binary blob
→ LZO-compressed, Base64-encoded protobuf binary blob
LzoProtobufB64: deserialization
LZO-compressed, Base64-encoded protobuf binary blob
→ Base64-encoded protobuf binary blob
→ protobuf binary blob
→ (bak49jsn, 0, 1)
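A rough Python sketch of the two inner layers (protobuf, then Base64); the outer LZO layer is handled by Hadoop's LZO codec and Elephant Bird, so it is omitted here. The helper names are ours, not Elephant Bird's.

import base64
from uniqueuser_pb2 import UniqueUser  # generated as in the earlier sketch

def to_b64_line(user):
    """Protobuf binary blob -> Base64 line, ready for LZO-compressed storage."""
    return base64.b64encode(user.SerializeToString())

def from_b64_line(line):
    """Base64 line (after LZO decompression) -> UniqueUser message."""
    user = UniqueUser()
    user.ParseFromString(base64.b64decode(line))
    return user

u = UniqueUser(id="bak49jsn", logged_in=0, red_clicked=1)
assert from_b64_line(to_b64_line(u)).red_clicked == 1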
Setting it up
Prereqs:
- Protocol Buffers 2.3+
- LZO codec for Hadoop
Check out the docs: http://www.github.com/kevinweil/elephant-bird
Time to revisit
v1: clickabutton.com (Username: / Password: / Go!)
Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}
-- clicked and logged_in are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int
);
-- Munge data according to the PRD
v2: clickabutton.com, now with two buttons (red and green).
Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
    optional int32 green_clicked = 4;
}
-- clicked and logged_in are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD
v3: clickabutton.com, the red button has been removed.
No need to change your scripts.
They'll work on old and new data!
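To see why, here is a small check against the v2-generated Python module: under proto2 a message with only the original fields set serializes to the same bytes a v1 writer would have produced, and parsing it with the v2 class simply leaves green_clicked unset.

from uniqueuser_pb2 import UniqueUser  # compiled from the v2 .proto above

# A "v1-era" record: only the original fields are set. Under proto2, unset
# optional fields are absent from the wire format, so these bytes match what
# a v1 writer would have produced.
old_blob = UniqueUser(id="bak49jsn", logged_in=1, red_clicked=1).SerializeToString()

parsed = UniqueUser()
parsed.ParseFromString(old_blob)
assert parsed.red_clicked == 1
assert not parsed.HasField("green_clicked")   # new field: unset on old data
assert parsed.green_clicked == 0              # reads as the default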
