Hadoop at Meebo
Lessons learned in the real world
Vikram Oberoi
August 2010
Hadoop Day, Seattle
About me
- SDE Intern at Amazon, '07: R&D on item-to-item similarities
- Data Engineer Intern at Meebo, '08: built an A/B testing system
- CS at Stanford, '09: senior project on Ext3 and XFS under Hadoop MapReduce workloads
- Data Engineer at Meebo, '09-present: data infrastructure, analytics
About Meebo
- Products: browser-based IM client (www.meebo.com), mobile chat clients, social widgets (the Meebo Bar)
- Company: founded 2005; over 100 employees, 30 engineers
- Engineering: strong engineering culture; contributions to CouchDB, Lounge, and Hadoop components
The Problem
- Hadoop is powerful technology that meets today's demand for big data.
- But it's still a young platform, with evolving components and best practices.
- Real-world usage brings many challenges: day-to-day operational headaches, missing ecosystem features (e.g. recurring jobs), and lots of reinventing the wheel to solve these.
Purpose of this talk
- Discuss some real problems we've seen
- Explain our solutions
- Propose best practices so you can avoid these problems
What will I talk about?
Background:
- Meebo's data processing needs
- Meebo's pre- and post-Hadoop data pipelines
Lessons:
- Better workflow management: scheduling, reporting, monitoring, etc.; a look at Azkaban
- Get wiser about data serialization: Protocol Buffers (or Avro, or Thrift)
Meebo's Data Processing Needs
What do we use Hadoop for?
- ETL
- Analytics
- Behavioral targeting
- Ad hoc data analysis, research
The data produced helps power internal/external dashboards and our ad server.
What kind of data do we have?
Log data from all our products:
- The Meebo Bar
- Meebo Messenger (www.meebo.com)
- Android/iPhone/mobile web clients
- Rooms
- Meebo Me
- Meebo Notifier
- Firefox extension
How much data?
- 150MM uniques/month from the Meebo Bar
- Around 200 GB of uncompressed daily logs
- We process a subset of our logs
Meebo's Data Pipeline: Pre- and Post-Hadoop
A data pipeline in general
1. Data Collection
2. Data Processing
3. Data Storage
4. Workflow Management
Our data pipeline, pre-Hadoop
- Servers: Python/shell scripts pull log data
- Python/shell scripts process data
- Storage: MySQL, CouchDB, flat files
- Cron and wrapper shell scripts glue everything together
Our data pipeline, post-Hadoop
- Servers push logs to HDFS
- Pig scripts process data
- Storage: MySQL, CouchDB, flat files
- Azkaban, a workflow management system, glues everything together
Our transition to using Hadoop
- Deployed early '09
- Motivation: processing data took aaaages! Catalyst: Hadoop Summit
- Turbulent, time consuming: new tools, new paradigms, pitfalls
- Totally worth it: 24 hours to process a day's logs → under an hour
- Leap in ability to analyze our data; basis for new core product features
Workflow Management
What is workflow management?
What is workflow management?
- It's the glue that binds your data pipeline together: scheduling, monitoring, reporting, etc.
- Most people use scripts and cron, but end up spending too much time managing them.
- We need a better way.
Workflow management consists of:
1. Executing jobs with arbitrarily complex dependency chains
Split up your jobs into discrete chunks with dependencies:
- Minimize impact when chunks fail
- Allow engineers to work on chunks separately
Monolithic scripts are no fun. An example pipeline, broken into chunks:
- Clean up data from log A
- Process data from log B
- Join data, train a classifier
- Post-processing
- Archive output
- Export to DB somewhere
Workflow management consists of:
1. Executing jobs with arbitrarily complex dependency chains
2. Scheduling recurring jobs to run at a given time
3. Monitoring job progress
4. Reporting when jobs fail and how long jobs take
5. Logging job execution and exposing logs so that engineers can deal with failures swiftly
6. Providing resource management capabilities
Resource management, illustrated: five jobs all running "Export to DB somewhere" against the same database at once. Don't DoS yourself.
The same five export jobs, now gated by a permit manager that limits how many can hit the DB at once.
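Purely to illustrate the permit idea, here is a small Python sketch using a process-local semaphore. The limit, function names, and threading setup are hypothetical; real export jobs run as separate processes, so they would rely on Azkaban's resource management or an external lock service rather than this in-process version.

# Illustration only: the "permit" idea, sketched with a process-local semaphore.
import threading
import time

MAX_CONCURRENT_EXPORTS = 2                     # hypothetical permit count
_permits = threading.BoundedSemaphore(MAX_CONCURRENT_EXPORTS)

def export_to_db(table, rows):
    """Acquire a permit before touching the shared DB, release it when done."""
    with _permits:                             # blocks until a permit is free
        print("exporting %d rows to %s" % (len(rows), table))
        time.sleep(1)                          # stand-in for the actual DB write

# Five export jobs contending for two permits: at most two run at once.
threads = [threading.Thread(target=export_to_db, args=("table_%d" % i, [1, 2, 3]))
           for i in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()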
Don't roll your own scheduler!
- Building a good scheduling framework is hard: a myriad of small requirements, precise bookkeeping with many edge cases.
- Many roll their own. It's usually inadequate. So much repeated effort!
- Mold an existing framework to your requirements and contribute.
Two emerging frameworks
Oozie:
- Built at Yahoo
- Open-sourced at Hadoop Summit '10
- Used in production for [don't know]
- Packaged by Cloudera
Azkaban:
- Built at LinkedIn
- Open-sourced in March '10
- Used in production for over nine months as of March '10
- Now in use at Meebo
Azkaban
Azkaban jobs are bundles of configuration and code
Configuring a job

process_log_data.job:
type=command
command=python process_logs.py
failure.emails=datateam@whereiwork.com

process_logs.py:
import os
import sys
# Do useful things
Deploying a job
Step 1: Shove your config and code into a zip archive (process_log_data.zip, containing the .job and .py files).
Deploying a job
Step 2: Upload process_log_data.zip to Azkaban.
Scheduling a job
The Azkaban front-end:
What about dependencies?
get_users_widgets
A four-job flow: process_widgets.job and process_users.job feed join_users_widgets.job, which feeds export_to_db.job.
get_users_widgets

process_widgets.job:
type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_users.job:
type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com
get_users_widgets

join_users_widgets.job:
type=command
command=python join_users_widgets.py
failure.emails=datateam@whereiwork.com
dependencies=process_widgets,process_users

export_to_db.job:
type=command
command=python export_to_db.py
failure.emails=datateam@whereiwork.com
dependencies=join_users_widgets
get_users_widgets
get_users_widgets.zip contains the four .job files and their four .py scripts.
You deploy and schedule a job flow as you would a single job.
Hierarchical configuration

process_widgets.job:
type=command
command=python process_widgets.py
failure.emails=datateam@whereiwork.com

process_users.job:
type=command
command=python process_users.py
failure.emails=datateam@whereiwork.com

This is silly. Can't I specify failure.emails globally?
azkaban-job-dir/
    system.properties
    get_users_widgets/
        process_widgets.job
        process_users.job
        join_users_widgets.job
        export_to_db.job
    some-other-job/
Hierarchical configuration

system.properties:
failure.emails=datateam@whereiwork.com
db.url=foo.whereiwork.com
archive.dir=/var/whereiwork/archive
What is type=command?
Azkaban supports a few ways to execute jobs:
- command: a Unix command in a separate process
- javaprocess: wrapper to kick off Java programs
- java: wrapper to kick off Runnable Java classes; can hook into Azkaban in useful ways
- pig: wrapper to run Pig scripts through Grunt
What's missing?
Scheduling and executing multiple instances of the same job at the same time.
Example: FOO runs hourly.
- The 3:00 PM run takes longer than expected, so it is still going when the 4:00 PM run starts: two instances of FOO at once.
- Or the 3:00 PM run fails and is restarted at 4:25 PM, overlapping the 4:00 PM run before the 5:00 PM run even starts.
What's missing?
- Scheduling and executing multiple instances of the same job at the same time: AZK-49, AZK-47. Stay tuned for complete, reviewed patch branches: www.github.com/voberoi/azkaban
- Passing arguments between jobs: write a library used by your jobs and put your arguments anywhere you want (a minimal sketch follows).
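The slide only names the approach; here is one minimal Python illustration of what such a shared library could look like. The module name, scratch directory, and function names are all made up for this example and are not part of Azkaban.

# job_args.py -- a tiny shared library for passing arguments between jobs.
import json
import os

ARGS_DIR = "/var/whereiwork/job-args"   # hypothetical shared scratch directory

def _path(flow, key):
    return os.path.join(ARGS_DIR, flow, key + ".json")

def put(flow, key, value):
    """Called by an upstream job to publish a value for downstream jobs."""
    path = _path(flow, key)
    if not os.path.isdir(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))
    with open(path, "w") as f:
        json.dump(value, f)

def get(flow, key, default=None):
    """Called by a downstream job to read what an upstream job published."""
    try:
        with open(_path(flow, key)) as f:
            return json.load(f)
    except IOError:
        return default

For instance, join_users_widgets.py could call put("get_users_widgets", "output_path", ...) and export_to_db.py could read it back with get().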
What did we get out of it?
- No more monolithic wrapper scripts
- Massively reduced job setup time: it's configuration, not code!
- More code reuse, less hair pulling
- Still porting over jobs; it's time consuming
Data Serialization
What's the problem?
- Serializing data in simple formats (CSV, XML, etc.) is convenient.
- Problems arise when data changes: you need backwards compatibility.
- Does this really matter? Let's discuss.
v1: clickabutton.com (Username: / Password: / Go!)
Click a Button Analytics PRD
We want to know the number of unique users who clicked on the button:
- over an arbitrary range of time
- broken down by whether they're logged in or not
- with hour granularity
I KNOW!
Every hour, process logs and dump lines that look like this to HDFS with Pig:
unique_id,logged_in,clicked
I KNOW!
-- clicked and logged_in are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    clicked:int
);
-- Munge data according to the PRD
v2: clickabutton.com, now with two buttons (red and green).
Click a Button Analytics PRD
Break users down by which button they clicked, too.
I KNOW!
Every hour, process logs and dump lines that look like this to HDFS with Pig:
unique_id,logged_in,red_click,green_click
I KNOW!
-- red_clicked, green_clicked and logged_in are either 0 or 1
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD
v3: clickabutton.com, the red button has been removed.
Hmm.
Bad Solution 1
Remove red_click from the log line:
unique_id,logged_in,red_click,green_click
becomes
unique_id,logged_in,green_click
Why it's bad
Your script thinks green clicks are red clicks:
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD
Why it's bad
Now your script won't work for all the data you've collected so far:
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);
-- Munge data according to the PRD
I'll keep multiple scripts lying around
LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    green_clicked:int
);

LOAD '$IN' USING PigStorage(',') AS (
    unique_id:chararray,
    logged_in:int,
    orange_clicked:int
);

My data has three fields. Which one do I use?
Bad Solution 2
Assign a sentinel value to red_click when it should be ignored, i.e. -1:
unique_id,logged_in,red_click,green_click
Why it's bad
- It's a waste of space.
- Sticking logic in your data is iffy.
The Preferable Solution
Serialize your data using backwards-compatible data structures!
Protocol Buffers and Elephant Bird
Protocol Buffers
- A serialization system (alternatives: Avro, Thrift)
- Compiles interface definitions to language modules that let you:
  - construct a data structure
  - access it (in a backwards-compatible way)
  - ser/deser the data structure in a standard, compact, binary format
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}
Compiled into .h/.cc, .java, and .py modules.
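For concreteness, here is roughly what using the generated Python module looks like, assuming protoc --python_out produced uniqueuser_pb2.py from the .proto above (the field values are made up).

# Assumes `protoc --python_out=. uniqueuser.proto` generated uniqueuser_pb2.py
from uniqueuser_pb2 import UniqueUser

# Construct and populate the message.
user = UniqueUser()
user.id = "bak49jsn"
user.logged_in = 1
user.red_clicked = 0

# Serialize to protobuf's compact binary format...
blob = user.SerializeToString()

# ...and deserialize it elsewhere.
same_user = UniqueUser()
same_user.ParseFromString(blob)
assert same_user.id == "bak49jsn"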
Elephant Bird
- Generates protobuf-based Pig load/store functions + lots more
- Developed at Twitter
- Blog post: http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
- Available at: http://www.github.com/kevinweil/elephant-bird
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}
Elephant Bird generates:
*.pig.load.UniqueUserLzoProtobufB64LinePigLoader
*.pig.store.UniqueUserLzoProtobufB64LinePigStorage
LzoProtobufB64?
LzoProtobufB64: serialization
(bak49jsn, 0, 1)
→ protobuf binary blob
→ Base64-encoded protobuf binary blob
→ LZO-compressed, Base64-encoded protobuf binary blob
LzoProtobufB64: deserialization
LZO-compressed, Base64-encoded protobuf binary blob
→ Base64-encoded protobuf binary blob
→ protobuf binary blob
→ (bak49jsn, 0, 1)
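A rough Python sketch of the two inner layers (protobuf, then Base64); the outer LZO layer is handled by Hadoop's LZO codec and Elephant Bird, so it is omitted here. The helper names are ours, not Elephant Bird's.

import base64
from uniqueuser_pb2 import UniqueUser  # generated as in the earlier sketch

def to_b64_line(user):
    """Protobuf binary blob -> Base64 line, ready for LZO-compressed storage."""
    return base64.b64encode(user.SerializeToString())

def from_b64_line(line):
    """Base64 line (after LZO decompression) -> UniqueUser message."""
    user = UniqueUser()
    user.ParseFromString(base64.b64decode(line))
    return user

u = UniqueUser(id="bak49jsn", logged_in=0, red_clicked=1)
assert from_b64_line(to_b64_line(u)).red_clicked == 1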
Setting it up
Prereqs:
- Protocol Buffers 2.3+
- LZO codec for Hadoop
Check out the docs: http://www.github.com/kevinweil/elephant-bird
Time to revisit
v1: clickabutton.com (Username: / Password: / Go!)
Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
}
-- clicked and logged_in are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int
);
-- Munge data according to the PRD
v2: clickabutton.com, now with two buttons (red and green).
Every hour, process logs and dump lines to HDFS that use this protobuf interface:
uniqueuser.proto:
message UniqueUser {
    optional string id = 1;
    optional int32 logged_in = 2;
    optional int32 red_clicked = 3;
    optional int32 green_clicked = 4;
}
-- clicked and logged_in are either 0 or 1
LOAD '$IN' USING myudfs.pig.load.UniqueUserLzoProtobufB64LinePigLoader AS (
    unique_id:chararray,
    logged_in:int,
    red_clicked:int,
    green_clicked:int
);
-- Munge data according to the PRD
v3: clickabutton.com, the red button has been removed.
No need to change your scripts.
They'll work on old and new data!
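To see why, here is a small check against the v2-generated Python module: under proto2 a message with only the original fields set serializes to the same bytes a v1 writer would have produced, and parsing it with the v2 class simply leaves green_clicked unset.

from uniqueuser_pb2 import UniqueUser  # compiled from the v2 .proto above

# A "v1-era" record: only the original fields are set. Under proto2, unset
# optional fields are absent from the wire format, so these bytes match what
# a v1 writer would have produced.
old_blob = UniqueUser(id="bak49jsn", logged_in=1, red_clicked=1).SerializeToString()

parsed = UniqueUser()
parsed.ParseFromString(old_blob)
assert parsed.red_clicked == 1
assert not parsed.HasField("green_clicked")   # new field: unset on old data
assert parsed.green_clicked == 0              # reads as the default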
