際際滷

際際滷Share a Scribd company logo
River
A data workflow management system


Harel Ben Attia
Senior Software Engineer
 Tens of Billions of Recommendations per month
 Most major publishers in the World
 Hundreds GBs of new data every day
Context
 Data Processing Workflows

 Multiple Types of Processing
   Rollups, Grouping, Filtering, Algorithm
    Calculations


 Multiple Stages of Processing
   Using the output of other processes as input
Problems
 Dependency Management
   Hardcoded into code/scripts
   Time-based using cron or another scheduler


 Logic is scattered around the system
   Developers need to take care of
    monitoring, alerts, permissions etc.
   Multiple Locations of Execution
River


Data Processing
 Management
 Infrastructure
River
 Execution Management
      Full Execution History and Filtering
      Monitoring and Actionable Alerting      Ops / NOC
      Automatic Retries
      Web UI

 Ease of Development
    Declarative Data Processing Definitions
    Decentralized                             Developers
         Shared Data, separate development
    JobLogs

 Data Driven Dependencies
    Why?
Other Approaches




      A                B           C
         Option 1              Option 2



A          B C         t   A   J     B    C
            J
Other Approaches



                   Option 2



   A J B C                    t
Other Approaches




               D Fails
D sends email

Developer of D
still works here


        Where is the code?
Other Approaches



                     2am is a
D=                   great hour for
                     troubleshooting!
       Data from C is missing


C=                   The data of C
                     is all there!
Other Approaches


                               X:37 seems like a
                             good time C never
                              finished after X:30
                                    anyway




A B C                                               t


Job J has been working for
 more than a week before
        the incident
                                      D
Other Approaches


Need to rerun processes B, C and D



Which hours failed?

How to run all of them for the specific hours?
  Without running A again?
  Without colliding with ongoing executions?
Other Approaches
                       A will never take more
            than 15 minutes, so X:20 is more than enough



           A
        X:00
                                                            t


                                      J
A WILL eventually take longer
River
 Execution Management
      Full Execution History + Filtering and Searching
      Monitoring and Actionable Alerting
      Automatic Retries
      Web UI
      JobLogs

 Ease of Development
    Declarative Data Processing Definitions
    Decentralized
         Shared Data, separate development


 Data Driven Dependencies
    Why?
                 Robustness              Reliability      Parallelism
River

What?            When?




Where?           How?
Execution Layer  the What
       Every data processing task is called a Job

           A Job can contain multiple Steps

   Importing from MySQL to Hive
   Hive Queries
   JDBC Queries
   Transfer data from Hive into MySQL and to Cassandra
   Running External Commands:
    MapReduce, Java, bash, Legacy code, etc.

                  Jobs use Parameters
Scheduling Layer  the When
        Each job registers to an event, which will trigger its execution

                 Each job emits an event at job completion




Events that describe Data Availability            Events that are time dependent
The How and the Where
             Both handled by the infrastructure

 Integration to other systems
    Connecting to Hive/Hadoop/Cassandra        Logical names to
                                                 all data sources
    Connecting to JDBC Databases
                                      readOnlyDataWarehouse
    Retries, throttling, timeouts    productionCassandra


 Monitoring and Alerts               Centralized Management, email
                                       notifications and dashboards


 Location of Execution              Actual location is hidden from the
                                               developer/ops
River UI




           FailDownload JobLog
               Job and Dependents
                  Restart Job
Monitoring Dashboard
Monitoring Dashboard
Steps

Copy Data From JDBC to Hive
sourceDB = productionDatabase
sourceTable = myRawData
targetCluster = onlineHadoopCluster
targetHiveTable = rawDataTable
Filter = date=#handledDate#




     Steps only contain what needs to be done
A bit more about triggers
            Triggers have parameters as well

          Date=2012-10-10,hour=15              Date=2012-10-10,hour=19




Parameters Propagate through jobs and to other triggers
Developers Point-of-View


 Automatic
   Retries
 Parameters
Pass-through
Trigger Queue                Execution Queue




           River
                      Trigger               Execution       Spring
                     Manager                Manager         Batch
Topology                                                               Spring Batch DB



            Hive/Hadoop            OS       Cassandra        JDBC
              Interface         Interface    Inerface      Interface




                                   External
                                   Systems
Dependencies
for detailed example
Trigger Queue                               Execution Queue
      Date=2012-01-02   T1                           T2
                                                     T3         Job1,Job2
                                                                Job3
          hour=03                                         Date=2012-01-02
                                                     Date=2012-01-02
                                                              hour=03
                                                         hour=03
                                                                                                  Job1
                                                                                                  Job2
                        River                 T1
                                              T3
                                              T2                                   Job3
                                                                                   Job1,Job2


                                   Trigger                         Execution             Spring
                                  Manager                          Manager               Batch
           Job1,Job2
           Job3
Topology                                                                                                 Spring Batch DB
                                       (from Job1)                                       (from Job2)


                         Hive/Hadoop               OS              Cassandra              JDBC
                           Interface           Interface            Inerface            Interface




                                                   External
                                                   Systems
Success Example
UI
                          Trigger Queue                      Execution Queue                             Job2
                                           T3         Job3                              Job2
                                           Date=2012-01-02                      Date=2012-01-02
                                               hour=03                              hour=03
                                                                                      Job2
                                                                                      Job2
                  River             T3                               Job2
                                                                     Job3


                             Trigger                     Execution           Spring
                            Manager                      Manager             Batch
           Job3
Topology                                                                                          Spring Batch DB



                   Hive/Hadoop            OS             Cassandra            JDBC
                     Interface         Interface          Inerface          Interface




                                          External
                                          Systems
Failure Example
Notable Features
 Parameter Enrichment
    Example: #beginningOfMonth

 Precondition Expressions
    Example: isLastDayOfMonth(#handleDate)

 Data Comparison Capabilities
    Data Validations
    Supports Tolerance
       Absolute and Percentage margins


 Command Line and Java Clients
River at
   6 River Instances Running
   5 Teams
   ~4100 Jobs running every day
   ~50 Different Job Types

 Job Failures due to environment issues have
  almost no overhead
 Automatic restarts of jobs when data arrives late
Illustration by Chris Whetzel


                 Future Plans
   Multiple Dependencies
   Offline Job Testing Capabilities
   Improved DSL for Job Definitions
   Support for Master/Worker River machines
   Job Priorities
   Analysis Tools


    Outbrain is working on Open Sourcing River
Questions
Thank You


Harel Ben Attia                    @harelba on Twitter
harel@outbrain.com   http://www.linkedin.com/in/harelba

More Related Content

Viewers also liked (9)

DeskPro - AFPC patented UPS System
DeskPro - AFPC patented UPS SystemDeskPro - AFPC patented UPS System
DeskPro - AFPC patented UPS System
Mohammed Aggabi
Hybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid Generation
Hybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid GenerationHybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid Generation
Hybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid Generation
Mohammed Aggabi
Scorpions by Kruchinina Nadya
Scorpions by Kruchinina NadyaScorpions by Kruchinina Nadya
Scorpions by Kruchinina Nadya
aesc-msu
Proposta e estatisticas sakateampro english
Proposta e estatisticas sakateampro englishProposta e estatisticas sakateampro english
Proposta e estatisticas sakateampro english
Jefferson Carneiro
I brid3.0
I brid3.0I brid3.0
I brid3.0
Mohammed Aggabi
#OrtechPowerSolutions #Consulting Episode1
#OrtechPowerSolutions #Consulting Episode1#OrtechPowerSolutions #Consulting Episode1
#OrtechPowerSolutions #Consulting Episode1
Mohammed Aggabi
Musics influence by Mikheyeva L., Knyazeva A.
Musics influence by Mikheyeva L., Knyazeva A. Musics influence by Mikheyeva L., Knyazeva A.
Musics influence by Mikheyeva L., Knyazeva A.
aesc-msu
Etude de Cas- Traite Vaches laiti竪res
Etude de Cas-  Traite Vaches laiti竪resEtude de Cas-  Traite Vaches laiti竪res
Etude de Cas- Traite Vaches laiti竪res
Mohammed Aggabi
Splin by Zamyatina Ekaterina
Splin by Zamyatina Ekaterina Splin by Zamyatina Ekaterina
Splin by Zamyatina Ekaterina
aesc-msu
DeskPro - AFPC patented UPS System
DeskPro - AFPC patented UPS SystemDeskPro - AFPC patented UPS System
DeskPro - AFPC patented UPS System
Mohammed Aggabi
Hybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid Generation
Hybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid GenerationHybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid Generation
Hybrid Solar Inverter 5kVA/4.2kW/48VDC/400Ah Off-Grid Generation
Mohammed Aggabi
Scorpions by Kruchinina Nadya
Scorpions by Kruchinina NadyaScorpions by Kruchinina Nadya
Scorpions by Kruchinina Nadya
aesc-msu
Proposta e estatisticas sakateampro english
Proposta e estatisticas sakateampro englishProposta e estatisticas sakateampro english
Proposta e estatisticas sakateampro english
Jefferson Carneiro
#OrtechPowerSolutions #Consulting Episode1
#OrtechPowerSolutions #Consulting Episode1#OrtechPowerSolutions #Consulting Episode1
#OrtechPowerSolutions #Consulting Episode1
Mohammed Aggabi
Musics influence by Mikheyeva L., Knyazeva A.
Musics influence by Mikheyeva L., Knyazeva A. Musics influence by Mikheyeva L., Knyazeva A.
Musics influence by Mikheyeva L., Knyazeva A.
aesc-msu
Etude de Cas- Traite Vaches laiti竪res
Etude de Cas-  Traite Vaches laiti竪resEtude de Cas-  Traite Vaches laiti竪res
Etude de Cas- Traite Vaches laiti竪res
Mohammed Aggabi
Splin by Zamyatina Ekaterina
Splin by Zamyatina Ekaterina Splin by Zamyatina Ekaterina
Splin by Zamyatina Ekaterina
aesc-msu

Similar to Outbrain River Presentation at Reversim Summit 2013 (20)

MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
Lee Stott
From ddd to DDD : My journey from data-driven development to Domain-Driven De...
From ddd to DDD : My journey from data-driven development to Domain-Driven De...From ddd to DDD : My journey from data-driven development to Domain-Driven De...
From ddd to DDD : My journey from data-driven development to Domain-Driven De...
Thibaud Desodt
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
Ovidiu Dimulescu
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013
Nathan Bijnens
Nov 2011 HUG: Oozie
Nov 2011 HUG: Oozie Nov 2011 HUG: Oozie
Nov 2011 HUG: Oozie
Yahoo Developer Network
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
Dilip Reddy
Batching and Java EE (jdk.io)
Batching and Java EE (jdk.io)Batching and Java EE (jdk.io)
Batching and Java EE (jdk.io)
Ryan Cuprak
Oozie hugnov11
Oozie hugnov11Oozie hugnov11
Oozie hugnov11
mislam77
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221a
Schubert Zhang
Partitioning CCGrid 2012
Partitioning CCGrid 2012Partitioning CCGrid 2012
Partitioning CCGrid 2012
Weiwei Chen
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
Oozie sweet
Oozie sweetOozie sweet
Oozie sweet
mislam77
Apache Tez Present and Future
Apache Tez  Present and FutureApache Tez  Present and Future
Apache Tez Present and Future
DataWorks Summit
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
John Adams
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of View
aragozin
Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard way
J On The Beach
Scientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceScientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution Service
Angelo Corsaro
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
liujianrong
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
Roger Xia
MEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop MicrosoftMEW22 22nd Machine Evaluation Workshop Microsoft
MEW22 22nd Machine Evaluation Workshop Microsoft
Lee Stott
From ddd to DDD : My journey from data-driven development to Domain-Driven De...
From ddd to DDD : My journey from data-driven development to Domain-Driven De...From ddd to DDD : My journey from data-driven development to Domain-Driven De...
From ddd to DDD : My journey from data-driven development to Domain-Driven De...
Thibaud Desodt
Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013Microsoft Big Data @ SQLUG 2013
Microsoft Big Data @ SQLUG 2013
Nathan Bijnens
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
Dilip Reddy
Batching and Java EE (jdk.io)
Batching and Java EE (jdk.io)Batching and Java EE (jdk.io)
Batching and Java EE (jdk.io)
Ryan Cuprak
Oozie hugnov11
Oozie hugnov11Oozie hugnov11
Oozie hugnov11
mislam77
Hanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221aHanborq optimizations on hadoop map reduce 20120221a
Hanborq optimizations on hadoop map reduce 20120221a
Schubert Zhang
Partitioning CCGrid 2012
Partitioning CCGrid 2012Partitioning CCGrid 2012
Partitioning CCGrid 2012
Weiwei Chen
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
Pallav Jha
Oozie sweet
Oozie sweetOozie sweet
Oozie sweet
mislam77
Apache Tez Present and Future
Apache Tez  Present and FutureApache Tez  Present and Future
Apache Tez Present and Future
DataWorks Summit
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
John Adams
What is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache SparkWhat is Distributed Computing, Why we use Apache Spark
What is Distributed Computing, Why we use Apache Spark
Andy Petrella
High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of View
aragozin
Massively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard wayMassively scalable ETL in real world applications: the hard way
Massively scalable ETL in real world applications: the hard way
J On The Beach
Scientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution ServiceScientific Applications of The Data Distribution Service
Scientific Applications of The Data Distribution Service
Angelo Corsaro
Fixing_Twitter
Fixing_TwitterFixing_Twitter
Fixing_Twitter
liujianrong
Fixing twitter
Fixing twitterFixing twitter
Fixing twitter
Roger Xia

Outbrain River Presentation at Reversim Summit 2013

  • 1. River A data workflow management system Harel Ben Attia Senior Software Engineer
  • 2. Tens of Billions of Recommendations per month Most major publishers in the World Hundreds GBs of new data every day
  • 3. Context Data Processing Workflows Multiple Types of Processing Rollups, Grouping, Filtering, Algorithm Calculations Multiple Stages of Processing Using the output of other processes as input
  • 4. Problems Dependency Management Hardcoded into code/scripts Time-based using cron or another scheduler Logic is scattered around the system Developers need to take care of monitoring, alerts, permissions etc. Multiple Locations of Execution
  • 6. River Execution Management Full Execution History and Filtering Monitoring and Actionable Alerting Ops / NOC Automatic Retries Web UI Ease of Development Declarative Data Processing Definitions Decentralized Developers Shared Data, separate development JobLogs Data Driven Dependencies Why?
  • 7. Other Approaches A B C Option 1 Option 2 A B C t A J B C J
  • 8. Other Approaches Option 2 A J B C t
  • 9. Other Approaches D Fails D sends email Developer of D still works here Where is the code?
  • 10. Other Approaches 2am is a D= great hour for troubleshooting! Data from C is missing C= The data of C is all there!
  • 11. Other Approaches X:37 seems like a good time C never finished after X:30 anyway A B C t Job J has been working for more than a week before the incident D
  • 12. Other Approaches Need to rerun processes B, C and D Which hours failed? How to run all of them for the specific hours? Without running A again? Without colliding with ongoing executions?
  • 13. Other Approaches A will never take more than 15 minutes, so X:20 is more than enough A X:00 t J A WILL eventually take longer
  • 14. River Execution Management Full Execution History + Filtering and Searching Monitoring and Actionable Alerting Automatic Retries Web UI JobLogs Ease of Development Declarative Data Processing Definitions Decentralized Shared Data, separate development Data Driven Dependencies Why? Robustness Reliability Parallelism
  • 15. River What? When? Where? How?
  • 16. Execution Layer the What Every data processing task is called a Job A Job can contain multiple Steps Importing from MySQL to Hive Hive Queries JDBC Queries Transfer data from Hive into MySQL and to Cassandra Running External Commands: MapReduce, Java, bash, Legacy code, etc. Jobs use Parameters
  • 17. Scheduling Layer the When Each job registers to an event, which will trigger its execution Each job emits an event at job completion Events that describe Data Availability Events that are time dependent
  • 18. The How and the Where Both handled by the infrastructure Integration to other systems Connecting to Hive/Hadoop/Cassandra Logical names to all data sources Connecting to JDBC Databases readOnlyDataWarehouse Retries, throttling, timeouts productionCassandra Monitoring and Alerts Centralized Management, email notifications and dashboards Location of Execution Actual location is hidden from the developer/ops
  • 19. River UI FailDownload JobLog Job and Dependents Restart Job
  • 22. Steps Copy Data From JDBC to Hive sourceDB = productionDatabase sourceTable = myRawData targetCluster = onlineHadoopCluster targetHiveTable = rawDataTable Filter = date=#handledDate# Steps only contain what needs to be done
  • 23. A bit more about triggers Triggers have parameters as well Date=2012-10-10,hour=15 Date=2012-10-10,hour=19 Parameters Propagate through jobs and to other triggers
  • 24. Developers Point-of-View Automatic Retries Parameters Pass-through
  • 25. Trigger Queue Execution Queue River Trigger Execution Spring Manager Manager Batch Topology Spring Batch DB Hive/Hadoop OS Cassandra JDBC Interface Interface Inerface Interface External Systems
  • 27. Trigger Queue Execution Queue Date=2012-01-02 T1 T2 T3 Job1,Job2 Job3 hour=03 Date=2012-01-02 Date=2012-01-02 hour=03 hour=03 Job1 Job2 River T1 T3 T2 Job3 Job1,Job2 Trigger Execution Spring Manager Manager Batch Job1,Job2 Job3 Topology Spring Batch DB (from Job1) (from Job2) Hive/Hadoop OS Cassandra JDBC Interface Interface Inerface Interface External Systems Success Example
  • 28. UI Trigger Queue Execution Queue Job2 T3 Job3 Job2 Date=2012-01-02 Date=2012-01-02 hour=03 hour=03 Job2 Job2 River T3 Job2 Job3 Trigger Execution Spring Manager Manager Batch Job3 Topology Spring Batch DB Hive/Hadoop OS Cassandra JDBC Interface Interface Inerface Interface External Systems Failure Example
  • 29. Notable Features Parameter Enrichment Example: #beginningOfMonth Precondition Expressions Example: isLastDayOfMonth(#handleDate) Data Comparison Capabilities Data Validations Supports Tolerance Absolute and Percentage margins Command Line and Java Clients
  • 30. River at 6 River Instances Running 5 Teams ~4100 Jobs running every day ~50 Different Job Types Job Failures due to environment issues have almost no overhead Automatic restarts of jobs when data arrives late
  • 31. Illustration by Chris Whetzel Future Plans Multiple Dependencies Offline Job Testing Capabilities Improved DSL for Job Definitions Support for Master/Worker River machines Job Priorities Analysis Tools Outbrain is working on Open Sourcing River
  • 33. Thank You Harel Ben Attia @harelba on Twitter harel@outbrain.com http://www.linkedin.com/in/harelba

Editor's Notes