
River
A data workflow management system

Harel Ben Attia
Senior Software Engineer

– Tens of Billions of Recommendations per month
– Most major publishers in the World
– Hundreds of GBs of new data every day

Context
• Data Processing Workflows

• Multiple Types of Processing
– Rollups, Grouping, Filtering, Algorithm Calculations

• Multiple Stages of Processing
– Using the output of other processes as input

Problems
• Dependency “Management”
– Hardcoded into code/scripts
– Time-based using cron or another scheduler

• Logic is scattered around the system
– Developers need to take care of monitoring, alerts, permissions etc.
– Multiple Locations of Execution

River

Data Processing
Management
Infrastructure

River
• Execution Management
– Full Execution History and Filtering
– Monitoring and Actionable Alerting for Ops / NOC
– Automatic Retries
– Web UI

• Ease of Development
– Declarative Data Processing Definitions
– Decentralized Developers
• Shared Data, separate development
– JobLogs

• Data Driven Dependencies
– Why?

Other Approaches

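[Diagram: existing jobs A, B, C and a new job J. Option 1: A, B, C on a timeline t, with J placed separately. Option 2: A, J, B, C in sequence.]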

Other Approaches

Option 2

[Timeline t: A, J, B, C]

Other Approaches

D Fails
D sends email

Developer of D
still works here

Where is the code?

Other Approaches

D: “2am is a great hour for troubleshooting! Data from C is missing…”

C: “The data of C is all there!”

Other Approaches

“X:37 seems like a good time… C never finished after X:30 anyway”

[Timeline t: A, B, C, then D …]

Job J had been working for more than a week before the incident

Other Approaches

Need to rerun processes B, C and D

•Which hours failed?

•How to run all of them for the specific hours?
•Without running A again?
•Without colliding with ongoing executions?

Other Approaches
“A will never take more than 15 minutes, so X:20 is more than enough”

[Timeline t: A at X:00, followed by J]

A WILL eventually take longer

River
• Execution Management
– Full Execution History + Filtering and Searching
– Monitoring and Actionable Alerting
– Automatic Retries
– Web UI
– JobLogs

• Ease of Development
– Declarative Data Processing Definitions
– Decentralized
• Shared Data, separate development

• Data Driven Dependencies
– Why? Robustness, Reliability, Parallelism

River

What? When?

Where? How?

Execution Layer – the “What”
Every data processing task is called a Job

A Job can contain multiple Steps

• Importing from MySQL to Hive
• Hive Queries
• JDBC Queries
• Transfer data from Hive into MySQL and to Cassandra
• Running External Commands:
MapReduce, Java, bash, Legacy code, etc.

Jobs use Parameters

Scheduling Layer – the “When”
Each job registers to an event, which will trigger its execution

Each job emits an event at job completion

• Events that describe Data Availability
• Events that are time dependent
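The register/emit cycle described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not River's implementation: the `Scheduler` class, the `<job>.done` event naming and the `hourly_tick` event are all assumptions made for the example.

```python
from collections import deque


class Scheduler:
    """Toy event-driven scheduler: a job runs when the event it registered to fires."""

    def __init__(self):
        self.subscribers = {}  # event name -> list of (job name, function)
        self.history = []      # names of jobs, in the order they ran

    def register(self, event, job_name, func):
        self.subscribers.setdefault(event, []).append((job_name, func))

    def fire(self, event):
        queue = deque([event])
        while queue:
            current = queue.popleft()
            for job_name, func in self.subscribers.get(current, []):
                func()
                self.history.append(job_name)
                # Every job emits a completion event, which may trigger dependents.
                queue.append(f"{job_name}.done")


scheduler = Scheduler()
scheduler.register("hourly_tick", "A", lambda: None)  # time-dependent event
scheduler.register("A.done", "B", lambda: None)       # data-availability event
scheduler.register("B.done", "C", lambda: None)
scheduler.fire("hourly_tick")
print(scheduler.history)  # ['A', 'B', 'C']
```

B never needs to know when A is expected to finish; it starts only once A has actually completed.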

The “How” and the “Where”
Both handled by the infrastructure

• Integration to other systems
• Connecting to Hive/Hadoop/Cassandra
• Connecting to JDBC Databases
• Retries, throttling, timeouts
– Logical names to all data sources, e.g. “readOnlyDataWarehouse”, “productionCassandra”

• Monitoring and Alerts
– Centralized Management, email notifications and dashboards

• Location of Execution
– Actual location is hidden from the developer/ops
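A minimal sketch of the "logical names" idea, assuming a simple in-memory registry. Only the two logical names come from the slide; the `DATA_SOURCES` table, the `resolve` helper and the connection details are hypothetical placeholders.

```python
# Hypothetical registry owned by the infrastructure: logical name -> physical details.
DATA_SOURCES = {
    "readOnlyDataWarehouse": {
        "kind": "jdbc",
        "url": "jdbc:mysql://dwh.example.internal/warehouse",  # placeholder URL
    },
    "productionCassandra": {
        "kind": "cassandra",
        "hosts": ["cassandra1.example.internal"],  # placeholder host
    },
}


def resolve(logical_name):
    """Return the connection details behind a logical data source name."""
    try:
        return DATA_SOURCES[logical_name]
    except KeyError:
        raise KeyError(f"unknown data source: {logical_name!r}") from None


print(resolve("readOnlyDataWarehouse")["kind"])  # jdbc
```

Job definitions refer only to the logical name, so a data source can be moved by editing the registry rather than every job.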

River UI

[Screenshot: UI actions include Fail Job and Dependents, Download JobLog, Restart Job]

Steps

Copy Data From JDBC to Hive
sourceDB = “productionDatabase”
sourceTable = “myRawData”
targetCluster = “onlineHadoopCluster”
targetHiveTable = “rawDataTable”
Filter = “date=#handledDate#”

Steps only contain what needs to be done
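The step above can be read as plain data plus a parameter placeholder. The following sketch assumes that `#name#` marks a parameter to be filled in at run time; the dictionary layout and the `render` helper are illustrative and are not River's actual definition format.

```python
import re

# The step from the slide, written as plain data (layout is an assumption).
step = {
    "sourceDB": "productionDatabase",
    "sourceTable": "myRawData",
    "targetCluster": "onlineHadoopCluster",
    "targetHiveTable": "rawDataTable",
    "filter": "date=#handledDate#",
}

PLACEHOLDER = re.compile(r"#(\w+)#")


def render(step, params):
    """Return a copy of the step with every #name# replaced by params[name]."""
    def substitute(match):
        return str(params[match.group(1)])  # KeyError if a parameter is missing

    return {key: PLACEHOLDER.sub(substitute, value) for key, value in step.items()}


rendered = render(step, {"handledDate": "2012-10-10"})
print(rendered["filter"])  # date=2012-10-10
```

The definition says what to copy and for which date; connection handling, retries and the place of execution stay outside it.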

A bit more about triggers
Triggers have parameters as well

Date=2012-10-10,hour=15 Date=2012-10-10,hour=19

Parameters Propagate through jobs and to other triggers
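Parameter propagation can be pictured as each job copying the parameters of the trigger that started it onto the trigger it emits. The sketch below is an assumption about the mechanics: `run_job`, the `event` key and the event names are invented for the example.

```python
def run_job(job_name, trigger):
    """Pretend to run a job, then return the trigger it emits on completion."""
    params = {key: value for key, value in trigger.items() if key != "event"}
    # The job's real work would use params here, e.g. as a partition filter.
    emitted = {"event": f"{job_name}.done"}
    emitted.update(params)  # parameters pass through unchanged
    return emitted


first = {"event": "rawDataArrived", "Date": "2012-10-10", "hour": "15"}
after_a = run_job("A", first)
after_b = run_job("B", after_a)
print(after_b)  # {'event': 'B.done', 'Date': '2012-10-10', 'hour': '15'}
```

Because the date and hour travel with the trigger, a rerun for one specific hour drives the whole downstream chain for exactly that hour.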

Developer’s Point-of-View

Automatic Retries
Parameters Pass-through

[Architecture diagram: a Trigger Queue and an Execution Queue feed River, which contains the Trigger Manager, Execution Manager, Topology, Spring Batch and the Spring Batch DB. River reaches External Systems through four interfaces: Hive/Hadoop, OS, Cassandra and JDBC.]

Dependencies
for detailed example

Success Example

[Diagram: triggers T1, T2 and T3, carrying Date=2012-01-02, hour=03, move through the Trigger Queue into River; Job1, Job2 and Job3 move through the Execution Queue and run against External Systems. The diagram also carries the labels “(from Job1)” and “(from Job2)”.]

Failure Example

[Diagram: trigger T3, carrying Date=2012-01-02, hour=03, in the Trigger Queue; Job2 and Job3 in the Execution Queue; River, the UI and External Systems.]

Notable Features
• Parameter Enrichment
– Example: #beginningOfMonth

• Precondition Expressions
– Example: isLastDayOfMonth(#handleDate)

• Data Comparison Capabilities
– Data Validations
– Supports Tolerance
• Absolute and Percentage margins
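A data validation with tolerance might look like the sketch below. The function name, its signature and the rule that a value passes if it falls within either margin are assumptions; the slides only state that absolute and percentage margins are supported.

```python
def within_tolerance(expected, actual, absolute=0.0, percentage=0.0):
    """Return True if actual is close enough to expected.

    Assumption: the check passes when the difference is within EITHER the
    absolute margin or the percentage margin (percentage is of `expected`).
    """
    difference = abs(expected - actual)
    allowed = max(absolute, abs(expected) * percentage / 100.0)
    return difference <= allowed


# Example: compare row counts between a source table and its copy.
print(within_tolerance(1000, 990, percentage=1))  # True
print(within_tolerance(1000, 980, percentage=1))  # False
print(within_tolerance(1000, 995, absolute=5))    # True
```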

• Command Line and Java Clients
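Parameter enrichment and precondition expressions from the list above can be sketched with the standard library. The function names mirror the slide's examples, but how River evaluates them is not shown in the slides, so the `enrich` helper and the `handledDate` parameter format (ISO date string) are assumptions.

```python
import calendar
from datetime import date


def beginning_of_month(day):
    """Enrichment: derive the first day of the month from a date."""
    return day.replace(day=1)


def is_last_day_of_month(day):
    """Precondition: True only on the last day of the month."""
    return day.day == calendar.monthrange(day.year, day.month)[1]


def enrich(params):
    """Return a copy of params with a derived beginningOfMonth value added."""
    handled = date.fromisoformat(params["handledDate"])
    enriched = dict(params)
    enriched["beginningOfMonth"] = beginning_of_month(handled).isoformat()
    return enriched


print(enrich({"handledDate": "2012-10-10"})["beginningOfMonth"])  # 2012-10-01
print(is_last_day_of_month(date(2012, 2, 29)))                    # True
```

A monthly rollup job could use such a precondition to run only on the last day of each month.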

River at Outbrain
• 6 River Instances Running
• 5 Teams
• ~4100 Jobs running every day
• ~50 Different Job Types

• Job Failures due to environment issues have almost no overhead
• Automatic restarts of jobs when data arrives late

Illustration by Chris Whetzel

Future Plans
• Multiple Dependencies
• Offline Job Testing Capabilities
• Improved DSL for Job Definitions
• Support for Master/Worker River machines
• Job Priorities
• Analysis Tools

Outbrain is working on Open Sourcing River

Thank You

Harel Ben Attia
harel@outbrain.com
@harelba on Twitter
http://www.linkedin.com/in/harelba

Outbrain River Presentation at Reversim Summit 2013
