This document discusses Oozie, a workflow scheduler system for Hadoop. It describes Oozie's role in coordinating and scheduling Hadoop jobs such as Pig, Hive, and MapReduce. The next release of Oozie will integrate better with Hadoop 0.23 and HCatalog, and add new features like script and Distcp actions. Challenges include addressing queue starvation when suspending or killing jobs. The future of Oozie includes easier adoption, job notifications through JMS, and event-based data processing.
5. A Workflow Engine
• Oozie executes workflows defined as a DAG (directed acyclic graph) of jobs
• Supported job types include Map-Reduce, Pig, Hive, any script, custom Java code, etc. (a workflow sketch follows the diagram below)
[Diagram: example workflow DAG with start and end nodes, a fork/join running an M/R streaming job and a Pig job in parallel, a MORE/ENOUGH decision node, and M/R, FS, and Java job nodes]
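To make the DAG idea concrete, here is a minimal Oozie workflow definition showing a fork/join, a decision, and a few job types. This is a sketch only, not the workflow from the slides: the names, scripts, paths, and parameters (demo-wf, aggregate.pig, ${inputDir}, ${outputDir}, cleanupNeeded) are illustrative assumptions.

<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="split"/>

    <!-- fork: run a streaming M/R job and a Pig job in parallel -->
    <fork name="split">
        <path start="mr-streaming"/>
        <path start="pig-job"/>
    </fork>

    <action name="mr-streaming">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <streaming>
                <mapper>/bin/cat</mapper>
                <reducer>/usr/bin/wc</reducer>
            </streaming>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}/streaming</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <action name="pig-job">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>aggregate.pig</script>
        </pig>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <join name="merge" to="check"/>

    <!-- decision: run an extra FS cleanup pass ("MORE") or go straight to the end ("ENOUGH") -->
    <decision name="check">
        <switch>
            <case to="cleanup">${wf:conf('cleanupNeeded') eq 'true'}</case>
            <default to="end"/>
        </switch>
    </decision>

    <!-- FS action: remove intermediate output before finishing -->
    <action name="cleanup">
        <fs>
            <delete path="${outputDir}/tmp"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>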
6. A Scheduler
• Oozie executes workflows based on:
o Time dependency (frequency)
o Data dependency (see the coordinator sketch after the diagram below)
[Diagram: the Oozie client talks to the Oozie server through the WS API; inside the server, the Oozie Coordinator checks data availability and triggers the Oozie Workflow engine, which runs jobs on Hadoop]
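Both dependencies come together in a coordinator definition: the frequency expresses the time dependency, and the dataset/input-event expresses the data dependency. A minimal sketch, assuming illustrative names, dates, and HDFS paths (daily-agg, daily-logs) that are not from the slides:

<coordinator-app name="daily-agg" frequency="${coord:days(1)}"
                 start="2012-01-01T00:00Z" end="2012-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">

    <!-- data dependency: one directory of logs per day -->
    <datasets>
        <dataset name="daily-logs" frequency="${coord:days(1)}"
                 initial-instance="2012-01-01T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>

    <input-events>
        <data-in name="input" dataset="daily-logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>

    <!-- when both the time and the data are ready, start the workflow -->
    <action>
        <workflow>
            <app-path>${nameNode}/user/me/apps/daily-agg</app-path>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>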
7. REST-API for Hadoop Components
• Direct access to Hadoop components
o Emulates the command line through a REST API.
• Supported products:
o Pig
o Map-Reduce
8. Three Questions
Do you need Oozie?
Q1: Do you have multiple jobs with dependencies?
Q2: Do your jobs start based on time or data availability?
Q3: Do you need monitoring and operational support for your jobs?
If any one of your answers is YES, then you should consider Oozie!
9. What Oozie is NOT
• Oozie is not a resource scheduler
• Oozie is not for off-grid scheduling
o Note: Off-grid execution is possible through the SSH action.
• If you only want to submit a job occasionally, Oozie is still an option.
o Oozie provides REST API-based submission.
11. Oozie in Apache
• Y! internal usage:
o Total number of users: 375
o Total number of processed jobs: 750K/month
• External downloads:
o 2500+ in the last year from GitHub
o Many more downloads come through 3rd-party packaging.
13. Next Release
• Integration with Hadoop 0.23
• HCatalog integration
o Non-polling approach
14. Usability
• Script action
• Distcp action (see the sketch after this list)
• Suspend action
• Mini-Oozie for CI
o Similar to Mini-cluster
• Support for multiple versions
o Pig, Distcp, Hive, etc.
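To show what one of these action types looks like inside a workflow, here is a hedged sketch of a Distcp action. The paths, the ${backupNameNode} parameter, and the transition targets are illustrative assumptions, not taken from the slides.

<action name="copy-raw-data">
    <distcp xmlns="uri:oozie:distcp-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <!-- distcp arguments: source and destination -->
        <arg>${nameNode}/data/raw/2012/06/01</arg>
        <arg>${backupNameNode}/backup/raw/2012/06/01</arg>
    </distcp>
    <ok to="next-step"/>
    <error to="fail"/>
</action>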
16. Manageability
• Email action (see the sketch after this list)
• Query Pig stats/Hadoop counters
o Runtime control of the workflow based on the stats
o Application-level control using the stats
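As one concrete manageability example, a sketch of the email action; the recipient, subject, body, and transition targets are illustrative assumptions.

<action name="notify-ops">
    <email xmlns="uri:oozie:email-action:0.1">
        <to>ops-team@example.com</to>
        <subject>Oozie workflow ${wf:id()} completed the load step</subject>
        <body>Workflow ${wf:name()} finished the load step on ${nameNode}.</body>
    </email>
    <ok to="end"/>
    <error to="fail"/>
</action>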
17. Challenges : Queue Starvation
• Which queue?
o Not a Hadoop queue issue.
o The Oozie-internal queue that processes Oozie sub-tasks.
o Oozie's main execution engine.
• User problem:
o Killing or suspending jobs takes a very long time.
18. Challenges : Queue Starvation
Technical problem:
• Before execution, every task acquires a lock on its job id.
• Special high-priority tasks (such as Kill or Suspend) cannot get the lock and therefore starve.
[Diagram: the queue holds J1, J1, J2, J1(H), J2, J1; the high-priority task J1(H) is stuck behind normal tasks for the same job, i.e. starvation for the high-priority task]
19. Challenges : Queue Starvation
Resolution:
• Add the high-priority task to both the interrupt list and the normal queue.
• Before de-queuing, check whether the interrupt list holds any task for the same job id. If there is one, execute that first.
[Diagram: the normal queue still holds J1, J1, J2, J1(H), J2, J1, while the interrupt list holds J1(H); the de-queue step finds the task in the interrupt list and runs it first]
20. Oozie Futures
• Easy adoption
o Modeling tool
o IDE integration
o Modular configurations
• Allow job notification through JMS
• Event-based data processing
• Prioritization
o By user and at the system level.
21. Take Away ..
• Oozie is:
o In Apache!
o Reliable and feature-rich.
o Growing fast.
22. Q&A
Mohammad K Islam
kamrul@yahoo-inc.com
http://incubator.apache.org/oozie/
23. Who needs Oozie?
• Multiple jobs that have sequential/conditional/parallel dependencies
• Need to run a job/workflow periodically.
• Need to launch a job when data is available.
• Operational requirements:
o Easy monitoring
o Reprocessing
o Catch-up
24. Challenges : Queue Starvation
Problem:
• Consider a queue with tasks of types T1 and T2, and a maximum concurrency of 2 per type.
• An over-provisioned task (one that would exceed its type's concurrency limit) is pushed back to the queue.
• At high load, it is penalized in favor of later-arriving tasks of the same type.
[Diagram: a queue of T1 and T2 tasks with running-concurrency counters C(T1) and C(T2); once C(T1) reaches 2, the T1 task at the front cannot execute and is repeatedly pushed back to the head of the queue, i.e. starvation]
25. Challenges : Queue Starvation
Resolution:
• Before de-queuing any task, check its concurrency.
• If the limit would be violated, skip it and take the next task.
[Diagram: the same queue of T1 and T2 tasks; the front T1 cannot execute, so the de-queue skips it by one and runs the next task normally; the skipped T1 executes later]