Oozie Evolution
Gateway to Hadoop Eco-System


             Mohammad Islam
Agenda

• What is Oozie?
• What is in the Next Release?
• Challenges
• Future Work
• Q&A
Oozie in Hadoop Eco-System

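[Figure: stack diagram of the Hadoop ecosystem. HDFS is at the bottom, Map-Reduce sits above it, then Pig, Sqoop, Hive, and HCatalog, with Oozie across the top and along the side.]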
Oozie : The Conductor
A Workflow Engine
• Oozie executes workflows defined as a DAG of jobs.
• Job types include Map-Reduce, Pig, Hive, any script, custom Java code, etc.

[Figure: example workflow DAG with the nodes start, M/R job, fork, M/R streaming job, Pig job, join, decision (branches labeled MORE and ENOUGH), M/R job, Java, FS job, and end.]
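
The idea of a workflow as a DAG of jobs can be sketched in a few lines of Python. This is only an illustration, not Oozie's workflow format or engine: the node names are hypothetical and loosely follow the figure, the decision node is left out, and the standard-library `graphlib` module stands in for the real scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each node maps to the set of nodes it depends on.
# The names are illustrative only and are not Oozie identifiers.
workflow = {
    "mr_job": {"start"},
    "fork": {"mr_job"},
    "streaming_job": {"fork"},   # the two branches after the fork
    "pig_job": {"fork"},         # may run in parallel
    "join": {"streaming_job", "pig_job"},
    "end": {"join"},
}


def execution_order(dag):
    """Return one valid order in which the nodes could be run."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

Any order returned this way runs a job only after everything it depends on has finished, which is the property that the fork and join nodes in the figure express.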
A Scheduler
• Oozie executes workflows based on:
   o Time dependency (frequency)
   o Data dependency

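[Figure: an Oozie Client reaches the Oozie Server through the WS API. Inside the server are the Oozie Coordinator, labeled "Check Data Availability", and the Oozie Workflow, which connects to Hadoop.]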
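
A minimal Python sketch of the two triggers, time and data, might look like the following. It is not the Coordinator's actual logic: the hourly frequency, the `/data/logs/...` path layout, and the function names are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def due_times(start, frequency, now):
    """Yield the nominal run times from start up to now (time dependency)."""
    t = start
    while t <= now:
        yield t
        t += frequency


def required_paths(nominal_time):
    """Hypothetical input layout: one directory per hour."""
    return [f"/data/logs/{nominal_time:%Y%m%d%H}"]


def ready_to_run(nominal_time, available_paths):
    """A run is ready once every input it needs exists (data dependency)."""
    return all(p in available_paths for p in required_paths(nominal_time))


start = datetime(2011, 11, 1, 0)
now = datetime(2011, 11, 1, 2, 30)
available = {"/data/logs/2011110100", "/data/logs/2011110101"}

# One entry per nominal time: (time, whether its input data is available).
runs = [(t, ready_to_run(t, available))
        for t in due_times(start, timedelta(hours=1), now)]
```

In this sketch the 00:00 and 01:00 runs are ready, while the 02:00 run is due by time but still waiting for its data.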
REST-API for Hadoop Components

• Direct access to Hadoop components
   o Emulates the command line through the REST API.
• Supported products:
   o Pig
   o Map Reduce
Three Questions 
 Do you need Oozie?


Q1: Do you have multiple jobs with dependencies?
Q2: Does your job start based on time or data availability?
Q3: Do you need monitoring and operational support for your jobs?

If any one of your answers is YES, then you should consider Oozie!
What Oozie is NOT

• Oozie is not a resource scheduler.

• Oozie is not for off-grid scheduling.
   o Note: Off-grid execution is possible through the SSH action.

• If you want to submit your job occasionally, Oozie is an option.
   o Oozie provides REST API based submission.
Oozie in Apache
Main Contributors
Oozie in Apache

• Y! internal usage:
   o Total number of users: 375
   o Total number of processed jobs: 750K/month
• External downloads:
   o 2500+ in the last year from GitHub
   o A large number of downloads through packages maintained by 3rd parties
Oozie Usages Contd.

• User community:
   o Membership
      • Y! internal: 286
      • External: 163
   o Messages (approximate)
      • Y! internal: 7/day
      • External: 8/day
Next Release 

• Integration with Hadoop 0.23

• HCatalog integration
   o Non-polling approach
Usability

• Script action
• Distcp action
• Suspend action
• Mini-Oozie for CI
   o Like Mini-cluster
• Support for multiple versions
   o Pig, Distcp, Hive, etc.
Reliability

• Auto-retry at the WF action level

• High availability
   o Hot-Warm through ZooKeeper
Manageability

• Email action

• Query Pig stats/Hadoop counters
   o Runtime control of the workflow based on stats
   o Application-level control using the stats
Challenges : Queue Starvation

• Which queue?
   o Not a Hadoop queue issue.
   o The Oozie internal queue that processes the Oozie sub-tasks.
   o Oozie's main execution engine.
• User problem:
   o Job kill/suspend takes a very long time.
Challenges : Queue Starvation
Technical Problem:
• Before execution, every task acquires a lock on the job id.
• Special high-priority tasks (such as Kill or Suspend) couldn't get the lock and therefore starve.

[Figure: "In Queue" shows J1, J1, J2, J1(H), J2, with J1 and J2 shown to the right of the queue. Caption: "Starvation for High Priority Task!"]
Challenges : Queue Starvation
Resolution:
• Add the high-priority task to both the interrupt list and the normal queue.
• Before de-queuing, check whether the interrupt list holds a task for the same job id. If it does, execute that task first.

[Figure: "In Queue" shows J1, J1, J2, J1(H), J2, and "In Interrupt List" shows J1(H). Annotation: "finds a task in interrupt queue".]
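
The resolution above can be sketched in Python as follows. This is an illustration under stated assumptions, not Oozie's implementation: the class and method names are hypothetical, locking is omitted, and task names are assumed to be unique within a job.

```python
from collections import deque


class TaskQueue:
    """Hypothetical queue with an interrupt list, as described on the slide."""

    def __init__(self):
        self.queue = deque()   # normal FIFO of (job_id, task_name)
        self.interrupts = {}   # job_id -> deque of high-priority task names
        self.done = set()      # tasks already run via the interrupt list

    def add(self, job_id, name, high_priority=False):
        # A high-priority task goes into BOTH the normal queue and the
        # interrupt list.
        self.queue.append((job_id, name))
        if high_priority:
            self.interrupts.setdefault(job_id, deque()).append(name)

    def poll(self):
        """Return the next (job_id, task_name) to execute, or None if empty."""
        while self.queue:
            job_id, name = self.queue[0]
            if (job_id, name) in self.done:
                self.queue.popleft()   # its interrupt-list copy already ran
                continue
            pending = self.interrupts.get(job_id)
            if pending:
                # An interrupt exists for the same job id: run it first and
                # leave the regular task at the head of the queue.
                urgent = pending.popleft()
                self.done.add((job_id, urgent))
                return (job_id, urgent)
            self.queue.popleft()
            return (job_id, name)
        return None


q = TaskQueue()
q.add("J1", "a")
q.add("J1", "b")
q.add("J2", "c")
q.add("J1", "kill", high_priority=True)
q.add("J2", "d")

order = []
while (task := q.poll()) is not None:
    order.append(task)
```

With the queue from the figure, the kill for J1 runs before the two regular J1 tasks that were queued ahead of it. In a real engine those remaining J1 tasks would presumably become no-ops after the kill; the sketch does not model that.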
Oozie Futures

• Easy adoption
   o Modeling tool
   o IDE integration
   o Modular configurations
• Allow job notification through JMS
• Event-based data processing
• Prioritization
   o By user, system level.
Take Away ..

• Oozie is
   o In Apache!
   o Reliable and feature-rich.
   o Growing fast.
Q&A




                  Mohammad K Islam
               kamrul@yahoo-inc.com
      http://incubator.apache.org/oozie/
Who needs Oozie?

• Multiple jobs that have sequential/conditional/parallel dependencies.
• Need to run a job/workflow periodically.
• Need to launch a job when data is available.
• Operational requirements:
   o Easy monitoring
   o Reprocessing
   o Catch-up
Challenges : Queue Starvation
Problem:
• Consider a queue with tasks of type T1 and T2. Max concurrency = 2.
• An over-provisioned task (marked in red) is pushed back to the queue.
• At high load, it is penalized in favor of later-arriving tasks of the same type.

[Figure: "In Queue" holds T1, T2, T1, T1, T1, T2, T1. Under "Running", the counters step through C(T1): 0, 1, 2 and C(T2): 0, 1. Caption: "Starvation! T1 cannot execute and is pushed to head of queue."]
Challenges : Queue Starvation
Resolution:
• Before de-queuing any task, check its concurrency.
• If the limit is violated, skip the task and get the next one.

[Figure: "In Queue" holds T1, T2, T1, T1, T1, T2, T1. Under "Running", the counters step through C(T1): 0, 1, 2 and C(T2): 0, 1, 2. Captions: "T1 cannot execute, so skip by one node to front", "Enqueue T2 now", "T1 now executes normally".]
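
A minimal Python sketch of this concurrency check is shown below. It is not Oozie's code: the function name, the use of a `Counter` for the running counts, and the sample queue are assumptions for illustration.

```python
from collections import Counter, deque


def poll(queue, running, max_concurrency):
    """Remove and return the first queued task type that is below its
    concurrency limit. Tasks at the limit are skipped and keep their place
    in the queue. Returns None if nothing can run."""
    for i in range(len(queue)):
        task_type = queue[i]
        if running[task_type] < max_concurrency:
            del queue[i]
            running[task_type] += 1
            return task_type
    return None


# State loosely following the slide: two T1 tasks and one T2 task running.
queue = deque(["T1", "T1", "T2", "T1"])
running = Counter({"T1": 2, "T2": 1})

first = poll(queue, running, max_concurrency=2)    # T1 is at its limit, so skip to T2
second = poll(queue, running, max_concurrency=2)   # every queued type is at its limit

running["T1"] -= 1                                 # one T1 task finishes
third = poll(queue, running, max_concurrency=2)    # the oldest T1 now runs
```

Because skipped tasks are never moved, the oldest T1 is still first in line when a slot frees up, so it is no longer overtaken by later arrivals of the same type.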
