River is a data workflow management system that handles multiple types of data processing workflows across multiple stages. It addresses problems with existing approaches that have hardcoded dependencies and logic scattered across systems. River provides execution management through a full execution history, monitoring, alerts, retries and a web UI. It also eases development through declarative data processing definitions and decentralized shared data. River uses data-driven dependencies to improve robustness, reliability and parallelism.
2. Tens of Billions of Recommendations per month
Most major publishers in the World
Hundreds of GBs of new data every day
3. Context
Data Processing Workflows
Multiple Types of Processing
Rollups, Grouping, Filtering, Algorithm Calculations
Multiple Stages of Processing
Using the output of other processes as input
4. Problems
Dependency Management
Hardcoded into code/scripts
Time-based using cron or another scheduler
Logic is scattered around the system
Developers need to take care of monitoring, alerts, permissions etc.
Multiple Locations of Execution
6. River
Execution Management
Full Execution History and Filtering
Monitoring and Actionable Alerting for Ops / NOC
Automatic Retries
Web UI
Ease of Development
Declarative Data Processing Definitions
Decentralized Developers
Shared Data, separate development
JobLogs
Data Driven Dependencies
Why?
9. Other Approaches
D Fails
D sends email
Developer of D still works here
Where is the code?
10. Other Approaches
D: 2am is a great hour for troubleshooting! Data from C is missing
C: The data of C is all there!
11. Other Approaches
[Timeline diagram: jobs A, B, C and D along the time axis t]
X:37 seems like a good time; C never finished after X:30 anyway
Job J has been working for more than a week before the incident
12. Other Approaches
Need to rerun processes B, C and D
Which hours failed?
How to run all of them for the specific hours?
Without running A again?
Without colliding with ongoing executions?
13. Other Approaches
A will never take more than 15 minutes, so X:20 is more than enough
[Timeline diagram: job A starting at X:00 and job J along the time axis t]
A WILL eventually take longer
14. River
Execution Management
Full Execution History + Filtering and Searching
Monitoring and Actionable Alerting
Automatic Retries
Web UI
JobLogs
Ease of Development
Declarative Data Processing Definitions
Decentralized
Shared Data, separate development
Data Driven Dependencies
Why?
Robustness Reliability Parallelism
16. Execution Layer: the What
Every data processing task is called a Job
A Job can contain multiple Steps
Importing from MySQL to Hive
Hive Queries
JDBC Queries
Transfer data from Hive into MySQL and to Cassandra
Running External Commands:
MapReduce, Java, bash, Legacy code, etc.
Jobs use Parameters
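The Job/Step structure described on this slide can be sketched as follows. This is only an illustration under stated assumptions: the `Step` and `Job` classes, their fields and the `execute` method are hypothetical names, not River's actual API, and the step bodies stand in for real work such as a Hive query or an import from MySQL.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, str]


@dataclass
class Step:
    """One unit of work inside a job (hypothetical type, not River's API)."""
    name: str
    run: Callable[[Params], None]


@dataclass
class Job:
    """A data processing task made of ordered steps (hypothetical type)."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def execute(self, params: Params) -> List[str]:
        """Run every step in order with the same parameters.

        Returns the names of the completed steps. An exception raised by a
        step stops the job, so later steps do not run on missing input.
        """
        completed: List[str] = []
        for step in self.steps:
            step.run(params)
            completed.append(step.name)
        return completed
```

A job built this way receives its parameters once and hands them unchanged to each step, which matches the slide's point that jobs use parameters.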
17. Scheduling Layer: the When
Each job registers to an event, which will trigger its execution
Each job emits an event at job completion
Events that describe Data Availability
Events that are time dependent
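The event-driven scheduling above can be sketched as a small in-memory dispatcher. This is a minimal sketch, not River's implementation: the `EventScheduler` class, the event names and the assumption that the job graph is acyclic are all hypothetical.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, List

Params = Dict[str, str]


class EventScheduler:
    """Hypothetical dispatcher: jobs listen to one event and emit one event."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[str]] = defaultdict(list)
        self._emits: Dict[str, str] = {}
        self._actions: Dict[str, Callable[[Params], None]] = {}

    def register(self, job: str, on_event: str, emits: str,
                 action: Callable[[Params], None]) -> None:
        """Register a job to run on `on_event` and emit `emits` when done."""
        self._listeners[on_event].append(job)
        self._emits[job] = emits
        self._actions[job] = action

    def fire(self, event: str, params: Params) -> List[str]:
        """Fire an event and run all jobs it triggers, directly or indirectly.

        Assumes the dependency graph has no cycles. Returns the job names in
        execution order.
        """
        pending: Deque[str] = deque([event])
        executed: List[str] = []
        while pending:
            current = pending.popleft()
            for job in self._listeners.get(current, []):
                self._actions[job](params)
                executed.append(job)
                # Completion event triggers downstream jobs.
                pending.append(self._emits[job])
        return executed
```

Because each job depends on an event rather than on a clock time, a downstream job runs only after its input exists, however long the upstream job took.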
18. The How and the Where
Both handled by the infrastructure
Integration to other systems
Connecting to Hive/Hadoop/Cassandra
Connecting to JDBC Databases
Logical names to all data sources, e.g. readOnlyDataWarehouse, productionCassandra
Retries, throttling, timeouts
Monitoring and Alerts: Centralized Management, email notifications and dashboards
Location of Execution: Actual location is hidden from the developer/ops
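Two of the infrastructure concerns listed above, logical data source names and automatic retries, can be illustrated with a short sketch. Everything here is an assumption for illustration: the `DATA_SOURCES` mapping, its connection strings, and the `resolve_source` and `run_with_retries` helpers are hypothetical and do not reflect how River is implemented.

```python
import time
from typing import Callable, Dict, TypeVar

T = TypeVar("T")

# Hypothetical mapping from logical names to physical locations.
# The connection strings are placeholders, not real endpoints.
DATA_SOURCES: Dict[str, str] = {
    "readOnlyDataWarehouse": "jdbc:mysql://warehouse.example.invalid/dw",
    "productionCassandra": "cassandra://cassandra.example.invalid:9042",
}


def resolve_source(logical_name: str) -> str:
    """Translate a logical name into a physical location."""
    try:
        return DATA_SOURCES[logical_name]
    except KeyError:
        raise KeyError(f"unknown data source: {logical_name}") from None


def run_with_retries(action: Callable[[], T], attempts: int = 3,
                     delay_seconds: float = 0.0) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Job definitions then refer only to names such as `readOnlyDataWarehouse`, so moving a database means changing one mapping entry rather than every job.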
19. River UI
[UI screenshot; visible labels: Fail, Download JobLog, Job and Dependents, Restart Job]
22. Steps
Copy Data From JDBC to Hive
sourceDB = productionDatabase
sourceTable = myRawData
targetCluster = onlineHadoopCluster
targetHiveTable = rawDataTable
Filter = date=#handledDate#
Steps only contain what needs to be done
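The step definition above can be represented as plain data whose `#name#` placeholders are filled in at run time. This is a sketch under assumptions: the dictionary form and the `resolve_step` helper are hypothetical, and the placeholder syntax is inferred only from the `#handledDate#` example on this slide.

```python
import re
from typing import Dict

# The step from the slide, written as plain data (hypothetical representation).
COPY_STEP: Dict[str, str] = {
    "sourceDB": "productionDatabase",
    "sourceTable": "myRawData",
    "targetCluster": "onlineHadoopCluster",
    "targetHiveTable": "rawDataTable",
    "filter": "date=#handledDate#",
}

_PLACEHOLDER = re.compile(r"#(\w+)#")


def resolve_step(step: Dict[str, str], params: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `step` with every #name# replaced by params[name].

    Raises KeyError when a placeholder has no matching parameter, so a step
    never runs with an unresolved filter.
    """
    def substitute(match: "re.Match[str]") -> str:
        return params[match.group(1)]

    return {key: _PLACEHOLDER.sub(substitute, value) for key, value in step.items()}
```

For example, `resolve_step(COPY_STEP, {"handledDate": "2012-10-10"})` yields a filter of `date=2012-10-10` while leaving the other fields unchanged.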
23. A bit more about triggers
Triggers have parameters as well
Date=2012-10-10,hour=15
Date=2012-10-10,hour=19
Parameters Propagate through jobs and to other triggers
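Trigger parameters in the `Date=2012-10-10,hour=15` form can be parsed and passed along with a few lines of code. The two helper functions below are hypothetical names chosen for illustration; the slides do not show how River parses or serializes trigger parameters.

```python
from typing import Dict


def parse_trigger_params(text: str) -> Dict[str, str]:
    """Parse 'key=value,key=value' into a dictionary of strings."""
    params: Dict[str, str] = {}
    for pair in text.split(","):
        key, separator, value = pair.partition("=")
        if not separator or not key.strip():
            raise ValueError(f"malformed parameter: {pair!r}")
        params[key.strip()] = value.strip()
    return params


def format_trigger_params(params: Dict[str, str]) -> str:
    """Serialize parameters back into the 'key=value,key=value' form."""
    return ",".join(f"{key}={value}" for key, value in params.items())
```

A job that finishes for `Date=2012-10-10,hour=15` can hand the same dictionary to the trigger it emits, so downstream jobs process the same date and hour.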
25. [Architecture diagram] River components: Trigger Queue, Execution Queue, Trigger Manager, Execution Manager, Spring Batch, Topology, Spring Batch DB. Interfaces to External Systems: Hive/Hadoop, OS, Cassandra, JDBC.
27. Success Example [architecture diagram]: triggers T1, T2 and T3 (Date=2012-01-02, hour=03) move through the Trigger Queue; Job1, Job2 and Job3 move through the Execution Queue and run via Spring Batch.
28. Failure Example [architecture diagram with the UI]: trigger T3 and jobs Job2 and Job3 (Date=2012-01-02, hour=03) in the Trigger Queue and Execution Queue.
29. Notable Features
Parameter Enrichment
Example: #beginningOfMonth
Precondition Expressions
Example: isLastDayOfMonth(#handleDate)
Data Comparison Capabilities
Data Validations
Supports Tolerance
Absolute and Percentage margins
Command Line and Java Clients
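The features on this slide can be illustrated with small helper functions. The function names mirror the slide's `#beginningOfMonth` and `isLastDayOfMonth` examples, but their bodies are assumptions, as is the rule that a comparison passes when the difference falls within either the absolute or the percentage margin.

```python
import calendar
from datetime import date


def beginning_of_month(handled_date: date) -> date:
    """Parameter enrichment: derive the first day of the handled month."""
    return handled_date.replace(day=1)


def is_last_day_of_month(handled_date: date) -> bool:
    """Precondition: true only on the last calendar day of the month."""
    last_day = calendar.monthrange(handled_date.year, handled_date.month)[1]
    return handled_date.day == last_day


def within_tolerance(expected: float, actual: float,
                     absolute: float = 0.0, percentage: float = 0.0) -> bool:
    """Data comparison with tolerance.

    Assumption: the check passes if the difference is within the absolute
    margin OR within `percentage` percent of the expected value.
    """
    difference = abs(expected - actual)
    return difference <= absolute or difference <= abs(expected) * percentage / 100.0
```

With these helpers, a monthly job could be guarded by `is_last_day_of_month`, and a row-count validation between two stores could accept small differences instead of failing on every mismatch.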
30. River at Outbrain
6 River Instances Running
5 Teams
~4100 Jobs running every day
~50 Different Job Types
Job Failures due to environment issues have almost no overhead
Automatic restarts of jobs when data arrives late
31. Future Plans (illustration by Chris Whetzel)
Multiple Dependencies
Offline Job Testing Capabilities
Improved DSL for Job Definitions
Support for Master/Worker River machines
Job Priorities
Analysis Tools
Outbrain is working on Open Sourcing River