High Performance Distributed Systems with CQRS - Jonathan Oliver
This document discusses the architectural pattern of Command Query Responsibility Segregation (CQRS). CQRS separates read (query) and write (command) operations into different models to improve scalability and performance. Queries use a read-only data store optimized for reading, while commands express user intentions and are validated before being asynchronously processed to update data. The pattern accepts eventual consistency by allowing query data to be slightly stale, and improves scalability by letting queries and commands be optimized separately.
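The command/query split described above can be sketched in a few lines. This is a minimal illustration, not Oliver's implementation: all class and field names (`AccountWriteModel`, `AccountReadModel`, the event dicts) are hypothetical, and the asynchronous event delivery is collapsed into an explicit `apply` loop.

```python
# Minimal CQRS sketch (hypothetical names): commands mutate a write model
# and emit events; a separate projection builds the read-optimized view.

class AccountWriteModel:
    def __init__(self):
        self.events = []            # append-only record of accepted commands

    def handle_deposit(self, account_id, amount):
        # Commands are validated before they are accepted.
        if amount <= 0:
            raise ValueError("deposit must be positive")
        event = {"type": "deposited", "account": account_id, "amount": amount}
        self.events.append(event)
        return event

class AccountReadModel:
    """Read-only view, updated from events; may lag behind the write side."""
    def __init__(self):
        self.balances = {}

    def apply(self, event):
        if event["type"] == "deposited":
            acct = event["account"]
            self.balances[acct] = self.balances.get(acct, 0) + event["amount"]

write_side = AccountWriteModel()
read_side = AccountReadModel()

# Command path: validate and record the user's intent.
e1 = write_side.handle_deposit("a1", 100)
e2 = write_side.handle_deposit("a1", 50)

# The query side is stale until events are applied (eventual consistency).
assert read_side.balances.get("a1") is None
for e in (e1, e2):
    read_side.apply(e)
```

In a real deployment the `apply` step would run asynchronously (e.g., off a message bus), which is exactly where the "slightly stale" query data comes from.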
Building Your First Apache Apex (Next Gen Big Data/Hadoop) Application - Apache Apex
This document provides an overview of building a first Apache Apex application. It describes the main concepts of an Apex application including operators that implement interfaces to process streaming data within windows. The document outlines a "Sorted Word Count" application that uses various operators like LineReader, WordReader, WindowWordCount, and FileWordCount. It also demonstrates wiring these operators together in a directed acyclic graph and running the application to process streaming data.
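The operator pipeline described above can be approximated in plain Python. The operator names mirror the talk (LineReader, WordReader, WindowWordCount, FileWordCount), but the wiring is an illustrative sketch, not the real Apex DAG API, and the streaming windows are collapsed into a single batch for simplicity.

```python
# Plain-Python sketch of the "Sorted Word Count" pipeline: each function
# stands in for one operator in the directed acyclic graph.
from collections import Counter

def line_reader(text):
    # LineReader: emit the input one line at a time.
    for line in text.splitlines():
        yield line

def word_reader(lines):
    # WordReader: split lines into normalized words.
    for line in lines:
        for word in line.split():
            yield word.lower()

def window_word_count(words):
    # WindowWordCount: in Apex, counts would be emitted per streaming
    # window; here we collapse everything into a single "window".
    return Counter(words)

def file_word_count(counts):
    # FileWordCount: emit (word, count) pairs sorted by descending count,
    # breaking ties alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

text = "to be or not to be"
result = file_word_count(window_word_count(word_reader(line_reader(text))))
```

The generator chaining mimics how downstream operators consume tuples emitted by upstream operators in the DAG.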
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data) - Apache Apex
Presenter:
Priyanka Gugale, Committer for Apache Apex and Software Engineer at DataTorrent.
In this session we will cover an introduction to YARN, its architecture, and the YARN application lifecycle. We will also learn how Apache Apex runs as a YARN application in Hadoop.
Towards True Elasticity of Spark (Michael Le and Min Li, IBM) - Spark Summit
Scaling Spark workloads on YARN and Mesos can provide significant performance improvements, but the benefits vary across workloads. Adding resources alone may not fully utilize the new nodes, because of delays in scheduling tasks locally on them. Tuning Spark's locality wait time parameter so that task placement preference changes more quickly can help make better use of new resources. Dynamic executor allocation in Spark can also be enhanced to adjust configuration settings such as the locality wait time during auto-scaling.
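The locality-wait trade-off described above can be sketched abstractly. This is not Spark's scheduler code; the function, node names, and numbers are all hypothetical, and only illustrate why a long wait leaves newly added nodes idle.

```python
# Illustrative sketch: a task prefers the node holding its data, but after
# waiting `locality_wait_s` seconds the scheduler gives up on locality and
# runs it on any free node.

def place_task(preferred_node, free_nodes, waited_s, locality_wait_s):
    """Return the node to run on, or None to keep waiting for locality."""
    if preferred_node in free_nodes:
        return preferred_node                         # data-local placement
    if waited_s >= locality_wait_s:
        return free_nodes[0] if free_nodes else None  # fall back, use any node
    return None                                       # keep waiting

# With a long wait, a newly added (non-local) node sits idle:
slow = place_task("node1", ["new-node"], waited_s=1, locality_wait_s=3)

# Lowering the wait lets the scheduler use the new node quickly:
fast = place_task("node1", ["new-node"], waited_s=1, locality_wait_s=0.5)
```

In Spark the corresponding knob is the locality wait configuration; the point of the talk is that shrinking it during auto-scaling lets tasks spill onto fresh nodes sooner.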
Windowing in Apache Apex divides unbounded streaming data into finite time slices called windows to allow for computation. It uses time as a reference to break streams into windows, addressing issues like failure recovery and providing frames of reference. Operators can perform window-level processing by implementing callbacks for window start and end. Windows provide rolling statistics by accumulating results over multiple windows and emitting periodically. Windowing has lower latency than micro-batch systems as records are processed immediately rather than waiting for batch boundaries.
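The windowing and rolling-statistics behavior described above can be sketched as follows. This is an illustrative model, not Apex's implementation: function names, the event tuples, and the window size are all made up for the example.

```python
# Sketch of tumbling time windows with a rolling aggregate: events carry
# timestamps, are bucketed into fixed-size windows, and a rolling sum is
# emitted over the last n windows.

def window_id(timestamp_s, window_size_s):
    # Time is the frame of reference that breaks the stream into windows.
    return timestamp_s // window_size_s

def bucket(events, window_size_s):
    windows = {}
    for ts, value in events:
        windows.setdefault(window_id(ts, window_size_s), []).append(value)
    return windows

def rolling_sums(windows, n):
    """Rolling sum over each window and the n-1 windows before it."""
    ids = sorted(windows)
    out = {}
    for i, wid in enumerate(ids):
        recent = ids[max(0, i - n + 1): i + 1]
        out[wid] = sum(sum(windows[w]) for w in recent)
    return out

events = [(0, 1), (1, 2), (5, 10), (11, 100)]    # (seconds, value)
windows = bucket(events, window_size_s=5)        # window ids 0, 1, 2
rolled = rolling_sums(windows, n=2)
```

Note that each record is assigned to its window as soon as it arrives, which is why window-based processing can emit results with lower latency than waiting for a micro-batch boundary.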
The 5 People in your Organization that grow Legacy Code - Roberto Cortez
Have you ever looked at a random piece of code and wanted to rewrite it so badly? It's natural to have legacy code in your application at some point. It's something that you need to accept and learn to live with. So is this a lost cause? Should we just throw in the towel and give up? Hell no! Over the years, I learned to identify 5 main creators/enablers of legacy code on the engineering side, which I'm sharing here with you using real development stories (with a little humour in the mix). Learn to keep them in line and your code will live longer!
This document provides an overview of basic Hadoop commands for interacting with the Hadoop Distributed File System (HDFS). It lists commands for creating directories, listing files, copying data between local and HDFS, copying within HDFS, viewing file contents, deleting files, getting help for commands, and viewing HDFS through a web browser. Contact information is provided at the end for additional support.
Introduction to Apache Apex and writing a big data streaming application - Apache Apex
Introduction to Apache Apex - The next generation native Hadoop platform, and writing a native Hadoop big data Apache Apex streaming application.
This talk will cover details about how Apex can be used as a powerful and versatile platform for big data. Apache Apex is being used in production by customers for both streaming and batch use cases. Common uses of Apache Apex include big data ingestion, streaming analytics, ETL, fast batch, alerts, real-time actions, threat detection, etc.
Presenter: <b>Pramod Immaneni</b>, Apache Apex PPMC member and senior architect at DataTorrent Inc, where he works on Apex and specializes in big data applications. Prior to DataTorrent he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in the core networking space and was granted patents in peer-to-peer VPNs. Before that he was a technical co-founder of a mobile startup, where he was an architect of a dynamic content rendering engine for mobile devices.
This is a video of the webcast of an Apache Apex meetup event organized by Guru Virtues at 267 Boston Rd no. 9, North Billerica, MA, on <b>May 7th 2016</b>, and broadcast from San Jose, CA. If you are interested in helping organize the Apache Apex community (e.g., hosting, presenting, or community leadership), please email apex-meetup@datatorrent.com
HDFS stores files as blocks that are by default 64 MB in size to minimize disk seek times. The namenode manages the file system namespace and metadata, tracking which datanodes store each block. When writing a file, HDFS breaks it into blocks and replicates each block across multiple datanodes. The secondary namenode periodically merges namespace and edit log changes to prevent the log from growing too large. Small files are inefficient in HDFS due to each file requiring namespace metadata regardless of size.
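The block-splitting and small-files arithmetic above is easy to make concrete. This is a back-of-the-envelope sketch, not HDFS code; the 64 MB block size comes from the summary, and the replication factor of 3 is the commonly cited default, assumed here for illustration.

```python
# Back-of-the-envelope model of HDFS block splitting and replication.

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB default block size (per the summary)
REPLICATION = 3                 # assumed typical replication factor

def split_into_blocks(file_size):
    """Number of blocks a file of `file_size` bytes occupies (>= 1)."""
    return max(1, -(-file_size // BLOCK_SIZE))   # ceiling division

def namenode_entries(file_sizes):
    # Each file needs namespace metadata regardless of size, which is why
    # many small files are inefficient: metadata grows with file count,
    # and it all lives in the namenode's memory.
    return len(file_sizes)

big_file = 640 * 1024 * 1024
blocks = split_into_blocks(big_file)     # one 640 MB file -> 10 blocks
replicas = blocks * REPLICATION          # 30 block replicas cluster-wide

# The same bytes as 1 KB files: one metadata entry per file, each still
# occupying a whole (mostly empty) block.
small_entries = namenode_entries([1024] * 1000)
```

One 640 MB file costs 1 namespace entry and 10 blocks; the same data as thousands of tiny files costs thousands of entries, which is the small-files problem in a nutshell.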
Apache Hadoop: design and implementation. Lecture in the Big data computing course (http://twiki.di.uniroma1.it/twiki/view/BDC/WebHome), Department of Computer Science, Sapienza University of Rome.
Slides from the Introduction to UNIX Command-Lines class from the BTI Plant Bioinformatics course 2014. This is a course taught by the Sol Genomics Network researchers at the Boyce Thompson Institute.
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex - Apache Apex
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
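Among the generic building blocks mentioned above, the write-ahead log is the easiest to illustrate. The sketch below is purely illustrative (it is not Malhar code, and the in-memory list stands in for a durable log file): an operator logs each update before applying it, so its state can be rebuilt after a failure by replaying the log.

```python
# Minimal write-ahead-log sketch for fault-tolerant operator state.

class WriteAheadLog:
    def __init__(self):
        self.entries = []           # stand-in for a durable log on disk/HDFS

    def append(self, entry):
        self.entries.append(entry)  # persist the intent before mutating state

class CountingOperator:
    def __init__(self, wal):
        self.wal = wal
        self.counts = {}

    def process(self, key):
        self.wal.append(("incr", key))   # log first...
        self.counts[key] = self.counts.get(key, 0) + 1  # ...then apply

    @classmethod
    def recover(cls, wal):
        """Rebuild operator state by replaying the log after a crash."""
        op = cls(WriteAheadLog())
        for action, key in wal.entries:
            if action == "incr":
                op.counts[key] = op.counts.get(key, 0) + 1
        return op

wal = WriteAheadLog()
op = CountingOperator(wal)
for k in ["a", "b", "a"]:
    op.process(k)

recovered = CountingOperator.recover(wal)   # state survives the "failure"
```

Combined with checkpointing, a real system replays only the log suffix after the last checkpoint, which is where the incremental state saving mentioned above comes in.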
Apache Spark in Depth: Core Concepts, Architecture & Internals - Anton Kirillov
Slides cover core Apache Spark concepts such as RDDs, the DAG, the execution workflow, the forming of task stages, and the shuffle implementation, and also describe the architecture and main components of the Spark driver. The workshop part covers Spark execution modes and provides a link to a GitHub repo containing example Spark applications and a dockerized Hadoop environment to experiment with.
Intro to Apache Apex (next gen Hadoop) & comparison to Spark Streaming - Apache Apex
Presenter: Devendra Tagare - DataTorrent Engineer, Contributor to Apex, Data Architect experienced in building high scalability big data platforms.
Apache Apex is a next generation native Hadoop big data platform. This talk will cover details about how it can be used as a powerful and versatile platform for big data.
Apache Apex is a native Hadoop data-in-motion platform. We will discuss architectural differences between Apache Apex and Spark Streaming, and how these differences affect use cases like ingestion, fast real-time analytics, data movement, ETL, fast batch, very low latency SLAs, high throughput and large scale ingestion.
We will cover fault tolerance, low latency, connectors to sources/destinations, smart partitioning, processing guarantees, computation and scheduling model, state management and dynamic changes. We will also discuss how these features affect time to market and total cost of ownership.
Capital One's Next Generation Decision in less than 2 ms - Apache Apex
This document discusses using Apache Apex for real-time decision making within 2 milliseconds. It provides performance benchmarks for Apex, showing average latency of 0.25ms for over 54 million events with 600GB of RAM. It compares Apex favorably to other streaming technologies like Storm and Flink, noting Apex's self-healing capabilities, independence of operators, and ability to meet latency and throughput requirements even during failures. The document recommends Apex for its maturity, fault tolerance, and ability to meet the goals of latency under 16ms, 99.999% availability, and scalability.
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop) - Apache Apex
This presentation will introduce the usage of Apache Apex for the Time Series & Data Ingestion Service in General Electric's Internet of Things Predix platform. Apache Apex is a native Hadoop data-in-motion platform that is being used by customers for both streaming as well as batch processing. Common use cases include ingestion into Hadoop, streaming analytics, ETL, database off-loads, alerts and monitoring, machine model scoring, etc.
Abstract: Predix is a General Electric platform for the Internet of Things. It helps users develop applications that connect industrial machines with people through data and analytics for better business outcomes. Predix offers a catalog of services that provide core capabilities required by industrial internet applications. We will deep dive into the Predix Time Series and Data Ingestion services, leveraging the fast, scalable, highly performant, and fault-tolerant capabilities of Apache Apex.
Speakers:
- Venkatesh Sivasubramanian, Sr Staff Software Engineer, GE Predix & Committer of Apache Apex
- Pramod Immaneni, PPMC member of Apache Apex, and DataTorrent Architect
This document summarizes Joshua Hoffman's talk on scalable system operations at Tumblr. The talk outlines Tumblr's management stack for automating server provisioning including iPXE, Invisible Touch, Collins, Phil, Kickstart, and Puppet. It describes how the tools are used together in workflows for server intake, provisioning, and addressing challenges like configuring networking and storage during installation. The talk emphasizes principles like modularity, simplicity, and avoiding breaking the operating system.
Benchmarks, performance, scalability, and capacity: what's behind the numbers... - james tong
Baron Schwartz gave a presentation on analyzing database performance beyond surface-level metrics and benchmarks. He discussed how ideal benchmarks provide full system specifications and metrics over time to understand response times and throughput. Little's Law and queueing theory can predict concurrency, response times, and capacity given arrival rates and service times. While tools like Erlang C model queues, the assumptions must be validated. True scalability is nonlinear due to bottlenecks, and debunking performance claims requires examining raw data.
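Little's Law, mentioned above, is simple enough to work through directly: the average number of requests in flight equals the arrival rate times the average time each request spends in the system (L = lambda * W). The numbers below are illustrative, not from the talk.

```python
# Worked example of Little's Law: L = lambda * W.

def littles_law_concurrency(arrival_rate_per_s, response_time_s):
    """Average number of concurrent requests in the system."""
    return arrival_rate_per_s * response_time_s

# 500 requests/s with a 20 ms average response time means, on average,
# 10 requests are in flight at any instant:
concurrency = littles_law_concurrency(500, 0.020)

def utilization(arrival_rate_per_s, service_time_s, servers):
    # Fraction of the server pool that is busy; as this approaches 1,
    # queueing theory predicts response times grow sharply.
    return arrival_rate_per_s * service_time_s / servers

busy_fraction = utilization(500, 0.020, servers=20)   # pool half busy
```

This is also why the talk insists on metrics over time: the law relates long-run averages, and applying it to a single spiky interval (or to a system whose assumptions a model like Erlang C does not match) gives misleading capacity estimates.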
The document discusses various stability antipatterns that can cause systems to fail. It describes issues that can arise from integration points, database calls that hang, unexpected failures from external systems, traffic surges overwhelming a system, attacks from users, unbalanced capacities across systems, unbounded result sets, and more. It provides examples of each antipattern and emphasizes the importance of monitoring dependencies, using timeouts, testing with realistic data volumes, and implementing circuit breakers and other proven patterns to prevent failures from cascading across systems and spreading.
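The circuit-breaker pattern mentioned above can be sketched in a few lines. This is an illustrative minimal version (no half-open state or reset timeout, and all names are made up): after a threshold of consecutive failures the breaker "opens" and fails fast instead of letting a hung integration point drag the caller down.

```python
# Minimal circuit-breaker sketch: fail fast once a dependency looks down.

class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.open = False

    def call(self, fn):
        if self.open:
            # Fail fast without touching the broken dependency, so the
            # failure does not cascade through the calling system.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True
            raise
        self.failures = 0            # a success resets the failure count
        return result

breaker = CircuitBreaker(failure_threshold=2)

def flaky():
    raise IOError("remote system timed out")

fast_failures = 0
for _ in range(4):
    try:
        breaker.call(flaky)
    except RuntimeError:
        fast_failures += 1   # breaker open: remote call never attempted
    except IOError:
        pass                 # real failure, counted by the breaker
```

A production breaker would also add timeouts on the underlying call and a half-open probe to detect recovery, per the patterns the document recommends.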
The right read optimization is actually write optimization - james tong
Fractal Tree indexes provide a way to optimize both reads and writes for large, growing datasets. They achieve this by combining aspects of log-structured merge trees (LSM trees) and B-trees - buffering data during writes like LSM trees to batch inserts, but maintaining a B-tree structure for efficient queries. This allows fractal tree indexes to have very fast insertion performance like LSM trees while also supporting fast queries like B-tree indexes. However, fractal tree indexes do introduce more complexity in the tree structure that can make concurrency more difficult.
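The write-buffering idea shared by LSM trees and Fractal Tree indexes can be sketched abstractly. This is not a real fractal-tree (there is no nested per-node buffering or on-disk structure; the class and names are invented for illustration), but it shows the core trade: inserts land in a cheap in-memory buffer and are merged into the sorted structure in batches, while queries must check both places.

```python
# Sketch of batched write buffering in front of a sorted index.
import bisect

class BufferedIndex:
    def __init__(self, buffer_limit=4):
        self.sorted_keys = []        # stand-in for the on-disk sorted structure
        self.buffer = []             # in-memory write buffer
        self.buffer_limit = buffer_limit
        self.merges = 0              # how many bulk merges have happened

    def insert(self, key):
        self.buffer.append(key)      # cheap: no tree traversal per insert
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # One bulk merge amortizes the cost of many individual inserts.
        self.sorted_keys = sorted(self.sorted_keys + self.buffer)
        self.buffer = []
        self.merges += 1

    def contains(self, key):
        # Queries are slightly more complex: check the buffer AND the
        # sorted structure -- the concurrency/complexity cost noted above.
        if key in self.buffer:
            return True
        i = bisect.bisect_left(self.sorted_keys, key)
        return i < len(self.sorted_keys) and self.sorted_keys[i] == key

idx = BufferedIndex(buffer_limit=4)
for k in [9, 3, 7, 1, 5]:
    idx.insert(k)     # one merge happens when the 4th insert fills the buffer
```

Five inserts trigger a single merge instead of five random updates to the sorted structure, which is the mechanism behind the fast-insert, still-queryable behavior described above.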