SlideShare feed: slideshows by user arunkejariwal (last updated Thu, 25 Jun 2020 05:44:20 GMT)

Anomaly Detection At The Edge
/slideshow/anomaly-detection-at-the-edge/236189536 (Thu, 25 Jun 2020)

With IoT becoming ubiquitous, there has been strong interest in the industry in developing novel techniques for anomaly detection (AD) at the edge. Example applications include, but are not limited to, smart cities/grids of sensors, industrial process control in manufacturing, smart homes, wearables, connected vehicles, and agriculture (sensing for soil moisture and nutrients). What makes anomaly detection at the edge different? The following constraints, whether due to the sensors or the applications, necessitate the development of new algorithms for AD:
* Very low power and limited compute/memory resources
* High data volume, which makes centralized AD infeasible owing to the communication overhead
* Need for low latency to drive fast action
* Guaranteeing privacy
In this talk we discuss the above in detail. Subsequently, we walk through the algorithm design process for anomaly detection at the edge. Specifically, we dive into the need to build small models/ensembles owing to the limited memory on the sensors, and into how to train models in an online fashion, since long-term historical data is not available due to limited storage. Given the need for data compression to contain the communication overhead, can one carry out anomaly detection on compressed data? We cover building small models, sequential and one-shot learning algorithms, compressing the data with the models, and limiting communication to only the data corresponding to the anomalies plus the model description. We illustrate the above with concrete examples from the wild!

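A minimal sketch of the kind of small-footprint detector the talk motivates: Welford's algorithm maintains a running mean and variance in constant memory, so a sensor can score each new reading online without storing history. The 3-sigma threshold and the warm-up length are illustrative assumptions, not settings from the talk.

class OnlineAnomalyDetector:
    """Constant-memory anomaly detector based on Welford's algorithm.

    Suited to edge devices: no historical window is stored, and the
    running mean/variance are updated in O(1) time and space.
    """

    def __init__(self, threshold=3.0, warmup=30):
        self.threshold = threshold  # flag points beyond this many sigmas
        self.warmup = warmup        # observations before flagging starts
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def update(self, x):
        """Ingest one observation; return True if it looks anomalous."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = (self.m2 / (self.n - 1)) ** 0.5
            if std > 0 and abs(x - self.mean) > self.threshold * std:
                is_anomaly = True
        # Welford update (numerically stable running mean/variance).
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return is_anomaly
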
Serverless Streaming Architectures and Algorithms for the Enterprise
/slideshow/serverless-streaming-architectures-and-algorithms-for-the-enterprise-175954094/175954094 (Wed, 25 Sep 2019)

In recent years, serverless has gained momentum in the realm of cloud computing. Broadly speaking, it comprises function as a service (FaaS) and backend as a service (BaaS). The distinction between the two is that under FaaS one writes and maintains the code (e.g., the functions) for serverless compute, whereas under BaaS the platform provides the functionality and manages the operational complexity behind it. Serverless is a great means to boost development velocity. With greatly reduced infrastructure costs, more agile and focused teams, and faster time to market, enterprises are increasingly adopting serverless approaches to gain a key advantage over their competitors. Early use cases of serverless include data transformation in batch and ETL scenarios and data processing using MapReduce patterns. As a natural extension, serverless is being used in streaming contexts such as, but not limited to, real-time bidding, fraud detection, and intrusion detection. Serverless is, arguably, naturally suited to extracting insights from fast data, that is, high-volume, high-velocity data. Example tasks in this regard include filtering and reducing noise in the data and leveraging machine learning and deep learning models to provide continuous insights about business operations. We walk the audience through the landscape of streaming systems for each stage of an end-to-end data processing pipeline: messaging, compute, and storage. We overview the inception and growth of the serverless paradigm. Further, we deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar Functions, and paint a bird's-eye view of the application domains where Pulsar Functions can be leveraged. Baking intelligence into a serverless flow is paramount from a business perspective. To this end, we detail different serverless patterns (event processing, machine learning, and analytics) for different use cases and highlight the trade-offs. We present perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of serverless streaming architectures and algorithms. The topics covered include an introduction to streaming, an introduction to serverless, serverless and streaming requirements, Apache Pulsar, application domains, serverless event processing patterns, serverless machine learning patterns, and serverless analytics patterns.

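As one concrete anchor for the FaaS side discussed above, a hedged sketch of a function written against the Pulsar Functions Python SDK: Pulsar consumes from the input topic, invokes process() per event, and publishes the return value to the output topic, so the function body is the only code the user maintains. The JSON event shape is an assumption for illustration.

import json

from pulsar import Function

class EnrichEvent(Function):
    def process(self, input, context):
        # Deserialize one event from the input topic.
        event = json.loads(input)
        # Enrich it with lightweight metadata before it flows on to a
        # downstream stage (e.g., an analytics or ML scoring function).
        event["enriched_by"] = context.get_function_name()
        return json.dumps(event)  # published to the output topic
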
Sequence-to-Sequence Modeling for Time Series
/slideshow/sequencetosequence-modeling-for-time-series-143646650/143646650 (Sat, 04 May 2019)

In this talk we overview sequence-to-sequence (seq2seq) modeling and explore its early use cases. We walk the audience through how to leverage seq2seq modeling for several use cases, particularly real-time anomaly detection and forecasting.

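A hedged PyTorch sketch of the encoder-decoder (seq2seq) pattern the talk overviews, applied to univariate forecasting: encode the observed window, then unroll the decoder over the horizon, feeding each prediction back as the next input. The hidden size, horizon, and feed-back decoding loop are illustrative choices, not the talk's exact architecture.

import torch
import torch.nn as nn

class Seq2SeqForecaster(nn.Module):
    """Minimal encoder-decoder for univariate time series forecasting."""

    def __init__(self, hidden_size=64, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden_size,
                               batch_first=True)
        self.decoder = nn.LSTM(input_size=1, hidden_size=hidden_size,
                               batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                    # x: (batch, T, 1)
        _, state = self.encoder(x)           # summarize the history
        step = x[:, -1:, :]                  # seed with last observation
        outputs = []
        for _ in range(self.horizon):
            out, state = self.decoder(step, state)
            step = self.head(out)            # next-step prediction
            outputs.append(step)
        return torch.cat(outputs, dim=1)     # (batch, horizon, 1)

Training would minimize, say, the MSE between the returned horizon and the actual future window.
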
Sequence-to-Sequence Modeling for Time Series
/slideshow/sequencetosequence-modeling-for-time-series/143097168 (Wed, 01 May 2019)

Sequence-to-sequence (seq2seq) modeling is now being used for applications based on time series data. We overview seq2seq modeling and explore its early use cases, and we then walk the audience through how to leverage it for a couple of concrete use cases: real-time anomaly detection and forecasting.

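For the real-time anomaly detection use case above, one hedged sketch: score each observation by its forecast residual (from a model such as the one in the previous entry) and flag unusually large deviations. The 3-sigma rule is an assumption, not the talk's criterion.

import numpy as np

def flag_forecast_anomalies(actual, forecast, k=3.0):
    """Flag points whose forecast residual deviates by more than
    k standard deviations from the mean residual."""
    residuals = (np.asarray(actual, dtype=float)
                 - np.asarray(forecast, dtype=float))
    sigma = residuals.std()
    if sigma == 0:
        return np.zeros(residuals.shape, dtype=bool)
    return np.abs(residuals - residuals.mean()) > k * sigma
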
Model Serving via Pulsar Functions
/slideshow/model-serving-via-pulsar-functions/143094384 (Wed, 01 May 2019)

In this talk we walk through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. We then describe how to apply Pulsar Functions to support two example uses, sampling and filtering, and explore a concrete case study of the same.

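A hedged sketch of the sampling and filtering uses named above, again in the Pulsar Functions Python SDK: returning a value publishes it to the function's output topic, returning None drops the event. The score field, threshold, and sampling rate are assumptions for illustration; the talk's hot model-update mechanism is not shown here.

import json
import random

from pulsar import Function

class SampleAndFilter(Function):
    def process(self, input, context):
        event = json.loads(input)
        # Filtering: always forward events the model deems suspicious.
        if event.get("score", 0.0) >= 0.95:
            return input
        # Sampling: forward roughly 1% of the remaining traffic.
        if random.random() < 0.01:
            return input
        return None  # drop everything else
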
Designing Modern Streaming Data Applications
/slideshow/designing-modern-streaming-data-applications-115037555/115037555 (Mon, 17 Sep 2018)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Enterprises in these segments need to process this fast data just in time to derive insights and act upon them quickly. Such tasks include, but are not limited to, enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. To realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming compute and storage frameworks for real-time data. In this tutorial we lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline: messaging frameworks, streaming compute frameworks, storage frameworks for real-time data, and more. We also share case studies from IoT, gaming, and healthcare, as well as our experience operating these systems at internet scale at Twitter and Yahoo. We conclude by offering perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially on the order of hundreds of millions) set of data streams. Topics include:
* An introduction to streaming
* Common data processing patterns
* Different types of end-to-end stream processing architectures
* How to seamlessly move data across different frameworks
* Case studies: healthcare and IoT
* Data sketches for mining insights from data streams

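For the last topic in the list above, a hedged sketch of one classical data sketch, the Count-Min Sketch: per-key counts over an unbounded stream in fixed memory, at the cost of a bounded overestimate from hash collisions. The width and depth are illustrative choices.

import hashlib

class CountMinSketch:
    """Approximate per-key frequency counts in fixed memory."""

    def __init__(self, width=2048, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # One independent-ish hash per row, derived from a salted digest.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # True count <= estimate; collisions can only inflate a cell.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
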
Correlation Analysis on Live Data Streams
/slideshow/correlation-analysis-on-live-data-streams-114511782/114511782 (Fri, 14 Sep 2018)

There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grows, partly a result of the expanding IoT, it is critical to develop techniques to better extract actionable insights. One current application, anomaly detection, is a necessary but insufficient step: anomaly detection over a set of live data streams may result in anomaly fatigue, limiting effective decision making. One way to address this is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps surface actionable insights faster. In this talk, we explain how marrying correlation analysis with anomaly detection can help, and we share techniques to guide effective decision making. Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Overview of alternative measures, such as the co-median
* Trade-offs between speed and accuracy
* Correlation analysis in large dimensions

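A hedged sketch of the robustness and speed/accuracy themes above: Pearson correlation over a sliding window is fast but outlier-sensitive, while rank-based Spearman is a simple robust alternative (the co-median mentioned above is another). The window length is an illustrative choice.

import numpy as np
from scipy import stats

def windowed_correlations(x, y, window=60):
    """Return per-window Pearson and Spearman correlations of two
    equal-length series; large gaps between the two suggest that
    outliers are distorting the (non-robust) Pearson estimate."""
    pearson, spearman = [], []
    for i in range(window, len(x) + 1):
        xs, ys = x[i - window:i], y[i - window:i]
        pearson.append(stats.pearsonr(xs, ys)[0])
        spearman.append(stats.spearmanr(xs, ys)[0])
    return np.array(pearson), np.array(spearman)
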
Deep Learning for Time Series Data
/slideshow/deep-learning-for-time-series-data-113858051/113858051 (Tue, 11 Sep 2018)

In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the two topics are intertwined, and detail the challenges one may encounter, based on production data. We also showcase how deep learning can be leveraged to learn nonlinear correlation, which in turn can be used to further contain the false positive rate of an anomaly detection system. Further, we provide an overview of how correlation can be leveraged for common representation learning.

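The talk's approach to nonlinear correlation is deep-learning based; as a lightweight classical point of comparison, this hedged sketch computes distance correlation, which is zero only under independence and therefore captures nonlinear association that Pearson misses.

import numpy as np

def distance_correlation(x, y):
    """Distance correlation of two univariate samples (O(n^2) memory)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a = np.abs(x - x.T)                                  # pairwise distances
    b = np.abs(y - y.T)
    # Double-center each distance matrix.
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()                               # squared dCov
    denom = np.sqrt(np.sqrt((A * A).mean() * (B * B).mean()))
    return 0.0 if denom == 0 else np.sqrt(max(dcov2, 0.0)) / denom
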
Correlation Analysis on Live Data Streams
/slideshow/correlation-analysis-on-live-data-streams/98597865 (Fri, 25 May 2018)

There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grows, partly a result of the expanding IoT, it is critical to develop techniques to better extract actionable insights. One current application, anomaly detection, is a necessary but insufficient step: anomaly detection over a set of live data streams may result in anomaly fatigue, limiting effective decision making. One way to address this is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps surface actionable insights faster. In this talk we explain how marrying correlation analysis with anomaly detection can help, and we share techniques to guide effective decision making. Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Trade-offs between speed and accuracy
* Multi-modal correlation analysis

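On the speed/accuracy trade-off listed above, a hedged sketch of the fast end: an incremental Pearson correlation over two live streams in constant memory, with no window recomputation. It assumes exact correlation over all history is acceptable; a decay factor or sliding window is the usual remedy when drift matters.

class OnlineCorrelation:
    """Streaming Pearson correlation via Welford-style co-moments."""

    def __init__(self):
        self.n = 0
        self.mx = self.my = 0.0              # running means
        self.cxy = self.vx = self.vy = 0.0   # co-moment, variances

    def update(self, x, y):
        self.n += 1
        dx = x - self.mx                     # deviation from old mean
        dy = y - self.my
        self.mx += dx / self.n
        self.my += dy / self.n
        self.cxy += dx * (y - self.my)       # uses the updated mean
        self.vx += dx * (x - self.mx)
        self.vy += dy * (y - self.my)

    def correlation(self):
        if self.vx <= 0 or self.vy <= 0:
            return float("nan")
        return self.cxy / (self.vx * self.vy) ** 0.5
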
Live Anomaly Detection
/slideshow/live-anomaly-detection-80287265/80287265 (Fri, 29 Sep 2017)

… compute tier. Detection and filtering of anomalies in live data is of paramount importance for robust decision making. To this end, in this talk we share techniques for anomaly detection in live data.

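A hedged sketch of one live-data technique in the spirit of the abstract: an EWMA control chart that smooths the stream and flags points that land far outside a running deviation band. Alpha and the band width L are illustrative assumptions; the talk covers a broader set of techniques.

class EwmaDetector:
    """EWMA control chart for live streams in constant memory."""

    def __init__(self, alpha=0.1, L=3.0):
        self.alpha = alpha        # smoothing factor
        self.L = L                # band width in deviation units
        self.ewma = None
        self.ewmvar = 0.0         # exponentially weighted variance

    def update(self, x):
        if self.ewma is None:     # initialize on the first observation
            self.ewma = x
            return False
        err = x - self.ewma
        band = self.L * (self.ewmvar ** 0.5)
        is_anomaly = self.ewmvar > 0 and abs(err) > band
        # Update the smoothed level and the EW variance estimate.
        self.ewma += self.alpha * err
        self.ewmvar = (1 - self.alpha) * (self.ewmvar
                                          + self.alpha * err * err)
        return is_anomaly
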
Modern real-time streaming architectures
/slideshow/modern-realtime-streaming-architectures/80199031 (Wed, 27 Sep 2017)

In this tutorial we walk through state-of-the-art streaming systems, algorithms, and deployment architectures, cover the typical challenges in modern real-time big data platforms, and offer insights on how to address them. We also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, we explore the interplay between storage and stream processing and discuss future developments.

Anomaly detection in real-time data streams using Heron
/slideshow/anomaly-detection-in-realtime-data-streams-using-heron/73259550 (Fri, 17 Mar 2017)

Twitter has become the de facto medium for consumption of news in real time, and billions of events are generated and analyzed on a daily basis. To analyze these events, Twitter designed its own next-generation streaming system, Heron. Arun Kejariwal and Karthik Ramasamy walk you through how Heron is used to detect anomalies in real-time data streams. Although there have been over 75 years of prior work in anomaly detection, most of the techniques cannot be used off the shelf because they are not suitable for high-velocity data streams. Arun and Karthik explain how to make trade-offs between accuracy and speed and discuss incremental approaches that marry sampling with robust measures, such as the median and the MCD, for anomaly detection.

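A hedged sketch of the "marry sampling with robust measures" idea described above: maintain a fixed-size reservoir sample of the stream (Vitter's Algorithm R) and flag values that sit far from the sample median on the MAD scale. The reservoir size, warm-up, and multiplier are illustrative assumptions; this is not Heron code.

import random

class ReservoirMedianDetector:
    """Sampling plus a robust location/scale estimate for streams."""

    def __init__(self, capacity=1024, k=5.0, seed=42):
        self.capacity = capacity
        self.k = k
        self.reservoir = []
        self.seen = 0
        self.rng = random.Random(seed)

    def _median(self, values):
        s = sorted(values)
        mid = len(s) // 2
        return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])

    def update(self, x):
        is_anomaly = False
        if len(self.reservoir) >= 32:   # warm up before flagging
            med = self._median(self.reservoir)
            mad = self._median([abs(v - med) for v in self.reservoir])
            if mad > 0 and abs(x - med) > self.k * 1.4826 * mad:
                is_anomaly = True
        # Algorithm R: each item survives with probability capacity/seen.
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(x)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.reservoir[j] = x
        return is_anomaly
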
Data Data Everywhere: Not An Insight to Take Action Upon
/slideshow/data-data-everywhere-not-an-insight-to-take-action-upon/68420802 (Tue, 08 Nov 2016)

The big data era is characterized by the ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Sharing knowledge of techniques for the aforementioned problems helps the community build highly available, performant, and resilient systems. A key aspect of operations data is that data may be missing, leaving what are referred to as holes in the time series. This may happen for a wide variety of reasons, including (but not limited to):
# Packets being dropped due to unresponsive downstream services
# A network hiccup
# Transient hardware or software failure
# An issue with the data collection service
Holes in the time series can potentially skew the analysis of the data, which in turn can materially impact decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of holes in the time series: highlighting how missing data impacts common analyses such as anomaly detection and forecasting, discussing the implications of missing data for time series of different granularities, such as minutely and hourly, and exploring a gamut of techniques that can be used to address the missing data issue (e.g., approximating the data using interpolation, regression, ensemble methods, etc.). Arun then walks you through how these techniques can be leveraged using real data.

Tue, 08 Nov 2016 18:22:08 GMT /slideshow/data-data-everywhere-not-an-insight-to-take-action-upon/68420802 arunkejariwal@slideshare.net(arunkejariwal) Data Data Everywhere: Not An Insight to Take Action Upon arunkejariwal
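As a concrete illustration of the simplest of the gap-filling techniques named above, here is a minimal pandas sketch that interpolates holes in a minutely series. The data and method choices are illustrative; time-aware interpolation matters most when timestamps are irregular, which is part of why granularity affects the analysis.

```python
# Minimal sketch: filling "holes" in a minutely time series before analysis.
# The talk surveys interpolation, regression, and ensemble approaches; this
# shows only the simplest (linear and time-aware interpolation) in pandas.
import numpy as np
import pandas as pd

idx = pd.date_range("2016-11-08 00:00", periods=10, freq="min")
series = pd.Series(
    [5.0, 5.2, np.nan, np.nan, 5.9, 6.1, np.nan, 6.4, 6.6, 6.5], index=idx
)

filled_linear = series.interpolate(method="linear")  # straight line across the hole
filled_time = series.interpolate(method="time")      # weights by actual timestamps

# On an evenly spaced index the two agree; they diverge for irregular gaps.
print(pd.DataFrame({"raw": series, "linear": filled_linear, "time": filled_time}))
```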
Real Time Analytics: Algorithms and Systems /slideshow/real-time-analytics-algorithms-and-systems/61560346 realtimeanalyticsv7-160502002511
This tutorial presents an in-depth overview of the streaming analytics landscape: applications, algorithms, and platforms. We walk through how the field has evolved over the last decade and then discuss the current challenges: the impact of the other three Vs, viz., Volume, Variety, and Veracity, on Big Data streaming analytics.

Mon, 02 May 2016 00:25:11 GMT /slideshow/real-time-analytics-algorithms-and-systems/61560346 arunkejariwal@slideshare.net(arunkejariwal) Real Time Analytics: Algorithms and Systems arunkejariwal
Finding bad apples early: Minimizing performance impact /slideshow/finding-bad-apples-early-minimizing-performance-impact/54546741 velocityamsterdam2015-151029232221-lva1-app6891
The big data era is characterized by the ever-increasing velocity and volume of data. In order to store and analyze the ever-growing data, the operational footprint of data stores and Hadoop has also grown over time. (As per a recent report from IDC, spending on big data infrastructure is expected to reach $41.5 billion by 2018.) Such clusters comprise several thousand nodes. The high performance of such clusters is vital for delivering the best user experience and team productivity. The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:
# Robustness against anomalies (note that anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster)
# Given the varying data characteristics of different services, no one model fits all; consequently, we parameterized the threshold used for classification
The proposed technique works well with both hourly and daily data and has been in use in production by multiple services. This has not only eliminated manual investigation efforts but has also mitigated the impact of slow nodes, which used to be detected only after weeks or months of lag! We shall walk the audience through how the techniques are being used with real data.

Thu, 29 Oct 2015 23:22:21 GMT /slideshow/finding-bad-apples-early-minimizing-performance-impact/54546741 arunkejariwal@slideshare.net(arunkejariwal) Finding bad apples early: Minimizing performance impact arunkejariwal
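A minimal sketch of the flavor of technique described, assuming a robust distance from the cluster's median scaled by the MAD; the actual distance measure from the talk is not reproduced here, and the threshold is the per-service parameterized knob mentioned above.

```python
# Minimal sketch: flag slow/bad nodes whose metric (e.g., median task runtime)
# sits far from the cluster's robust center. The distance measure and threshold
# are illustrative, not the exact ones from the talk; the threshold is the
# knob that gets parameterized per service.
import numpy as np

def find_bad_nodes(node_metrics, threshold=3.5):
    """node_metrics: dict of node -> metric value. Returns flagged nodes."""
    values = np.array(list(node_metrics.values()), dtype=float)
    center = np.median(values)                        # robust to outlier nodes
    mad = np.median(np.abs(values - center)) or 1e-9  # robust spread estimate
    return [
        node for node, v in node_metrics.items()
        if 0.6745 * abs(v - center) / mad > threshold
    ]

# Usage: one straggler among otherwise healthy nodes.
metrics = {f"node-{i}": 100.0 + np.random.randn() for i in range(50)}
metrics["node-slow"] = 250.0
print(find_bad_nodes(metrics))  # -> ['node-slow'] (with high probability)
```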
Velocity 2015-final /slideshow/velocity-2015final/48738507 velocity-2015-final-150529055200-lva1-app6891
Stream Processing and Anomaly Detection @Twitter]]>

Stream Processing and Anomaly Detection @Twitter]]>
Fri, 29 May 2015 05:52:00 GMT /slideshow/velocity-2015final/48738507 arunkejariwal@slideshare.net(arunkejariwal) Velocity 2015-final arunkejariwal Stream Processing and Anomaly Detection @Twitter <img style="border:1px solid #C3E6D8;float:right;" alt="" src="https://cdn.slidesharecdn.com/ss_thumbnails/velocity-2015-final-150529055200-lva1-app6891-thumbnail.jpg?width=120&amp;height=120&amp;fit=bounds" /><br> Stream Processing and Anomaly Detection @Twitter
Statistical Learning Based Anomaly Detection @ Twitter /slideshow/statistical-learning-based-anomaly-detection-twitter/41660402 anomalydetectionvelocitynov2014-141117104202-conversion-gate02
Statistical Learning Based Anomaly Detection @ Twitter]]>

Statistical Learning Based Anomaly Detection @ Twitter]]>
Mon, 17 Nov 2014 10:42:02 GMT /slideshow/statistical-learning-based-anomaly-detection-twitter/41660402 arunkejariwal@slideshare.net(arunkejariwal) Statistical Learning Based Anomaly Detection @ Twitter arunkejariwal Statistical Learning Based Anomaly Detection @ Twitter <img style="border:1px solid #C3E6D8;float:right;" alt="" src="https://cdn.slidesharecdn.com/ss_thumbnails/anomalydetectionvelocitynov2014-141117104202-conversion-gate02-thumbnail.jpg?width=120&amp;height=120&amp;fit=bounds" /><br> Statistical Learning Based Anomaly Detection @ Twitter
Days In Green (DIG): Forecasting the life of a healthy service /slideshow/days-in-green-dig-forecasting-the-life-of-a-healthy-service/36364443 digv20-140626225504-phpapp02
]]>

]]>
Thu, 26 Jun 2014 22:55:04 GMT /slideshow/days-in-green-dig-forecasting-the-life-of-a-healthy-service/36364443 arunkejariwal@slideshare.net(arunkejariwal) Days In Green (DIG): Forecasting the life of a healthy service arunkejariwal <img style="border:1px solid #C3E6D8;float:right;" alt="" src="https://cdn.slidesharecdn.com/ss_thumbnails/digv20-140626225504-phpapp02-thumbnail.jpg?width=120&amp;height=120&amp;fit=bounds" /><br>
Gimme More! Supporting User Growth in a Performant and Efficient Fashion /slideshow/gimme-more-supporting-user-growth-in-a-performant-and-efficient-fashion/28254882 velocitylondon2013ak-131114141732-phpapp01
]]>

]]>
Thu, 14 Nov 2013 14:17:32 GMT /slideshow/gimme-more-supporting-user-growth-in-a-performant-and-efficient-fashion/28254882 arunkejariwal@slideshare.net(arunkejariwal) Gimme More! Supporting User Growth in a Performant and Efficient Fashion arunkejariwal <img style="border:1px solid #C3E6D8;float:right;" alt="" src="https://cdn.slidesharecdn.com/ss_thumbnails/velocitylondon2013ak-131114141732-phpapp01-thumbnail.jpg?width=120&amp;height=120&amp;fit=bounds" /><br>
A Systematic Approach to Capacity Planning in the Real World /slideshow/a-systematic-approach-to-capacity-planning-in-the-real-world/23213765 velocitysantaclara2013-130619191743-phpapp02
The presentation walks through the high-level methodology and details some of the statistical approaches.

Wed, 19 Jun 2013 19:17:43 GMT /slideshow/a-systematic-approach-to-capacity-planning-in-the-real-world/23213765 arunkejariwal@slideshare.net(arunkejariwal) A Systematic Approach to Capacity Planning in the Real World arunkejariwal
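As a hedged illustration of one statistical ingredient such a methodology might use, the sketch below fits a linear trend to daily peak utilization and extrapolates the day a capacity ceiling would be crossed. All numbers are made up, and the talk's actual approach is richer than a straight-line fit.

```python
# Minimal, hedged sketch of one capacity-planning ingredient: fit a linear
# trend to daily peak utilization and extrapolate to the day the fleet
# crosses a capacity ceiling. Data, ceiling, and names are illustrative.
import numpy as np

daily_peak_util = np.array(
    [52, 54, 53, 56, 58, 57, 60, 61, 63, 64], dtype=float  # percent, one per day
)
days = np.arange(len(daily_peak_util))

slope, intercept = np.polyfit(days, daily_peak_util, deg=1)  # least-squares trend
ceiling = 80.0  # utilization (%) at which more capacity must be provisioned

days_to_ceiling = (ceiling - intercept) / slope - days[-1]
print(f"trend: {slope:.2f}%/day; ceiling reached in ~{days_to_ceiling:.0f} days")
```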
I have diverse experience in statistical and machine learning, time series analysis, data-driven mobile marketing, software development, and hardware design. I have a strong publication record and am a strong advocate of open source. Have built a team of exceptional researchers from the ground up. An effective communicator, deft at working with cross-functional teams in different geographies. A seasoned architect with exceptional deep-dive analysis skills in both hardware and software, with extensive experience in R&D of novel techniques, based on statistical learning and time series analysis, to address end-user experience/revenue-impacting problems. Highly passionate about building...