FPGA-Based Acceleration Architecture for Spark SQL (Qi Xie and Quanfu Wang, Spark Summit)
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA's highly parallel computing capability to accelerate Spark SQL queries, and because FPGAs have higher power efficiency than CPUs, it lowers power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units that perform basic computations such as substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, and according to its pattern each is fed into an Engine Unit. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present performance benchmark results comparing queries run on the FPGA-based Spark SQL acceleration architecture (Xeon E5 plus FPGA) against Spark SQL queries on the Xeon E5 alone, showing 10x ~ 100x improvement, and we will demonstrate one SQL query workload from a real customer.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases, such as online and offline processing, more efficiently than alternatives like RabbitMQ. Kafka works by splitting topics into partitions spread across a cluster of machines and replicating each partition for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm that can be used in Apache Spark in various ways. This talk briefly summarizes how Apache Spark has evolved in this area, along with four main use cases, their benefits, and the next steps:
1) Zstandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It is beneficial not only when you use `emptyDir` with the `memory` medium, but also because it maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of Zstandard's buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area where you can save storage cost on cloud storage like S3 and improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work under way to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark's Avro file format in Spark 3.2.0. A configuration sketch covering the use cases above is shown below.
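As a rough illustration of the four use cases above, here is a minimal configuration sketch in Scala. The property names are the standard Spark settings for these codecs, but availability depends on your Spark, ORC, and Parquet versions, and the demo path and local master are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ZstdEverywhere extends App {
  // Hedged sketch: enabling Zstandard in several places at once.
  // Verify each property against the Spark / ORC / Parquet versions you run.
  val spark = SparkSession.builder()
    .appName("zstd-sketch")
    .master("local[*]")                                     // assumption: local demo run
    .config("spark.io.compression.codec", "zstd")           // 1) shuffle and spill files
    .config("spark.eventLog.compression.codec", "zstd")     // 2) event logs (takes effect when event logging is enabled)
    .config("spark.sql.orc.compression.codec", "zstd")      // 3) ORC data files
    .config("spark.sql.parquet.compression.codec", "zstd")  //    and Parquet data files
    .getOrCreate()

  // Example write: the resulting Parquet files are zstd-compressed.
  spark.range(1000000).toDF("id").write.mode("overwrite").parquet("/tmp/zstd_demo")
  spark.stop()
}
```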
Youtube Link: https://youtu.be/CXTiwkZVoZI
( Microservices Architecture Training: https://www.edureka.co/microservices-architecture-training )
This Edureka PPT on Spring Boot Interview Questions covers the top questions asked about Spring Boot.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
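To make the key/no-key behaviour concrete, here is a minimal, hedged sketch using the standard Kafka producer API from Scala; the topic name, key, and broker address are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PostalServiceDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // With a key: records with the same key always land in the same partition,
  // so the consumer of that partition sees them in order.
  producer.send(new ProducerRecord[String, String]("letters", "alice", "parcel-1"))

  // Without a key: records are spread across partitions, so different consumers
  // in a group may each receive some of them.
  producer.send(new ProducerRecord[String, String]("letters", "parcel-2"))

  producer.close()
}
```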
Celery is an asynchronous task queue/job queue based on distributed message passing. It allows tasks to be executed concurrently on one or more worker servers to minimize request times and offload intensive processes. Some key benefits are improved user experience through faster responses, scalability by adding more workers as needed, and flexibility through many customization points. Celery uses message brokers like RabbitMQ to handle task routing and can integrate with databases, caching, and other services.
In this presentation we describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
Best Practices for Enabling Speculative Execution on Large Scale Platforms (Databricks)
Apache Spark has a 'speculative execution' feature to handle slow tasks in a stage caused by environment issues such as a slow network or slow disks. If a task in a stage is running slowly, the Spark driver can launch a speculation task for it on a different host. Between the regular task and its speculation task, Spark later takes the result from whichever completes first and kills the slower one.
When we first enabled the speculation feature for all Spark applications by default on a large cluster of 10K+ nodes at LinkedIn, we observed that the default values of Spark's speculation configuration parameters did not work well for LinkedIn's batch jobs. For example, the system launched too many fruitless speculation tasks (i.e., tasks that were killed later). Besides, the speculation tasks did not help shorten the shuffle stages. In order to reduce the number of fruitless speculation tasks, we investigated the root causes, enhanced the Spark engine, and tuned the speculation parameters carefully. We analyzed the number of speculation tasks launched, the number of fruitful versus fruitless speculation tasks, and their corresponding CPU-memory resource consumption in gigabyte-hours. We were able to reduce average job response times by 13%, decrease the standard deviation of job elapsed times by 40%, and lower total resource consumption by 24% in a heavily utilized multi-tenant environment on a large cluster. In this talk, we will share our experience enabling speculative execution to achieve a good reduction in job elapsed time while keeping overhead minimal.
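The talk does not publish LinkedIn's final values, but as a hedged sketch these are the standard Spark properties involved in tuning speculation; the numbers are illustrative assumptions, not the speakers' recommendations.

```scala
import org.apache.spark.SparkConf

object SpeculationTuningSketch {
  // Hedged sketch of the speculation knobs; values are illustrative only.
  val conf: SparkConf = new SparkConf()
    .set("spark.speculation", "true")             // enable speculative execution
    .set("spark.speculation.interval", "1s")      // how often the driver checks for slow tasks
    .set("spark.speculation.multiplier", "3")     // a task must be this many times slower than the median
    .set("spark.speculation.quantile", "0.9")     // fraction of tasks that must finish before speculating
}
```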
The document discusses Scala, a programming language designed to be scalable. It can be used for both small and large programs. Scala combines object-oriented and functional programming. It interoperates seamlessly with Java but allows developers to add new constructs like actors through libraries. The Scala community is growing, with industrial adoption starting at companies like Twitter.
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster. It also uses WholeStageCodegen to improve performance through Java JIT-compiled code. However, the Java JIT is usually not very good at utilizing the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
Improving Apache Spark for Dynamic Allocation and Spot Instances (Databricks)
This presentation will explore the new work in Spark 3.1 adding the concept of graceful decommissioning and how we can use this to improve Spark’s performance in both dynamic allocation and spot/preemptable instances. Together we’ll explore how Spark’s dynamic allocation has evolved over time, and why the different changes have been needed. We’ll also look at the multi-company collaboration that resulted in being able to deliver this feature and I’ll end with encouraging pointers on how to get more involved in Spark’s development.
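As a hedged sketch, these are the kinds of properties involved when combining dynamic allocation with the Spark 3.1 decommissioning work; exact names and defaults should be checked against your release.

```scala
import org.apache.spark.SparkConf

object DecommissionSketch {
  // Hedged sketch; verify each property against a Spark 3.1+ release.
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")   // dynamic allocation without an external shuffle service
    .set("spark.decommission.enabled", "true")                        // graceful decommissioning for spot/preemptible nodes
    .set("spark.storage.decommission.enabled", "true")                // migrate blocks off a decommissioning executor
    .set("spark.storage.decommission.shuffleBlocks.enabled", "true")  // include shuffle blocks in the migration
}
```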
The document provides an introduction to shell scripting basics in UNIX/Linux systems. It discusses what a shell and shell script are, introduces common shells like bash, and covers basic shell scripting concepts like running commands, variables, conditionals, loops, and calling external programs. Examples are provided for many common shell scripting tasks like file manipulation, text processing, scheduling jobs, and more.
Watch this talk here: https://www.confluent.io/online-talks/from-zero-to-hero-with-kafka-connect-on-demand
Integrating Apache Kafka with other systems in a reliable and scalable way is often a key part of a streaming platform. Fortunately, Apache Kafka includes the Connect API that enables streaming integration both in and out of Kafka. Like any technology, understanding its architecture and deployment patterns is key to successful use, as is knowing where to go looking when things aren't working.
This talk will discuss the key design concepts within Apache Kafka Connect and the pros and cons of standalone vs distributed deployment modes. We'll do a live demo of building pipelines with Apache Kafka Connect for streaming data in from databases, and out to targets including Elasticsearch. With some gremlins along the way, we'll go hands-on in methodically diagnosing and resolving common issues encountered with Apache Kafka Connect. The talk will finish off by discussing more advanced topics including Single Message Transforms, and deployment of Apache Kafka Connect in containers.
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, very likely your jobs are configured incorrectly. As we found out, it is not straightforward, and it is not well documented either. This session will provide information on the types of memory to be aware of, the calculations involved in determining how much is allocated to each type of memory, and how to tune it depending on the use case.
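As a hedged illustration of the kind of calculation the session refers to, the sketch below walks through Spark's unified memory split for an assumed 8 GB executor heap, using the commonly documented 300 MB reserve and default fractions.

```scala
object MemorySplitSketch extends App {
  // Worked example of Spark's unified memory accounting (a sketch, not a tuning guide).
  val executorHeapMb  = 8 * 1024   // assumed executor JVM heap in MB (e.g. --executor-memory 8g)
  val reservedMb      = 300        // fixed reservation for Spark internals
  val memoryFraction  = 0.6        // spark.memory.fraction (default)
  val storageFraction = 0.5        // spark.memory.storageFraction (default)

  val usableMb  = executorHeapMb - reservedMb
  val unifiedMb = usableMb * memoryFraction         // shared execution + storage pool
  val storageMb = unifiedMb * storageFraction       // storage share (evictable; execution can borrow)
  val userMb    = usableMb * (1 - memoryFraction)   // left for user data structures, UDF objects, etc.

  println(f"unified pool: $unifiedMb%.0f MB, storage share: $storageMb%.0f MB, user memory: $userMb%.0f MB")
}
```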
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ... (Spark Summit)
Apache Spark 2.1.0 boosted the performance of Apache Spark SQL thanks to Project Tungsten software improvements. A further 16x speedup has been achieved by using Oracle's innovations for Apache Spark SQL. This 16x improvement is made possible by Oracle's Software in Silicon accelerator offload technologies.
Apache Spark SQL in-memory performance is becoming more important due to many factors. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition, on-prem and cloud servers are getting larger physical memory, enabling these huge workloads to be stored in memory. In this talk we will also look at using Spark SQL for feature creation and feature generation within Spark ML pipelines (a brief sketch of this idea appears after this abstract).
This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestions to support these kinds of workloads in real applications and cloud deployments. In addition, ideas for the next generation of the Tungsten project will be discussed.
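The abstract above mentions using Spark SQL for feature creation within Spark ML pipelines; as a hedged sketch of that idea, the example below uses SQLTransformer to derive feature columns with a SQL expression (the column names and data are made up for illustration).

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

object SqlFeatureSketch extends App {
  val spark = SparkSession.builder().appName("sql-feature-sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Toy input; column names are made up for illustration.
  val df = Seq((1, 2.0, 3.0), (2, 5.0, 1.0)).toDF("id", "v1", "v2")

  // SQLTransformer runs a SQL statement in which __THIS__ stands for the input table;
  // here it derives two new feature columns that a downstream ML stage could consume.
  val sqlTrans = new SQLTransformer().setStatement(
    "SELECT *, (v1 + v2) AS v_sum, (v1 * v2) AS v_prod FROM __THIS__")

  sqlTrans.transform(df).show()
  spark.stop()
}
```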
Linux uses /proc/iomem as a "Rosetta Stone" to establish relationships between software and hardware. /proc/iomem maps physical memory address ranges to the devices and drivers that claim them, similar to how the Rosetta Stone helped map Egyptian hieroglyphs to Greek and decode ancient Egyptian texts. This virtual file exposes the kernel's view of the physical address space, showing which regions belong to RAM and which are used for device I/O.
Introduction to Linux Kernel TCP/IP Protocol Stack (monad bobo)
This document provides an introduction and overview of the networking code in the Linux kernel source tree. It discusses the different layers including link (L2), network (L3), and transport (L4) layers. It describes the input and output processing, device interfaces, traffic directions, and major developers for each layer. Config and benchmark tools are also mentioned. Resources for further learning about the Linux kernel networking code are provided at the end.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen, first hand, the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
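As a minimal, hedged sketch of what a Kafka Streams application looks like when driven from Scala through the Java API (topic names, the application id, and the transformation are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.ValueMapper

object StreamsSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sketch")      // placeholder application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // placeholder broker
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  // Read from one topic, transform each value, write the result to another topic.
  val builder = new StreamsBuilder()
  builder.stream[String, String]("input-events")
    .mapValues(new ValueMapper[String, String] {
      override def apply(value: String): String = value.toUpperCase
    })
    .to("output-events")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```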
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
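As a hedged sketch of the configuration surface the document mentions, a few commonly tuned shuffle-related properties are shown below; the values are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.SparkConf

object ShuffleConfigSketch {
  // Hedged sketch of common shuffle-related settings; values are illustrative only.
  val conf: SparkConf = new SparkConf()
    .set("spark.shuffle.compress", "true")          // compress map output files
    .set("spark.shuffle.spill.compress", "true")    // compress data spilled during shuffle
    .set("spark.shuffle.service.enabled", "true")   // serve shuffle files via the external shuffle service
    .set("spark.shuffle.file.buffer", "64k")        // per-output-stream in-memory buffer
    .set("spark.reducer.maxSizeInFlight", "96m")    // map output fetched concurrently per reduce task
}
```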
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This document summarizes TenMax's data pipeline experience over the years from 2015 to 2017. It describes three versions of the data pipeline used to generate reports from raw event data. Version 1 used MongoDB but had poor write performance. Version 2 used Cassandra and had better write performance using LSM trees, but was costly to operate. Version 3 uses Kafka, Fluentd, Azure Blob storage and Spark to provide a scalable, cost-effective solution that can handle high throughput and complex aggregations. The document also discusses lessons learned around balancing features, costs and technologies like Spark, streaming and serverless models.
Getting Started with Gradle: A CLI Application Example @ JCConf 2014 (Chen-en Lu)
In the world of Java build systems, Ant dominated early on, was gradually replaced by Maven, and more recently more and more people are choosing Gradle. Because Gradle is simple yet powerful, and reuses the Maven Central Repository, it lets us manage our projects more easily. This talk uses a CLI (Command Line Interface) application as an example: creating a Gradle project, integrating third-party libraries, integrating mainstream IDEs, and finally generating a convenient shell script to run the application, all in one go. In just 15 minutes, we hope to give those who have not yet used Gradle a feel for its appeal.
This document provides an introduction to real-time bidding (RTB) and retargeting for online advertising. It discusses the roles of advertisers, agencies, publishers and ad networks. It explains how RTB works through ad exchanges and real-time auctions. It also covers how targeting and retargeting can be used to show ads to the most relevant audiences, such as through contextual, behavioral and retargeting techniques. The document outlines how cookie matching across different domains allows retargeting users across multiple sites through real-time bidding on ad exchanges.
DataFrame Operations (slide excerpt): the same aggregation expressed in SQL and with the Spark DataFrame API.
SQL:
SELECT year, region, SUM(people_total) AS people_total
FROM population
GROUP BY year, region
ORDER BY people_total DESC
Spark DataFrame: the slide's code was not recovered; a sketch of the equivalent query follows below.
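What follows is a hedged sketch of the equivalent Spark DataFrame query in Scala, assuming a table or view named population with year, region, and people_total columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, sum}

object DataFrameOpsSketch extends App {
  val spark = SparkSession.builder().appName("dataframe-ops-sketch").getOrCreate()

  // Assumed source: a registered table or view named "population".
  val population = spark.table("population")

  val result = population
    .groupBy("year", "region")
    .agg(sum("people_total").as("people_total"))
    .orderBy(desc("people_total"))

  result.show()
  spark.stop()
}
```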