This document provides an overview of Apache Spark, a fast and general engine for large-scale data processing. It discusses how Spark can be used to query and summarize data stored in different data sources like MongoDB, MySQL, and Redis in a single Spark job. The document then demonstrates a Spark job that retrieves weather station data from MongoDB and MySQL, aggregates it, stores the results in Redis, and retrieves the top 10 results.
One Tool to Rule Them All- Seamless SQL on MongoDB, MySQL and Redis with Apache Spark
1. Tim Vaillancourt
Sr. Technical Operations Architect
One Tool to Rule Them All: Seamless SQL on
MongoDB, MySQL and Redis with Apache Spark
3. About Me
● Joined Percona in January 2016
● Sr. Technical Operations Architect for MongoDB
● Previous:
  ● EA DICE (MySQL DBA)
  ● EA SPORTS (Sys/NoSQL DBA Ops)
  ● Amazon/AbeBooks Inc (Sys/MySQL+NoSQL DBA Ops)
● Main techs: MySQL, MongoDB, Cassandra, Solr, Redis, queues, etc.
● 10+ years tuning Linux for database workloads (off and on)
● NOT an Apache Spark expert
4. Apache Spark
● "…is a fast and general engine for large-scale data processing"
● Written in Scala, utilises the Akka framework and runs on the JVM
● Supports jobs written in Java, Scala, Python and R
● Pluggable datasources for various file types, databases, SaaS APIs, etc.
● Fast and efficient: jobs work on datasources quickly and in parallel
● Optional clustering
  ● Master/Slave Spark cluster, Zookeeper for Master elections
  ● Slave workers connect to the Master
  ● Master distributes tasks evenly to available workers
● Streaming and Machine Learning (MLlib) capabilities
● Programmatic and SQL(!) querying capabilities
5. Apache Spark: Software Architecture
● Jobs go to the Cluster Master, or run directly in the client JVM
● The Cluster Master directs jobs to nodes with available resources via messages
● Cluster Master HA
  ● Slaves reconnect to the Master
  ● Apache Zookeeper for true HA
6. Apache Spark: Hadoop Comparison
● Hadoop MapReduce
  ● Batch-based
  ● Uses data less efficiently
  ● Relatively hard to develop/maintain
● Spark
  ● Stream processing
  ● Fast, with high parallelism
  ● Keeps data in memory as much as possible within jobs
  ● Divides work into many lightweight sub-tasks run in threads
  ● Datasources
    ● Uses datasource awareness to scale (e.g. indices, shard awareness, etc.)
    ● Spark allows processing and storage to scale separately
7. Apache Spark: RDDs and DataFrames
● RDD: Resilient Distributed Dataset
  ● Original API for accessing data in Spark
  ● Lazy: does not access data until a real action is performed
  ● Spark's optimiser cannot see inside
  ● RDDs are slow in Python
● DataFrames API
  ● Higher-level API, focused on "what" is being done rather than "how"
  ● Has schemas / table-like
  ● Interchangeable programmatic and SQL APIs
  ● Much easier to read and comprehend
  ● Optimised execution plan
8. Apache Spark: Datasources
● Provides a pluggable mechanism for accessing structured data through Spark SQL
● At least these databases are supported in some way:
  ● MySQL
  ● MongoDB
  ● Redis
  ● Cassandra
  ● Postgres
  ● HBase
  ● HDFS
  ● File
  ● S3
● In practice: search GitHub, find the .jar file, deploy it!
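Deploying a connector jar typically means passing it to `spark-submit`; the jar names and master URL below are illustrative placeholders, not exact artifact coordinates:

```shell
# Illustrative: ship datasource connector jars alongside a PySpark job.
# Substitute the connector builds you actually downloaded.
spark-submit \
  --master spark://spark-master:7077 \
  --jars mongo-spark-connector.jar,mysql-connector-java.jar,spark-redis.jar \
  weather_job.py
```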
9. Apache Spark: SQLContext
● ANSI SQL
  ● 30+ year old language
  ● Easy to understand
  ● Everyone usually knows it
● Spark SQLContext
  ● A Spark module for structured data processing, wrapping the RDD API
  ● Uses the same execution engine as the programmatic APIs
  ● Supports:
    ● JOINs/unions
    ● EXPLAINs
    ● Subqueries
    ● ORDER/GROUP/SORT BYs
  ● Most datatypes you'd expect
10. Apache Spark: Use Cases
● Business Intelligence/Analytics
  ● Understand your data
  ● Tip: use dedicated replicas for expensive queries!
● Data Summaries and Batch Jobs
  ● Perform expensive summaries in the background, save the result
  ● Tip: use burstable/cloud hardware for infrequent batch jobs
● Real-time Stream Processing
  ● Process data as it enters your system
11. So why not Apache Drill?
● A schema-free SQL engine for Hadoop, NoSQL and Cloud Storage
● Drill does not support / work with relational databases (MySQL) or Redis
● No programmatic-level querying
● No streaming/continuous query functionality
● I don't know much about it
12. The Demo
● Scenario: You run a weather station data app that stores data in both an RDBMS and a document store
● Goal: summarise weather station data stored in an RDBMS and a document store
  ● Min Water Temperature
  ● Avg Water Temperature
  ● Max Water Temperature
  ● Total Sample Count
  ● Get Top-10 (based on avg water temp)
13. The Demo
● RDBMS: Percona Server for MySQL 5.7
  ● Stores the weather station metadata (roughly 350 stations: ID, name, location, etc.)
● Document store: Percona Server for MongoDB 3.2
  ● Stores the weather time-series sample data (roughly 80,000 samples: various weather readings from stations)
● In-memory K/V store: Redis 2.8
  ● Stores summarised Top-10 data for fast querying of min, avg, max temperature and total sample counts
14. The Demo
● Apache Spark 1.6.2 cluster
  ● 1 x Master
  ● 2 x Worker/Slaves
  ● 1 x PySpark job
● 1 x MacBook Pro
● 3 x VirtualBox VMs
● Job submitted on the Master