This document provides guidance on learning Apache Spark at introductory, basic, intermediate, and expert levels. The introduction explains what Spark is and takes 1-2 hours. The basic level covers Spark concepts and simple programming exercises and takes 1-2 days. The intermediate level covers the component guides, operational aspects, and hands-on experimentation, and takes 5-15 days or more. The expert (advanced) level involves reading the original academic papers, enhancing Scala skills, browsing Spark conference proceedings, reviewing recommended books, and examining Spark's codebase. The goal at that level is to know Spark in depth even without contributing to the project itself, which can take weeks to months.
Apache Spark: Coming up to speed
1. Coming up to speed on Spark
Please send any comments to:
Adarsh Pannu
adarshrp@us.ibm.com
2. Intro: What is Spark? How does it relate to Hadoop? When would you use it? (1-2 hours)
Basic: Understand basic technology and write simple programs. (1-2 days)
Intermediate: Start enabling customers in the field, hand-holding them through problems and issues. (5-15 days and more)
Expert: Know Spark inside out even if you don’t intend to contribute to the project itself. (Weeks to months)
3. Intro Spark
Go through these presentations to understand the value of Spark. These speakers also
attempt to differentiate Spark from Hadoop, and enumerate its comparative strengths. (Not
much code here)
- Turning Data into Value, Ion Stoica, Spark Summit 2013, Video & Slides, 25 mins
- An Overview of Apache Spark, Jim Scott, Video, Slides, 1 hr 06 mins
- How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei Zaharia, Video & Slides, 12 mins
- Spark Fundamentals I (Lesson 1 only), Big Data University, <20 mins
4. Basic Spark
- Pick up some Scala through this article co-authored by Scala’s creator, Martin Odersky. Link
Estimated time: 2 hours
5. Basic Spark (contd.)
- Do these two courses. They cover Spark basics and include a certification. You can use the supplied Docker images for all other labs.
Estimated time: 7 hours
6. Basic Spark (contd.)
- Go to spark.apache.org and study the Overview and the Spark Programming Guide. Many online courses borrow liberally from this material. Information on this site is updated with every new Spark release.
Estimated time: 7-8 hours
7. Intermediate Spark
- Stay at spark.apache.org. Go through the component-specific Programming Guides as well as the sections on Deploying and More. Browse the Spark API as needed.
Estimated time: 3-5 days and more
8. Intermediate Spark (contd.)
Learn about the operational aspects of Spark:
- Advanced Apache Spark (DevOps), Video, Slides, 6 hours (EXCELLENT!)
- Tuning and Debugging Spark, Video, Slides, 48 mins
- How-to: Tune Your Apache Spark Jobs, Link, ~1 hour
- (Tons of other presentations, to be listed later)
Gain a high-level understanding of Spark architecture:
- Introduction to AmpLab Spark Internals, Matei Zaharia (Databricks), Video, 1 hr 15 mins
- A Deeper Understanding of Spark Internals, Aaron Davidson (Databricks), Video, PDF, 44 mins
9. Intermediate Spark (contd.)
Experiment, experiment, experiment ... “Play the role of the customer”
- Set up your personal 3-4 node cluster
- Download some “open” data, e.g. the “airline” data on stat-computing.org/dataexpo/2009/
- Write some code, make it run, see how it performs, tune it, troubleshoot it
- Experiment with different deployment modes (Standalone + YARN)
- Play with different configuration knobs, check out dashboards, etc.
- Explore all subcomponents (especially Core, SQL, MLlib)
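
As a starting point for the "write some code" step above, here is a minimal PySpark sketch that computes the average arrival delay per carrier from one of the airline CSV files. It is only an illustration, not part of the original deck. The column positions (UniqueCarrier at index 8, ArrDelay at index 14), the "NA" marker for missing values, and the names parse_delay and airline_delays.py are assumptions; check them against the header row of the file you actually download.

```python
import sys

# Assumed 0-based column positions in the Data Expo 2009 airline CSVs.
# Verify them against the header row of your file before trusting results.
CARRIER_COL = 8     # UniqueCarrier
ARR_DELAY_COL = 14  # ArrDelay in minutes; missing values assumed to be "NA"


def parse_delay(line):
    """Return (carrier, (delay, 1)) for a usable row, or None to skip it."""
    fields = line.split(",")
    if len(fields) <= ARR_DELAY_COL:
        return None
    try:
        delay = float(fields[ARR_DELAY_COL])
    except ValueError:
        # The header row ("ArrDelay") and missing values ("NA") land here.
        return None
    return fields[CARRIER_COL], (delay, 1)


def main(path):
    # Imported lazily so parse_delay can be unit-tested without Spark installed.
    from pyspark.sql import SparkSession

    # No master is set here; pass it via spark-submit to try deployment modes.
    spark = SparkSession.builder.appName("airline-delays").getOrCreate()
    rows = (spark.sparkContext.textFile(path)
            .map(parse_delay)
            .filter(lambda r: r is not None))
    # Sum delays and counts per carrier, then divide to get the mean.
    totals = rows.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    averages = totals.mapValues(lambda t: t[0] / t[1])
    for carrier, avg in sorted(averages.collect()):
        print(f"{carrier}\t{avg:.1f}")
    spark.stop()


if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print("usage: spark-submit airline_delays.py <path-to-csv>", file=sys.stderr)
```

Because the script does not hard-code a master, the same file can be submitted with `spark-submit --master local[4] airline_delays.py 2008.csv` on a laptop and with `--master yarn` or `--master spark://<host>:7077` on a cluster, which makes it easy to compare deployment modes and watch the job in the web UI.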
10. Advanced Spark: Original Papers
Read the original academic papers:
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia et al.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Matei Zaharia et al.
- GraphX: A Resilient Distributed Graph System on Spark, Reynold S. Xin et al.
- Spark SQL: Relational Data Processing in Spark, Michael Armbrust et al.
11. Advanced Spark: Enhance your Scala skills
This book by Odersky is arduously long and isn’t meant to give you a quick start.
- Use this as your primary Scala text
- Excellent MOOC by Odersky. Some of the material is meant for CS majors. Highly recommended for STC developers. 35+ hours
12. Advanced Spark: Browse Conference Proceedings
Spark Summits cover technology and use cases. Technology is also covered in various other places, so you could consider skipping those tracks. Don’t forget to check out the customer stories. That is how we learn about enablement opportunities and challenges, and in some cases, we can see through the Spark hype :)
100+ hours of FREE videos and associated PDFs are available on spark-summit.org. You don’t even have to pay the conference fee! Go back in time and “attend” these conferences!
We can produce a smaller “watch list” of important videos and publish that internally.
13. Advanced Spark: Browse YouTube Videos
YouTube is full of training videos, some good, others not so much. These are the only channels you need to watch though. There is a lot of repetition in the material, and some of the videos are from the conferences mentioned earlier.
14. Advanced Spark: Check out these books
- Provides a good overview of Spark, but much of the material is also available through other sources previously mentioned. Could be skipped.
- Covers concrete statistical analysis / machine learning use cases. Covers Spark APIs and MLlib. Highly recommended for data scientists.
15. Advanced Spark: Yes ... read the code
Even if you don’t intend to contribute to Spark, there are a ton of valuable comments in the code that provide insights into Spark’s design. Don’t be shy! Go to github.com/apache/spark and check it out.