Coming up to speed on Spark
Please send any comments to:
Adarsh Pannu
adarshrp@us.ibm.com
Intro: What is Spark? How does it relate to Hadoop? When would you use it? (1-2 hours)
Basic: Understand basic technology and write simple programs. (1-2 days)
Intermediate: Start enabling customers in the field, hand-holding them through problems and issues. (5-15 days and more)
Expert: Know Spark inside out even if you don’t intend to contribute to the project itself. (Weeks to months)
Intro Spark
Go through these presentations to understand the value of Spark. These speakers also
attempt to differentiate Spark from Hadoop, and enumerate its comparative strengths. (Not
much code here)
• Turning Data into Value, Ion Stoica, Spark Summit 2013, Video & Slides, 25 mins
• An Overview of Apache Spark, Jim Scott, Video & Slides, 1 hr 06 mins
• How Companies are Using Spark, and Where the Edge in Big Data Will Be, Matei Zaharia, Video & Slides, 12 mins
• Spark Fundamentals I (Lesson 1 only), Big Data University, <20 mins
Basic Spark
• Pick up some Scala through this article co-authored by Scala’s creator, Martin Odersky. Link
Estimated time: 2 hours
Basic Spark (contd.)
• Do these two courses. They cover Spark basics and include a certification. You can use the supplied Docker images for all other labs.
Estimated time: 7 hours
Basic Spark (contd.)
• Go to spark.apache.org and study the Overview and the Spark Programming Guide. Many online courses borrow liberally from this material. Information on this site is updated with every new Spark release.
Estimated time: 7-8 hours
Intermediate Spark
• Stay at spark.apache.org. Go through the component-specific Programming Guides as well as the sections on Deploying and More. Browse the Spark API as needed.
Estimated time: 3-5 days and more
Intermediate Spark (contd.)
Learn about the operational aspects of Spark:
• Advanced Apache Spark (DevOps), Video & Slides, 6 hours (EXCELLENT!)
• Tuning and Debugging Spark, Video & Slides, 48 mins
• How-to: Tune Your Apache Spark Jobs, Link, ~1 hour
• (Tons of other presentations, to be listed later)
Gain a high-level understanding of Spark architecture:
• Introduction to AMPLab Spark Internals, Matei Zaharia (Databricks), Video, 1 hr 15 mins
• A Deeper Understanding of Spark Internals, Aaron Davidson (Databricks), Video & PDF, 44 mins
Intermediate Spark (contd.)
Experiment, experiment, experiment ... “Play the role of the customer”
• Set up your personal 3-4 node cluster
• Download some “open” data, e.g. the “airline” data on stat-computing.org/dataexpo/2009/
• Write some code, make it run, see how it performs, tune it, troubleshoot it
• Experiment with different deployment modes (Standalone + YARN)
• Play with different configuration knobs, check out dashboards, etc.
• Explore all subcomponents (especially Core, SQL, MLlib)
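As a possible starting point for the "write some code" step, here is a minimal PySpark sketch that computes the average arrival delay per carrier from one of the airline CSV files. It is an illustration, not part of the original deck. The column positions (UniqueCarrier at index 8, ArrDelay at index 14) are assumptions about the Data Expo 2009 layout and should be checked against the header row of the file you download. The application name is a hypothetical placeholder.

```python
import sys

# Assumed 0-based column positions in the Data Expo 2009 CSV files;
# verify them against the header row of the file you actually download.
CARRIER_COL = 8      # UniqueCarrier
ARR_DELAY_COL = 14   # ArrDelay in minutes ("NA" when missing)


def parse_delay(line):
    """Return (carrier, (delay, 1)) for a usable row, else None.

    The header row and rows with a missing delay ("NA") fail the float
    conversion and are dropped.
    """
    fields = line.split(",")
    if len(fields) <= ARR_DELAY_COL:
        return None
    try:
        delay = float(fields[ARR_DELAY_COL])
    except ValueError:
        return None
    return fields[CARRIER_COL], (delay, 1)


def main(path):
    # Imported here so parse_delay can be unit-tested without Spark installed.
    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("AirlineDelays"))
    try:
        pairs = sc.textFile(path).map(parse_delay).filter(lambda kv: kv is not None)
        # Sum delays and counts per carrier, then divide to get the mean.
        sums = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        averages = sums.mapValues(lambda s: s[0] / s[1])
        for carrier, avg in sorted(averages.collect()):
            print(carrier, round(avg, 2))
    finally:
        sc.stop()


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a name of your choosing, the script can be launched with spark-submit, and the `--master` option lets you rerun the same job against a Standalone master or YARN to compare deployment modes. Keeping the per-record logic in a plain function makes it easy to test before running it on the cluster.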
Advanced Spark: Original Papers
Read the original academic papers
• Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, Matei Zaharia, et al.
• Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, Matei Zaharia, et al.
• GraphX: A Resilient Distributed Graph System on Spark, Reynold S. Xin, et al.
• Spark SQL: Relational Data Processing in Spark, Michael Armbrust, et al.
Advanced Spark: Enhance your Scala skills
• This book by Odersky is arduously long and isn’t meant to give you a quick start. Use this as your primary Scala text.
• Excellent MOOC by Odersky. Some of the material is meant for CS majors. Highly recommended for STC developers. 35+ hours
Advanced Spark: Browse Conference Proceedings
Spark Summits cover technology and use cases. Technology is also covered in various other places, so
you could consider skipping those tracks. Don’t forget to check out the customer stories. That is how we
learn about enablement opportunities and challenges, and in some cases, we can see through the
Spark hype.
100+ hours of FREE videos and associated PDFs are available on spark-summit.org. You don’t even have
to pay the conference fee! Go back in time and “attend” these conferences!
We can produce a smaller “watch list” of important videos and publish that internally.
Advanced Spark: Browse YouTube Videos
YouTube is full of training videos, some good, others not so much. These are the
only channels you need to watch, though. There is a lot of repetition in the
material, and some of the videos are from the conferences mentioned earlier.
Advanced Spark: Check out these books
• Provides a good overview of Spark, but much of the material is also available through other sources previously mentioned. Could be skipped.
• Covers concrete statistical analysis / machine learning use cases. Covers Spark APIs and MLlib. Highly recommended for data scientists.
Advanced Spark: Yes ... read the code
Even if you don’t intend to contribute to Spark, there are a ton of valuable comments in the code that
provide insights into Spark’s design. Don’t be shy! Go to github.com/apache/spark and check it out.
