Spark is a big data processing tool built with Scala that runs on the Java Virtual Machine (JVM). It is up to 100 times faster than Hadoop for iterative jobs because it keeps intermediate data in memory rather than writing to disk. Spark uses Resilient Distributed Datasets (RDDs) that allow data to be kept in memory across transformations and actions. RDDs also maintain lineage graphs to allow recovery from failures.
1. A big data processing tool built with Scala that runs on the JVM
Introduce to
Spark
ADB 2017
Yen Hao Huang
2. Big Data
The 4 Vs: Volume, Variety, Velocity, Veracity
The rise of Big Data demands faster tools for processing data.
4. Hadoop
A platform to store and process large scale data
Features
Scalable
Economical: built from many cheap commodity servers
Flexible: schema-less
Reliable: data kept in replicas
5. Hadoop MapReduce
Map
- Divide a job into many small tasks and distribute them to
servers
Reduce
- Summarize the results from those servers
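The map/reduce split above can be sketched as a local simulation in plain Python (this runs on one machine, not Hadoop; the function names are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: split each document into (word, 1) pairs, as a mapper task would."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per word, summarizing the mappers' output."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["spark runs on jvm", "spark is fast"]
print(reduce_phase(map_phase(docs)))  # "spark" is counted twice
```

In real Hadoop the map tasks run on many servers and a shuffle groups the pairs by key before the reducers summarize them; here both phases run in one process.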
10. RDD (Resilient Distributed Dataset)
Writes intermediate data to memory instead of disk
10-100 times faster than Hadoop for iterative jobs
[Diagram: with RDDs, each iteration reads and writes intermediate data in memory rather than on disk]
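Why keeping intermediate data in memory speeds up iteration can be sketched with a toy RDD-like class (a local analogy, not Spark's actual implementation; `MiniRDD` and `read_from_disk` are made-up names):

```python
class MiniRDD:
    """Toy stand-in for an RDD: remembers its lineage (how to recompute
    the data) and can cache the result in memory for reuse."""
    def __init__(self, compute):
        self._compute = compute  # lineage: a function that rebuilds the data
        self._cache = None

    def persist(self):
        self._cache = list(self._compute())  # keep intermediate data in memory
        return self

    def collect(self):
        # served from memory if persisted, otherwise recomputed from lineage
        return self._cache if self._cache is not None else list(self._compute())

loads = []
def read_from_disk():
    loads.append(1)            # count how often we touch "disk"
    return [1, 2, 3]

rdd = MiniRDD(read_from_disk).persist()
for _ in range(3):             # iterative job: three passes over the data
    rdd.collect()
print(len(loads))              # prints 1: read once, then reused from memory
```

The lineage function is also what makes recovery possible: if the cached copy is lost, the data can be rebuilt by rerunning the recorded computation.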
11. Spark
Features
Speed
Ease of use: APIs in Scala, Python, Java, and R
Supports Hadoop: HDFS, MapReduce
Accessibility: runs on many platforms
12. RDD Features
Computations
Transformation - lazily defines a new RDD; nothing is computed yet
Action - executes the accumulated computations and returns a result
Persistence - keeps an RDD in RAM or on disk
[Diagram: RDD -> Transformation -> RDD -> Action -> Output]
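The transformation/action split can be mimicked with a Python generator (a local analogy for the lazy-evaluation idea, not Spark's API):

```python
log = []

def double_all(data):
    """Lazy like an RDD transformation: no work happens until consumed."""
    for x in data:
        log.append(x)        # records when computation actually runs
        yield x * 2

pipeline = double_all([1, 2, 3])  # building the pipeline does no work
assert log == []                   # still lazy: nothing computed yet
result = list(pipeline)            # the "action": forces execution
print(result)                      # prints [2, 4, 6]; log is now [1, 2, 3]
```

As in Spark, chaining transformations only builds a plan; only an action (here, `list()`, analogous to `collect()`) triggers the computation.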