Introduction to Spark
A big data processing tool built with Scala that runs on the JVM
ADB 2017
Yen Hao Huang
Big Data
- The 4 Vs
  - Volume / Variety / Velocity / Veracity
Due to the rise of Big Data, faster tools are required for processing data.
Hadoop
- A platform for storing and processing large-scale data
- Features
  - Scalable
  - Economical: runs on many cheap servers
  - Flexible: schema-less
  - Reliable: data is replicated
Hadoop MapReduce
- Map
  - Divide a job into many small tasks and distribute them to servers
- Reduce
  - Summarize the results from those servers
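The Map and Reduce steps above can be sketched with a word count in plain Python. This is only an illustration of the idea, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample text are assumptions, and each list element merely stands in for a task that would run on a separate server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: summarize each group into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    # Each chunk stands in for a small task sent to a different server.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


# Hypothetical input, split into three "tasks".
chunks = ["big data needs fast tools", "spark is fast", "hadoop stores big data"]
counts = word_count(chunks)
```

Here `counts` maps every word to the number of times it appears across all chunks.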
[Figure: Hadoop MapReduce (figure reference)]
Hadoop - Bottleneck
- File I/O: intermediate data is written to disk between iterations
[Figure: Read -> Iteration -> Write -> Read -> Iteration -> Write]
Spark
RDD
In-memory computation framework
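RDD (Resilient Distributed Dataset)
- Intermediate data is written to memory instead of disk
- Claimed to be 10 to 100 times faster than Hadoop MapReduce
[Figure: Read -> Iteration -> Memory write -> Memory read -> Iteration -> Write, with the RDD holding the intermediate data]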
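The difference between the two pipelines can be sketched by counting disk accesses in a toy iterative job. This is a minimal illustration in plain Python, not Hadoop or Spark code: `step`, `run_on_disk`, `run_in_memory`, and the file names are hypothetical, and the setup write that places the input on disk is deliberately not counted.

```python
import json
import os
import tempfile


def step(values):
    """One iteration of a hypothetical job: add 1 to every value."""
    return [v + 1 for v in values]


def run_on_disk(values, iterations, workdir):
    """Disk-based style: every iteration reads its input from disk and writes its result back."""
    path = os.path.join(workdir, "disk_job.json")
    with open(path, "w") as f:  # setup: place the input on disk (not counted)
        json.dump(values, f)
    disk_ops = 0
    current = values
    for _ in range(iterations):
        with open(path) as f:  # read the previous result
            current = json.load(f)
        disk_ops += 1
        current = step(current)
        with open(path, "w") as f:  # write the intermediate result
            json.dump(current, f)
        disk_ops += 1
    return current, disk_ops


def run_in_memory(values, iterations, workdir):
    """In-memory style: read once, keep intermediate results in memory, write once."""
    path = os.path.join(workdir, "memory_job.json")
    with open(path, "w") as f:  # setup: place the input on disk (not counted)
        json.dump(values, f)
    with open(path) as f:  # single read of the input
        current = json.load(f)
    disk_ops = 1
    for _ in range(iterations):
        current = step(current)  # intermediate data never leaves memory
    with open(path, "w") as f:  # single write of the final result
        json.dump(current, f)
    disk_ops += 1
    return current, disk_ops


with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_ops = run_on_disk([1, 2, 3], 3, workdir)
    mem_result, mem_ops = run_in_memory([1, 2, 3], 3, workdir)
```

Both runs produce the same result; the disk-based run touches the disk twice per iteration, while the in-memory run touches it twice in total.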
Spark
- Features
  - Speed
  - Ease of use: Scala, Python, Java, R
  - Supports Hadoop: HDFS, MapReduce
  - Accessibility: runs on many platforms
RDD Features
- Computations
  - Transformation: lazily evaluated
  - Action: executes the computations
- Persistence: keeps an RDD in RAM or on disk
[Figure: RDD -> Transformation -> RDD -> Action -> Output]
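The three features above can be sketched with a toy class in plain Python. `LazyDataset` is a hypothetical stand-in and not Spark's implementation; it only borrows the method names `map`, `persist`, and `collect`, which also exist on real RDDs. The helper `double` and the sample data are assumptions made for the example.

```python
class LazyDataset:
    """A toy stand-in for an RDD (hypothetical class, not Spark's API)."""

    def __init__(self, source, ops=()):
        self._source = list(source)  # the original data
        self._ops = tuple(ops)       # recorded transformations, not yet run
        self._persist = False        # whether to keep the computed result
        self._cache = None           # filled in by the first action if persisted

    def map(self, fn):
        # Transformation: only records fn and returns a new dataset.
        return LazyDataset(self._source, self._ops + (fn,))

    def persist(self):
        # Persistence: ask for the result to be kept after it is first computed.
        self._persist = True
        return self

    def collect(self):
        # Action: this is the point where the recorded computations run.
        if self._cache is not None:
            return list(self._cache)
        result = list(self._source)
        for fn in self._ops:
            result = [fn(x) for x in result]
        if self._persist:
            self._cache = result
        return list(result)


calls = []  # records every time the transformation function actually runs


def double(x):
    calls.append(x)
    return x * 2


dataset = LazyDataset([1, 2, 3]).map(double).persist()
calls_before_action = len(calls)  # nothing has run yet
first = dataset.collect()         # runs the transformation
second = dataset.collect()        # served from the persisted result
calls_after_actions = len(calls)
```

In this sketch the transformation function is not called until `collect()` runs, and the second `collect()` reuses the persisted result instead of recomputing it.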
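RDD Lineage
- Error fixing: a lost RDD is rebuilt by replaying the transformations recorded in its lineage
[Figure: RDD1 [ 2, 3 ] -> Transformation f(x) = x^2 + 1 -> RDD2 [ ?, ? ] -> Action -> [ 7, 10 ]]
[Figure: two numbered steps between RDD2 and RDD1, labeled "Fix!", ending with [ 7, 10 ]]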
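Lineage-based recovery can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not Spark's mechanism: `LineageDataset`, `lose`, and `get` are hypothetical names, the data and the `times_ten` function are made up for the example, and `map` here computes eagerly so that the loss and the recomputation are easy to see.

```python
class LineageDataset:
    """Toy dataset that remembers its parent and the function that produced it."""

    def __init__(self, data=None, parent=None, fn=None):
        self.data = data      # materialized values, or None if they were lost
        self.parent = parent  # lineage: the dataset this one was derived from
        self.fn = fn          # lineage: the transformation that was applied

    def map(self, fn):
        # Compute the child now and record how it was derived.
        child_data = [fn(x) for x in self.get()]
        return LineageDataset(child_data, parent=self, fn=fn)

    def lose(self):
        """Simulate a failure that drops the materialized data."""
        self.data = None

    def get(self):
        if self.data is None:
            if self.parent is None:
                raise RuntimeError("source data lost; nothing to recompute from")
            # Recovery: replay the recorded transformation on the parent.
            self.data = [self.fn(x) for x in self.parent.get()]
        return self.data


def times_ten(x):
    return x * 10


rdd1 = LineageDataset([1, 2, 3])
rdd2 = rdd1.map(times_ten)
before = list(rdd2.get())

rdd2.lose()                   # step 1: RDD2's data is gone
was_lost = rdd2.data is None
recovered = list(rdd2.get())  # step 2: rebuilt from RDD1 through the lineage
```

Because only the parent and the function are needed to rebuild the child, no copy of the child's data has to be stored for recovery in this sketch.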
Spark Functionality