Spark is a big data processing tool built with Scala that runs on the Java Virtual Machine (JVM). It is up to 100 times faster than Hadoop for iterative jobs because it keeps intermediate data in memory rather than writing to disk. Spark uses Resilient Distributed Datasets (RDDs) that allow data to be kept in memory across transformations and actions. RDDs also maintain lineage graphs to allow recovery from failures.
1. A big data processing tool built with Scala that runs on the JVM
Introduce to
Spark
ADB 2017
Yen Hao Huang
2. Big Data
The 4 Vs: Volume, Variety, Velocity, Veracity
The rise of Big Data demands faster tools for processing data.
4. Hadoop
A platform to store and process large scale data
Features
Scalable
Economical: built from many cheap commodity servers
Flexible: schema-less
Reliable: data kept in replicas
5. Hadoop MapReduce
Map
- Divide a job into many small tasks and distribute them to
servers
Reduce
- Summarize the results from those servers
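The map/reduce split above can be sketched as a local simulation in plain Python (this runs on one machine, not Hadoop; the function names are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: split each document into (word, 1) pairs, as a mapper task would."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per word, summarizing the mappers' output."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["spark runs on jvm", "spark is fast"]
print(reduce_phase(map_phase(docs)))  # "spark" is counted twice
```

In real Hadoop the map tasks run on many servers and a shuffle groups the pairs by key before the reducers summarize them; here both phases run in one process.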
10. RDD (Resilient Distributed Dataset)
Writes intermediate data to memory instead of disk
10-100 times faster than Hadoop for iterative jobs
[Diagram: with RDDs, each iteration reads and writes intermediate data in memory rather than on disk]
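Why keeping intermediate data in memory speeds up iteration can be sketched with a toy RDD-like class (a local analogy, not Spark's actual implementation; `MiniRDD` and `read_from_disk` are made-up names):

```python
class MiniRDD:
    """Toy stand-in for an RDD: remembers its lineage (how to recompute
    the data) and can cache the result in memory for reuse."""
    def __init__(self, compute):
        self._compute = compute  # lineage: a function that rebuilds the data
        self._cache = None

    def persist(self):
        self._cache = list(self._compute())  # keep intermediate data in memory
        return self

    def collect(self):
        # served from memory if persisted, otherwise recomputed from lineage
        return self._cache if self._cache is not None else list(self._compute())

loads = []
def read_from_disk():
    loads.append(1)            # count how often we touch "disk"
    return [1, 2, 3]

rdd = MiniRDD(read_from_disk).persist()
for _ in range(3):             # iterative job: three passes over the data
    rdd.collect()
print(len(loads))              # prints 1: read once, then reused from memory
```

The lineage function is also what makes recovery possible: if the cached copy is lost, the data can be rebuilt by rerunning the recorded computation.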
11. Spark
Features
Speed
Ease of use: APIs in Scala, Python, Java, and R
Supports Hadoop: HDFS, MapReduce
Accessibility: runs on many platforms
12. RDD Features
Computations
Transformation - lazily defines a new RDD; nothing is computed yet
Action - executes the accumulated computations and returns a result
Persistence - keeps an RDD in RAM or on disk
[Diagram: RDD -> Transformation -> RDD -> Action -> Output]
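The transformation/action split can be mimicked with a Python generator (a local analogy for the lazy-evaluation idea, not Spark's API):

```python
log = []

def double_all(data):
    """Lazy like an RDD transformation: no work happens until consumed."""
    for x in data:
        log.append(x)        # records when computation actually runs
        yield x * 2

pipeline = double_all([1, 2, 3])  # building the pipeline does no work
assert log == []                   # still lazy: nothing computed yet
result = list(pipeline)            # the "action": forces execution
print(result)                      # prints [2, 4, 6]; log is now [1, 2, 3]
```

As in Spark, chaining transformations only builds a plan; only an action (here, `list()`, analogous to `collect()`) triggers the computation.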