FPGA-Based Acceleration Architecture for Spark SQL (Qi Xie and Quanfu Wang, Spark Summit)
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA's highly parallel computing capability to accelerate Spark SQL queries, and because FPGAs have higher power efficiency than CPUs, it lowers power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based Engine Units that perform basic computations such as substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, and according to its pattern each is fed into an Engine Unit. SQL Engine Units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is ultimately transformed into a hardware pipeline. We will present performance benchmark results comparing queries run on the FPGA-based Spark SQL acceleration architecture (Xeon E5 plus FPGA) against Spark SQL queries on the Xeon E5 alone, showing 10x ~ 100x improvement, and we will demonstrate one SQL query workload from a real customer.
Kafka is an open-source distributed commit log service that provides high-throughput messaging functionality. It is designed to handle large volumes of data and different use cases, such as online and offline processing, more efficiently than alternatives like RabbitMQ. Kafka works by splitting topics into partitions spread across a cluster of machines and replicating each partition for fault tolerance. It can be used as a central data hub or pipeline for collecting, transforming, and streaming data between systems and applications.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm that can be used in Apache Spark in various ways. This talk briefly summarizes how Apache Spark has evolved in this area, along with four main use cases, their benefits, and the next steps:
1) Zstandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It is beneficial not only when you use `emptyDir` with the `memory` medium, but also because it maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of Zstandard's buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area where you can save storage cost on cloud storage like S3 and improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression.
4) Last, but not least, since Apache Spark 3.0, Zstandard is used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work under way to utilize Zstandard to improve Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark's Avro file format in Spark 3.2.0. A configuration sketch covering the use cases above is shown below.
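As a rough illustration of the four use cases above, here is a minimal configuration sketch in Scala. The property names are the standard Spark settings for these codecs, but availability depends on your Spark, ORC, and Parquet versions, and the demo path and local master are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ZstdEverywhere extends App {
  // Hedged sketch: enabling Zstandard in several places at once.
  // Verify each property against the Spark / ORC / Parquet versions you run.
  val spark = SparkSession.builder()
    .appName("zstd-sketch")
    .master("local[*]")                                     // assumption: local demo run
    .config("spark.io.compression.codec", "zstd")           // 1) shuffle and spill files
    .config("spark.eventLog.compression.codec", "zstd")     // 2) event logs (takes effect when event logging is enabled)
    .config("spark.sql.orc.compression.codec", "zstd")      // 3) ORC data files
    .config("spark.sql.parquet.compression.codec", "zstd")  //    and Parquet data files
    .getOrCreate()

  // Example write: the resulting Parquet files are zstd-compressed.
  spark.range(1000000).toDF("id").write.mode("overwrite").parquet("/tmp/zstd_demo")
  spark.stop()
}
```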
Youtube Link: https://youtu.be/CXTiwkZVoZI
( Microservices Architecture Training: https://www.edureka.co/microservices-architecture-training )
This Edureka PPT on Spring Boot Interview Questions covers the top questions asked about Spring Boot.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
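To make the key/no-key behaviour concrete, here is a minimal, hedged sketch using the standard Kafka producer API from Scala; the topic name, key, and broker address are placeholders.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PostalServiceDemo extends App {
  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

  val producer = new KafkaProducer[String, String](props)

  // With a key: records with the same key always land in the same partition,
  // so the consumer of that partition sees them in order.
  producer.send(new ProducerRecord[String, String]("letters", "alice", "parcel-1"))

  // Without a key: records are spread across partitions, so different consumers
  // in a group may each receive some of them.
  producer.send(new ProducerRecord[String, String]("letters", "parcel-2"))

  producer.close()
}
```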
Celery is an asynchronous task queue/job queue based on distributed message passing. It allows tasks to be executed concurrently on one or more worker servers to minimize request times and offload intensive processes. Some key benefits are improved user experience through faster responses, scalability by adding more workers as needed, and flexibility through many customization points. Celery uses message brokers like RabbitMQ to handle task routing and can integrate with databases, caching, and other services.
In this presentation we describe the design and implementation of Kafka Connect, Kafka’s new tool for scalable, fault-tolerant data import and export. First we’ll discuss some existing tools in the space and why they fall short when applied to data integration at large scale. Next, we will explore Kafka Connect’s design and how it compares to systems with similar goals, discussing key design decisions that trade off between ease of use for connector developers, operational complexity, and reuse of existing connectors. Finally, we’ll discuss how standardizing on Kafka Connect can ultimately lead to simplifying your entire data pipeline, making ETL into your data warehouse and enabling stream processing applications as simple as adding another Kafka connector.
Best Practices for Enabling Speculative Execution on Large Scale Platforms (Databricks)
Apache Spark has a 'speculative execution' feature to handle slow tasks in a stage caused by environment issues such as a slow network or slow disks. If a task in a stage is running slowly, the Spark driver can launch a speculation task for it on a different host. Between the regular task and its speculation task, Spark later takes the result from whichever completes first and kills the slower one.
When we first enabled the speculation feature for all Spark applications by default on a large cluster of 10K+ nodes at LinkedIn, we observed that the default values of Spark's speculation configuration parameters did not work well for LinkedIn's batch jobs. For example, the system launched too many fruitless speculation tasks (i.e., tasks that were killed later). Besides, the speculation tasks did not help shorten the shuffle stages. In order to reduce the number of fruitless speculation tasks, we investigated the root causes, enhanced the Spark engine, and tuned the speculation parameters carefully. We analyzed the number of speculation tasks launched, the number of fruitful versus fruitless speculation tasks, and their corresponding CPU-memory resource consumption in gigabyte-hours. We were able to reduce average job response times by 13%, decrease the standard deviation of job elapsed times by 40%, and lower total resource consumption by 24% in a heavily utilized multi-tenant environment on a large cluster. In this talk, we will share our experience enabling speculative execution to achieve a good reduction in job elapsed time while keeping overhead minimal.
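The talk does not publish LinkedIn's final values, but as a hedged sketch these are the standard Spark properties involved in tuning speculation; the numbers are illustrative assumptions, not the speakers' recommendations.

```scala
import org.apache.spark.SparkConf

object SpeculationTuningSketch {
  // Hedged sketch of the speculation knobs; values are illustrative only.
  val conf: SparkConf = new SparkConf()
    .set("spark.speculation", "true")             // enable speculative execution
    .set("spark.speculation.interval", "1s")      // how often the driver checks for slow tasks
    .set("spark.speculation.multiplier", "3")     // a task must be this many times slower than the median
    .set("spark.speculation.quantile", "0.9")     // fraction of tasks that must finish before speculating
}
```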
The document discusses Scala, a programming language designed to be scalable. It can be used for both small and large programs. Scala combines object-oriented and functional programming. It interoperates seamlessly with Java but allows developers to add new constructs like actors through libraries. The Scala community is growing, with industrial adoption starting at companies like Twitter.
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster. It also uses WholeStageCodegen to improve performance through Java JIT-compiled code. However, the Java JIT is usually not very good at utilizing the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
Improving Apache Spark for Dynamic Allocation and Spot Instances (Databricks)
This presentation will explore the new work in Spark 3.1 adding the concept of graceful decommissioning and how we can use this to improve Spark’s performance in both dynamic allocation and spot/preemptable instances. Together we’ll explore how Spark’s dynamic allocation has evolved over time, and why the different changes have been needed. We’ll also look at the multi-company collaboration that resulted in being able to deliver this feature and I’ll end with encouraging pointers on how to get more involved in Spark’s development.
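As a hedged sketch, these are the kinds of properties involved when combining dynamic allocation with the Spark 3.1 decommissioning work; exact names and defaults should be checked against your release.

```scala
import org.apache.spark.SparkConf

object DecommissionSketch {
  // Hedged sketch; verify each property against a Spark 3.1+ release.
  val conf: SparkConf = new SparkConf()
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.dynamicAllocation.shuffleTracking.enabled", "true")   // dynamic allocation without an external shuffle service
    .set("spark.decommission.enabled", "true")                        // graceful decommissioning for spot/preemptible nodes
    .set("spark.storage.decommission.enabled", "true")                // migrate blocks off a decommissioning executor
    .set("spark.storage.decommission.shuffleBlocks.enabled", "true")  // include shuffle blocks in the migration
}
```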
The document provides an introduction to shell scripting basics in UNIX/Linux systems. It discusses what a shell and shell script are, introduces common shells like bash, and covers basic shell scripting concepts like running commands, variables, conditionals, loops, and calling external programs. Examples are provided for many common shell scripting tasks like file manipulation, text processing, scheduling jobs, and more.
Watch this talk here: https://www.confluent.io/online-talks/from-zero-to-hero-with-kafka-connect-on-demand
Integrating Apache Kafka with other systems in a reliable and scalable way is often a key part of a streaming platform. Fortunately, Apache Kafka includes the Connect API that enables streaming integration both in and out of Kafka. Like any technology, understanding its architecture and deployment patterns is key to successful use, as is knowing where to go looking when things aren't working.
This talk will discuss the key design concepts within Apache Kafka Connect and the pros and cons of standalone vs distributed deployment modes. We'll do a live demo of building pipelines with Apache Kafka Connect for streaming data in from databases, and out to targets including Elasticsearch. With some gremlins along the way, we'll go hands-on in methodically diagnosing and resolving common issues encountered with Apache Kafka Connect. The talk will finish off by discussing more advanced topics including Single Message Transforms, and deployment of Apache Kafka Connect in containers.
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, very likely your jobs are configured incorrectly. As we found out, it is not straightforward, and it is not well documented either. This session will provide information on the types of memory to be aware of, the calculations involved in determining how much is allocated to each type of memory, and how to tune it depending on the use case.
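As a hedged illustration of the kind of calculation the session refers to, the sketch below walks through Spark's unified memory split for an assumed 8 GB executor heap, using the commonly documented 300 MB reserve and default fractions.

```scala
object MemorySplitSketch extends App {
  // Worked example of Spark's unified memory accounting (a sketch, not a tuning guide).
  val executorHeapMb  = 8 * 1024   // assumed executor JVM heap in MB (e.g. --executor-memory 8g)
  val reservedMb      = 300        // fixed reservation for Spark internals
  val memoryFraction  = 0.6        // spark.memory.fraction (default)
  val storageFraction = 0.5        // spark.memory.storageFraction (default)

  val usableMb  = executorHeapMb - reservedMb
  val unifiedMb = usableMb * memoryFraction         // shared execution + storage pool
  val storageMb = unifiedMb * storageFraction       // storage share (evictable; execution can borrow)
  val userMb    = usableMb * (1 - memoryFraction)   // left for user data structures, UDF objects, etc.

  println(f"unified pool: $unifiedMb%.0f MB, storage share: $storageMb%.0f MB, user memory: $userMb%.0f MB")
}
```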
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ... (Spark Summit)
Apache Spark 2.1.0 boosted the performance of Apache Spark SQL thanks to Project Tungsten software improvements. A further 16x speedup has been achieved by using Oracle's innovations for Apache Spark SQL. This 16x improvement is made possible by Oracle's Software in Silicon accelerator offload technologies.
Apache Spark SQL in-memory performance is becoming more important due to many factors. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition, on-prem and cloud servers are getting larger physical memory, enabling these huge workloads to be stored in memory. In this talk we will also look at using Spark SQL for feature creation and feature generation within Spark ML pipelines (a brief sketch of this idea appears after this abstract).
This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestions to support these kinds of workloads in real applications and cloud deployments. In addition, ideas for the next generation of the Tungsten project will be discussed.
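The abstract above mentions using Spark SQL for feature creation within Spark ML pipelines; as a hedged sketch of that idea, the example below uses SQLTransformer to derive feature columns with a SQL expression (the column names and data are made up for illustration).

```scala
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.sql.SparkSession

object SqlFeatureSketch extends App {
  val spark = SparkSession.builder().appName("sql-feature-sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Toy input; column names are made up for illustration.
  val df = Seq((1, 2.0, 3.0), (2, 5.0, 1.0)).toDF("id", "v1", "v2")

  // SQLTransformer runs a SQL statement in which __THIS__ stands for the input table;
  // here it derives two new feature columns that a downstream ML stage could consume.
  val sqlTrans = new SQLTransformer().setStatement(
    "SELECT *, (v1 + v2) AS v_sum, (v1 * v2) AS v_prod FROM __THIS__")

  sqlTrans.transform(df).show()
  spark.stop()
}
```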
Linux uses /proc/iomem as a "Rosetta Stone" to establish relationships between software and hardware. /proc/iomem maps physical memory address ranges to the devices and drivers that claim them, similar to how the Rosetta Stone helped map Egyptian hieroglyphs to Greek and decode ancient Egyptian texts. This virtual file exposes the kernel's view of the physical address space, showing which regions belong to RAM and which are used for device I/O.
Introduction to Linux Kernel TCP/IP Protocol Stack (monad bobo)
This document provides an introduction and overview of the networking code in the Linux kernel source tree. It discusses the different layers including link (L2), network (L3), and transport (L4) layers. It describes the input and output processing, device interfaces, traffic directions, and major developers for each layer. Config and benchmark tools are also mentioned. Resources for further learning about the Linux kernel networking code are provided at the end.
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen, first hand, the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
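As a minimal, hedged sketch of what a Kafka Streams application looks like when driven from Scala through the Java API (topic names, the application id, and the transformation are placeholders):

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.ValueMapper

object StreamsSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-sketch")      // placeholder application id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")   // placeholder broker
  props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
  props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

  // Read from one topic, transform each value, write the result to another topic.
  val builder = new StreamsBuilder()
  builder.stream[String, String]("input-events")
    .mapValues(new ValueMapper[String, String] {
      override def apply(value: String): String = value.toUpperCase
    })
    .to("output-events")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```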
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
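As a hedged sketch of the configuration surface the document mentions, a few commonly tuned shuffle-related properties are shown below; the values are illustrative assumptions, not recommendations.

```scala
import org.apache.spark.SparkConf

object ShuffleConfigSketch {
  // Hedged sketch of common shuffle-related settings; values are illustrative only.
  val conf: SparkConf = new SparkConf()
    .set("spark.shuffle.compress", "true")          // compress map output files
    .set("spark.shuffle.spill.compress", "true")    // compress data spilled during shuffle
    .set("spark.shuffle.service.enabled", "true")   // serve shuffle files via the external shuffle service
    .set("spark.shuffle.file.buffer", "64k")        // per-output-stream in-memory buffer
    .set("spark.reducer.maxSizeInFlight", "96m")    // map output fetched concurrently per reduce task
}
```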
Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
This document summarizes TenMax's data pipeline experience over the years from 2015 to 2017. It describes three versions of the data pipeline used to generate reports from raw event data. Version 1 used MongoDB but had poor write performance. Version 2 used Cassandra and had better write performance using LSM trees, but was costly to operate. Version 3 uses Kafka, Fluentd, Azure Blob storage and Spark to provide a scalable, cost-effective solution that can handle high throughput and complex aggregations. The document also discusses lessons learned around balancing features, costs and technologies like Spark, streaming and serverless models.
Getting Started with Gradle: A CLI Application Example @ JCConf 2014 (Chen-en Lu)
In the world of Java build systems, Ant dominated early on, was gradually replaced by Maven, and more recently more and more people are choosing Gradle. Because Gradle is simple yet powerful, and reuses the Maven Central Repository, it lets us manage our projects more easily. This talk uses a CLI (Command Line Interface) application as an example: creating a Gradle project, integrating third-party libraries, integrating mainstream IDEs, and finally generating a convenient shell script to run the application, all in one go. In just 15 minutes, we hope to give those who have not yet used Gradle a feel for its appeal.
This document provides an introduction to real-time bidding (RTB) and retargeting for online advertising. It discusses the roles of advertisers, agencies, publishers and ad networks. It explains how RTB works through ad exchanges and real-time auctions. It also covers how targeting and retargeting can be used to show ads to the most relevant audiences, such as through contextual, behavioral and retargeting techniques. The document outlines how cookie matching across different domains allows retargeting users across multiple sites through real-time bidding on ad exchanges.
DataFrame Operations (slide excerpt): the same aggregation expressed in SQL and with the Spark DataFrame API.
SQL:
SELECT year, region, SUM(people_total) AS people_total
FROM population
GROUP BY year, region
ORDER BY people_total DESC
Spark DataFrame: the slide's code was not recovered; a sketch of the equivalent query follows below.
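What follows is a hedged sketch of the equivalent Spark DataFrame query in Scala, assuming a table or view named population with year, region, and people_total columns.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{desc, sum}

object DataFrameOpsSketch extends App {
  val spark = SparkSession.builder().appName("dataframe-ops-sketch").getOrCreate()

  // Assumed source: a registered table or view named "population".
  val population = spark.table("population")

  val result = population
    .groupBy("year", "region")
    .agg(sum("people_total").as("people_total"))
    .orderBy(desc("people_total"))

  result.show()
  spark.stop()
}
```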