Spark DataFrames Introduction
Dogenzaka LT Festival #2 (statistics, machine learning, data extraction)
Yu Ishikawa

Goals of this talk
• Writing Hadoop MapReduce jobs is a pain
• How can Python and R users work with big data easily?
You can do that with Spark DataFrame

You know Apache Spark, right?

Speed
10x to 100x faster than Hadoop MapReduce

Ease of Use
Even PageRank can be implemented in a few lines

Generality
Machine learning libraries and more are available out of the box

Runs Everywhere
Rich integration with Hadoop, Mesos, and others

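Development priorities for 2015
• Data Science
  - Provide easy-to-use high-level APIs
• APIs with the kind of consistency scikit-learn has
• Platform Interfaces
  - Provide interfaces that make it easier to access a variety of data sources and to use algorithms
• Make third-party libraries usable through spark-packages, which works something like a package manager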

Introducing DataFrame APIs
Many of you are probably thinking "Spark? What's that?", so this talk stays at a surface level

Agenda
• What are DataFrame APIs?
• Introducing DataFrame APIs
• Demo
• Summary

What are DataFrame APIs?
• Operations commonly used in data science, packaged as domain-specific functions
  - Project
  - Filter
  - Aggregation
  - Join
  - UDFs
• Available from Python, Java, Scala, and R (via SparkR)
• A feature released in Spark 1.3
A feature for processing big data on Spark more easily and faster

With Spark, MapReduce-style jobs are easier to implement

DataFrame APIs make Spark easier to work with

DataFrame's optimization machinery makes processing faster

Main DataFrame APIs
• Creation
• Check Schema
• Project
• Filter
• Aggregation
• Join
• UDFs

Creation
• Load the data you want to work with in the DataFrame API
• JSON, Hive, Parquet, and other sources are supported
// Create a SQLContext (sc is an existing SparkContext)
val context = new org.apache.spark.sql.SQLContext(sc)
// Create a DataFrame for GitHub events
var path = "file:///tmp/github-archive-data/*.json.gz"
val event = context.load(path, "json").as('event)
// Create a DataFrame for GitHub users
path = "file:///tmp/github-archive-data/github-users.json"
val user = context.load(path, "json").as('user)

Project
• Use select() to choose the columns you want to extract
  - Nested columns can be selected with the $"parent.child" notation
• Use as to create an alias, as in select('key as 'alias)
// Select a column
event("public")
event.select('public as 'PUBLIC)
// Select multiple columns with aliases
event.select('public as 'PUBLIC, 'id as 'ID)
// Select nested columns with aliases
event.select($"payload.size" as 'size, $"actor.id" as 'actor_id)

Filter
• filter() plays the role of WHERE in SQL
• To specify multiple conditions, either put them in a single filter() or split them into a chain of two filter() calls
// Filter by a condition
user.filter("name is not null")
// Filter by a combination of two conditions
event.filter("public = true and type = 'ForkEvent'")
event.filter("public = true").filter("type = 'ForkEvent'")
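
The equivalence of the two forms can be sketched on plain Scala collections, without Spark. The `Event` case class and its sample rows are hypothetical stand-ins for rows of the event DataFrame, not part of the original demo.

```scala
object FilterSketch {
  // Hypothetical stand-in for one row of the event DataFrame.
  case class Event(isPublic: Boolean, eventType: String)

  val events = Seq(
    Event(isPublic = true, eventType = "ForkEvent"),
    Event(isPublic = true, eventType = "PushEvent"),
    Event(isPublic = false, eventType = "ForkEvent"))

  // Both conditions in a single predicate.
  val combined = events.filter(e => e.isPublic && e.eventType == "ForkEvent")

  // The same conditions as a chain of two filters.
  val chained = events.filter(_.isPublic).filter(_.eventType == "ForkEvent")

  def main(args: Array[String]): Unit = {
    // Both forms keep only the public ForkEvent row.
    println(combined == chained)
  }
}
```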

Aggregation
• count() simply counts the number of records
• groupBy() works like GROUP BY in SQL
• Combining it with agg() makes it even more powerful
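// count() and sum() used in agg() come from the functions object
import org.apache.spark.sql.functions._
// Count the number of records
user.count
// Group by the type column and then count
event.groupBy("type").count()
// Aggregate by the id column
user.groupBy('id).agg('id, count("*"), sum('id))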
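
As a rough picture of what grouping by a column and then counting computes, here is the same idea on plain Scala collections. This is an analogy rather than Spark code, and the `Event` case class and sample values are made up for illustration.

```scala
object GroupCountSketch {
  // Hypothetical stand-in for one row of the event DataFrame.
  case class Event(eventType: String)

  val events = Seq(Event("PushEvent"), Event("ForkEvent"), Event("PushEvent"))

  // Group rows by key, then count each group: GROUP BY type with COUNT(*).
  val counts: Map[String, Int] =
    events.groupBy(_.eventType).map { case (t, rows) => t -> rows.size }

  def main(args: Array[String]): Unit = {
    counts.foreach(println)
  }
}
```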

Join
• First register an alias for each DataFrame with as()
• Use join() and where() to specify the join and its condition
• To pick columns out of the joined result, refer to them through the registered aliases
// Register their aliases (new names, since a val cannot be defined in terms of itself)
val u = user.as('user)
val pr = event.filter('type === "PullRequestEvent").as('pr)
// Join the two data sets
val joined = pr.join(u).where($"pr.payload.pull_request.user.id" === $"user.id")
joined.select($"pr.type", $"user.name", $"pr.created_at")
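
The join-then-where pattern above behaves like an inner join: only pairs whose ids match survive. A minimal sketch on plain Scala collections, assuming hypothetical `User` and `PullRequest` case classes and made-up sample rows:

```scala
object JoinSketch {
  // Hypothetical stand-ins for rows of the user and pull-request DataFrames.
  case class User(id: Long, name: String)
  case class PullRequest(userId: Long, createdAt: String)

  val users = Seq(User(1L, "alice"), User(2L, "bob"))
  val prs = Seq(
    PullRequest(1L, "2015-01-01T00:00:00Z"),
    PullRequest(3L, "2015-01-02T00:00:00Z"))

  // Pair every pull request with every user, keep the pairs whose ids match,
  // then project the columns of interest.
  val joined: Seq[(String, String)] =
    for {
      pr <- prs
      u <- users
      if pr.userId == u.id
    } yield (u.name, pr.createdAt)

  def main(args: Array[String]): Unit = {
    joined.foreach(println)
  }
}
```

The pull request from user 3 has no matching user, so it drops out of the result.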

UDF: User Defined Function
• Define your own function with udf() and use it inside a DataFrame
• Example: from the string "2015-01-01T00:00:00Z"
  - define a function that extracts "2015-01-01"
  - define a function that extracts "00:00:00"
// Define User Defined Functions
val toDate = udf((createdAt: String) => createdAt.substring(0, 10))
val toTime = udf((createdAt: String) => createdAt.substring(11, 19))
// Use the UDFs in select()
event.select(toDate('created_at) as 'date, toTime('created_at) as 'time)
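
The substring offsets can be checked outside Spark by applying the same two functions to the example string. The object name `TimestampParts` is only for illustration.

```scala
object TimestampParts {
  // Same logic as the UDF bodies, as plain functions.
  def toDate(createdAt: String): String = createdAt.substring(0, 10)
  def toTime(createdAt: String): String = createdAt.substring(11, 19)

  def main(args: Array[String]): Unit = {
    val ts = "2015-01-01T00:00:00Z"
    println(toDate(ts)) // 2015-01-01
    println(toTime(ts)) // 00:00:00
  }
}
```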

With Machine Learning (1)
• Apply word2vec to GitHub commit messages
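import scala.collection.mutable.ArrayBuffer
import org.apache.spark.mllib.feature.Word2Vec
// Assumption: the slide does not show `delimiter`; any string that never occurs in a message works
val delimiter = "\u0001"
// Join each event's commit messages into one delimited string
val toIter = udf((x: ArrayBuffer[String]) => x.mkString(delimiter))
val messages = event.
  select($"payload.commits.message" as 'messages).filter("messages is not null").
  select(toIter($"messages")).
  flatMap(row => row.toSeq.map(_.toString).apply(0).split(delimiter))
// Strip newlines, collapse whitespace, split on spaces, drop a trailing comma or period
val message = messages.map(_.replaceAll("""\n""", "").
  replaceAll("""\s+""", " ").split(" ").map(_.replaceAll("""(,|\.)$""", "")).toSeq).
  filter(_.size > 0)
// Create a model
val model = new Word2Vec().fit(message)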
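
The message-cleaning step can be tried on a single string without Spark. This sketch assumes the regular expressions in the slide originally carried backslashes (`\n`, `\s+`, `\.`) that were lost in extraction. The `tokenize` helper and the sample message are hypothetical.

```scala
object CommitTokenizer {
  // Strip newlines, collapse runs of whitespace, split on single spaces,
  // then drop one trailing comma or period from each token.
  def tokenize(message: String): Seq[String] =
    message
      .replaceAll("""\n""", "")
      .replaceAll("""\s+""", " ")
      .split(" ")
      .map(_.replaceAll("""(,|\.)$""", ""))
      .toSeq

  def main(args: Array[String]): Unit = {
    // Yields the tokens Fix, typo, in, parser
    println(tokenize("Fix   typo in parser,\n"))
  }
}
```

Because newlines are removed rather than replaced with a space, words on either side of a line break are glued together. That matches the slide's code as reconstructed here.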

With Machine Learning (2)
• Print similar words using the trained model
> model.findSynonyms("bug", 10).foreach(println)
(issue,0.6874246597290039)
(typo,0.663004457950592)
(bugs,0.599325954914093)
(errors,0.5887047052383423)
(problem,0.5665265321731567)
(fixes,0.5617778897285461)
(spelling,0.5353047847747803)
(crash,0.5330312848091125)
(Fixed,0.5128884315490723)
(small,0.5113803744316101)

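Summary
• DataFrame APIs let you process big data on Spark more easily and faster
• Functions commonly used for data manipulation, such as groupBy, agg, count, and join, are provided
  - If you know Pandas, picture something along those lines
• You can also define your own functions with UDFs
• DataFrames can be combined with machine learning libraries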

Open issues with DataFrame
• The data conversion needed to feed a DataFrame into mllib, Apache Spark's machine learning library, is tedious
• A mechanism for more seamless integration seems necessary

DataFrame Introduction
• spark-dataframe-introduction
• http://goo.gl/Futoi0

2015-03-12 Dogenzaka LT Festival #2: Spark DataFrame Introduction
