Spark Machine Learning and Deep Learning Deep Dive.
Scenarios that use Spark in combination with other data analytics tools (MS R on Spark, TensorFlow/Keras with Spark, scikit-learn with Spark, etc.)
Running GA4 without gtag.js using ssGTM and elbwalker? - Markus Baersch
Walker.js is an open source event tracking library that can be used to feed data into Google Analytics 4 (GA4) without using gtag.js. It provides more control over session handling and sends events to configurable destinations. Events can be sent to a self-hosted Google Tag Manager (GTM) container that forwards them to GA4, or to a custom endpoint. Walker.js parses data attributes to automatically construct analytics events or events can be manually pushed via JavaScript. It handles consent to determine where data can be sent. The library provides flexibility in how data is collected and routed compared to relying solely on gtag.js.
Bitmovin AV1/VVC Presentation, Streaming Media East - Christian Feldmann, Bitmovin Inc
This document provides an overview and update on AV1 and VVC video coding standards. It summarizes the novel technical features of each, including AV1 improvements like overlapped block motion compensation and affine motion, and VVC features like triangle partitioning and decoder-side motion refinement. Performance results show VVC provides around 30% better compression than HM 16.20, while AV1 is around 20% better. Both standards are still in development and neither has wide adoption yet. VVC licensing is unknown while AV1 remains free and open.
GDC 14: Bringing Unreal Engine 4 to OpenGL - Changehee Lee
This document discusses bringing Unreal Engine 4 to OpenGL by porting its render hardware interface (RHI) to support OpenGL. It describes mapping the D3D11-based RHI to OpenGL, developing a cross-compiler to compile HLSL shaders to GLSL, addressing differences between D3D and OpenGL like texture coordinate systems, and optimizations to achieve performance parity with D3D11. It also covers bringing UE4 to Android by leveraging an NVIDIA Tegra K1 mobile chip's full OpenGL 4.4 support.
eMetrics London - The AB Testing Hype Cycle - Craig Sullivan
The document discusses best practices for A/B testing, including:
1. Performing analytics health checks and modelling to understand user flows before testing.
2. Testing in areas informed by analytics rather than copying competitors, and accounting for device mix.
3. Doing user research like surveys, interviews and usability testing to inform testing.
4. Prioritizing high opportunity, low cost tests and creating a money model to estimate potential returns.
5. Conducting pre-flight checks to ensure tests are functioning properly across devices before launching.
1. Q-learning is a type of reinforcement learning algorithm that seeks to learn the optimal policy for an agent to take actions in an environment to maximize rewards.
2. The algorithm works by maintaining a Q-table that contains Q-values representing the expected rewards for state-action pairs, which are updated using the Bellman equation as the agent interacts with the environment.
3. Over time, the Q-values converge and the agent learns the optimal policy to take the best actions under different states to maximize long-term rewards without requiring a model of the environment.
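The tabular Q-learning loop described above can be sketched in a few lines of Python. This is a minimal illustration on a toy 1-D corridor environment (states 0 to 4, with a reward at state 4); the environment, hyperparameters, and episode count are all illustrative choices, not part of the original summary.

```python
import random

# Toy corridor: states 0..4, reaching state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, [0, 1]          # action 0 = left, 1 = right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # illustrative hyperparameters

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Environment dynamics: move left/right (clamped), reward 1 at the goal."""
    nxt = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

random.seed(0)
for _ in range(500):                    # episodes
    s = 0
    while True:
        # epsilon-greedy action selection
        if random.random() < EPSILON:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2
        if done:
            break

# The learned greedy policy should move right from every interior state.
policy = {s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)}
```

Note that the agent never needed a model of the corridor's dynamics: the Q-table alone, updated from experience, is enough to recover the optimal policy.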
Recommendation systems today are widely used across many applications such as multimedia content platforms, social networks, and ecommerce, to provide suggestions that are most likely to fulfill users' needs, thereby improving the user experience. Academic research, to date, largely focuses on the performance of recommendation models in terms of ranking quality or accuracy measures, which often don't directly translate into real-world improvements. In this talk, we present some of the most interesting challenges that we face in the personalization efforts at Netflix. The goal of this talk is to highlight challenging research problems in industrial recommendation systems and start a conversation about exciting areas of future research.
This document discusses techniques for lighting and tonemapping in 3D graphics to better simulate the human visual system. It covers gamma correction, which accounts for how monitors display light intensities non-linearly. It also discusses filmic tonemapping, which produces crisp blacks, saturated dark tones, and soft highlights similar to film, by applying a tone curve modeled after photographic film. This provides advantages over other tonemapping operators like Reinhard for reproducing accurate colors across a high dynamic range.
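The gamma-correction and filmic-tonemapping pipeline described above can be sketched concisely. The filmic operator used here is John Hable's well-known "Uncharted 2" curve, chosen as a representative example; the talk's exact curve and constants may differ.

```python
# Hable's filmic curve constants: shoulder strength, linear strength,
# linear angle, toe strength, toe numerator, toe denominator.
A, B, C, D, E, F = 0.15, 0.50, 0.10, 0.20, 0.02, 0.30
W = 11.2  # linear white point

def hable(x):
    """Hable's filmic tone curve (unnormalized)."""
    return ((x * (A * x + C * B) + D * E) / (x * (A * x + B) + D * F)) - E / F

def filmic_tonemap(hdr, exposure=2.0):
    """Map a linear HDR value into [0, 1], normalized so W maps to white."""
    return hable(hdr * exposure) / hable(W)

def gamma_encode(linear, gamma=2.2):
    """Encode a linear [0, 1] value for a display with the given gamma."""
    return linear ** (1.0 / gamma)

# Typical usage: tonemap in linear space first, then gamma-encode for display.
pixel_out = gamma_encode(filmic_tonemap(1.0))
```

The curve's toe term (D, E, F) is what produces the crisp blacks and saturated dark tones mentioned above, while the shoulder term (A) rolls highlights off softly instead of clipping them as Reinhard-style operators tend to.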
A Bizarre Way to do Real-Time Lighting - Steven Tovey
This document provides a 10 step guide for implementing real-time lighting on the PlayStation 3 (PS3) using its parallel architecture of 6 Synergistic Processing Units (SPUs). It discusses rendering a pre-pass to extract normals and depth, calculating lighting in a tile-based parallel manner on the SPUs, and compositing the final lighting texture. Special techniques like using atomics, striping data across SPUs, and maintaining pipeline balance are needed to optimize performance on the PS3's parallel architecture. The goal is to achieve real-time lighting for a game with 20 cars racing at night, while preserving picture quality and reducing frame latency to acceptable levels.
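The tile-based step at the heart of the approach above can be illustrated on the CPU: the screen is split into tiles, each light is conservatively tested against each tile, and per-pixel shading then only considers the lights binned to its tile. Tile size, screen size, and the light representation below are illustrative assumptions, not details from the talk.

```python
TILE = 32                      # tile edge in pixels
WIDTH, HEIGHT = 128, 96        # toy framebuffer

def bin_lights(lights):
    """lights: list of (cx, cy, radius) in screen space -> {tile: [light ids]}."""
    tiles_x, tiles_y = WIDTH // TILE, HEIGHT // TILE
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for i, (cx, cy, r) in enumerate(lights):
        for (tx, ty), ids in bins.items():
            # Conservative overlap test: clamp the light centre to the tile's
            # rectangle and compare the squared distance against the radius.
            x0, y0 = tx * TILE, ty * TILE
            nearest_x = min(max(cx, x0), x0 + TILE)
            nearest_y = min(max(cy, y0), y0 + TILE)
            if (cx - nearest_x) ** 2 + (cy - nearest_y) ** 2 <= r * r:
                ids.append(i)
    return bins

bins = bin_lights([(16, 16, 10), (100, 80, 20)])
```

On the PS3 each SPU would process a stripe of these tiles in parallel; the binning above is the serial essence of that work distribution.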
Data.Monks: sGTM is a universal endpoint - Doug Hall
Server-side, serverless, server-to-server; scalable from zero to enterprise; cookieless, with ITP-compliant cookies; open source; automated data collection and auditing; personalisation data flows for 1 Euro/day.
The document summarizes collaborative filtering in cloud computing. It discusses key concepts like cloud computing, collaborative filtering, and Hadoop. It then covers different types of collaborative filtering like user-based, item-based, memory-based, and model-based approaches. Specific algorithms like nearest neighbor and top-N recommendations are explained. Challenges like cold start, ratings, interfaces and metrics to evaluate performance are also summarized.
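The item-based, nearest-neighbour flavour of collaborative filtering summarized above can be sketched briefly: score each item the user has not seen by its similarity to items they have rated, then return the top-N. The ratings matrix and similarity choice (cosine) below are illustrative, not from the document.

```python
from math import sqrt

# Toy user -> {item: rating} matrix for illustration.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 4, "b": 4, "d": 2},
    "carol": {"b": 2, "c": 5, "d": 4},
    "dave":  {"c": 3, "d": 5, "e": 4},
}

def item_vector(item):
    """An item viewed as a user -> rating mapping."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(v1, v2):
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    n1 = sqrt(sum(x * x for x in v1.values()))
    n2 = sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2)

def top_n(user, n=2):
    """Rank unseen items by similarity-weighted sums of the user's own ratings."""
    seen = ratings[user]
    items = {i for r in ratings.values() for i in r}
    scores = {
        cand: sum(cosine(item_vector(cand), item_vector(i)) * r
                  for i, r in seen.items())
        for cand in items - set(seen)
    }
    return sorted(scores, key=scores.get, reverse=True)[:n]

recs = top_n("alice")
```

The cold-start challenge mentioned above is visible even here: a brand-new item shares no raters with anything, so every cosine similarity (and hence its score) is zero.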
Presentation at KGC2012 about the Architecture and Optimization techniques used to make the Relic FX System as used in the Company of Heroes and Dawn Of War Series, and many other Relic Games.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... - Spark Summit
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as: low level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released The Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner and how to efficiently deploy a Spark library across multiple platforms.
Deep Learning Text NLP and Spark Collaboration - Hoondong Kim
These slides explain deep learning text NLP for the Korean language, and discuss extending the deep learning approach to BigData-scale data using Spark.
Apache Kafka Streams + Machine Learning / Deep Learning - Kai Wähner
This document discusses applying machine learning models to real-time stream processing using Apache Kafka. It covers building analytic models from historical data, applying those models to real-time streams without redevelopment, and techniques for online training of models. Live demos are presented using open source tools like Kafka Streams, Kafka Connect, and H2O to apply machine learning to streaming use cases like flight delay prediction. The key takeaway is that streaming platforms can leverage pre-built machine learning models to power real-time analytics and actions.
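The core pattern above, training a model offline on historical data and then applying it to a live stream without redevelopment, can be sketched with plain Python generators standing in for the Kafka Streams API. The "model" here is a toy threshold rule playing the role of, say, a flight-delay classifier; everything in the snippet is an illustrative assumption.

```python
def train_model(history):
    """Offline step: learn a mean departure-delay threshold from history."""
    threshold = sum(history) / len(history)
    return lambda delay_minutes: delay_minutes > threshold

def process_stream(events, model):
    """Online step: map each incoming event to an enriched prediction record."""
    for event in events:
        yield {"delay": event, "predicted_late": model(event)}

model = train_model([5, 10, 15, 30])   # historical delays, mean threshold = 15
stream = [2, 40, 15, 60]               # simulated real-time events
predictions = list(process_stream(stream, model))
```

In a real deployment, `process_stream` would be a Kafka Streams topology (or a Kafka consumer/producer pair) and the model an H2O or TensorFlow artifact loaded at startup; the separation between the offline training step and the per-record scoring step is the point being illustrated.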
Deep Learning with Apache Spark and GPUs with Pierce Spitler - Databricks
Apache Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This session will cover:
- How to leverage Spark and TensorFlow for hyperparameter tuning and for deploying trained models
- DeepLearning4J, CaffeOnSpark, IBM's SystemML and Intel's BigDL
- Sidecar GPU cluster architecture and Spark-GPU data reading patterns
- The pros, cons and performance characteristics of various approaches
You'll leave the session better informed about the available architectures for Spark and deep learning, and Spark with and without GPUs for deep learning. You'll also learn about the pros and cons of deep learning software frameworks for various use cases, and discover a practical, applied methodology and technical examples for tackling big data deep learning.
Jeremy Nixon, Machine Learning Engineer, Spark Technology Center at MLconf AT... - MLconf
Convolutional Neural Networks at scale in Spark MLlib:
Jeremy Nixon will focus on the engineering and applications of a new algorithm built on top of MLlib. The presentation will focus on the methods the algorithm uses to automatically generate features to capture nonlinear structure in data, as well as the process by which it's trained. Major aspects of that include compositional transformations over the data, convolution, and distributed backpropagation via SGD with adaptive gradients and an adaptive learning rate. Applications will look into how to use convolutional neural networks to model data in computer vision, natural language and signal processing. Details around optimal preprocessing, the type of structure that can be learned, and managing its ability to generalize will inform developers looking to apply nonlinear modeling tools to problems that they face.
Build a Deep Learning Pipeline on Apache Spark for Ads Optimization - Craig Chao
This document discusses building deep learning pipelines on Apache Spark for ad optimization. It begins by discussing how data has become a new form of colonialism. It then explains why deep learning should be done on Apache Spark rather than just TensorFlow. The remainder of the document discusses machine learning pipelines on Apache Spark, how machine learning and deep learning can be used for ad optimization, and various approaches to deep learning on Apache Spark using tools like MMLSpark, Databricks, DL4J, BigDL, and SystemML.
1. PyData is a community for users and developers of open-source data tools in Python including NumPy, Pandas, SciPy, scikit-learn, IPython, and Jupyter.
2. Pandas is a software library written for data manipulation and analysis in Python, built on top of NumPy and SciPy. It provides data structures and operations for working with relational or labeled data and time series.
3. Jupyter Notebook is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations and explanatory text. It supports over 40 programming languages including Python, R and Julia.
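A minimal pandas example of the labeled-data operations described above: build a DataFrame, then aggregate by label. The column names and values are illustrative.

```python
import pandas as pd

# A small labeled dataset: one row per transaction.
df = pd.DataFrame(
    {"city": ["London", "Paris", "London"], "sales": [100, 80, 120]}
)

# Label-based aggregation: total sales per city.
total_by_city = df.groupby("city")["sales"].sum()
```

In a Jupyter notebook, evaluating `total_by_city` in a cell renders the resulting Series inline, which is the live-code-plus-output workflow described above.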
Deep Learning Streaming Platform with Kafka Streams, TensorFlow, DeepLearning... - Kai Wähner
Talk from JavaOne 2017: Apache Kafka + Kafka Streams for Scalable, Mission Critical Deep Learning.
Intelligent real time applications are a game changer in any industry. Deep Learning is one of the hottest buzzwords in this area. New technologies like GPUs combined with elastic cloud infrastructure enable the sophisticated usage of artificial neural networks to add business value in real world scenarios. Tech giants use it e.g. for image recognition and speech translation. This session discusses some real-world scenarios from different industries to explain when and how traditional companies can leverage deep learning in real time applications.
This session shows how to deploy Deep Learning models into real time applications to do predictions on new events. Apache Kafka will be used to execute analytic models in a highly scalable and performant way.
The first part introduces the use cases and concepts behind Deep Learning. It discusses how to build Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Autoencoders leveraging open source frameworks like TensorFlow, DeepLearning4J or H2O.
The second part shows how to deploy the built analytic models to real time applications leveraging Apache Kafka as streaming platform and Apache Kafka's Streams API to embed the intelligent business logic into any external application or microservice.
Some further material around Apache Kafka and Machine Learning:
- Blog Post: How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka: https://www.confluent.io/blog/build-deploy-scalable-machine-learning-production-apache-kafka/
- Video: Build and Deploy Analytic Models with H2O.ai and Apache Kafka: https://www.youtube.com/watch?v=-q7CyIExBKM&feature=youtu.be
- Code: Github Examples using Apache Kafka, TensorFlow, H2O, DeepLearning4J: https://github.com/kaiwaehner/kafka-streams-machine-learning-examples
100% Serverless BigData Scale Production Deep Learning System - Hoondong Kim
- BigData Scale Deep Learning Training System (with GPU Docker PaaS on Azure Batch AI)
- Deep Learning Serving Layer (with Auto Scale Out Mode on Web App for Linux Docker)
- BigDL, Keras, TensorFlow, Horovod, TensorflowOnAzure
- E-commerce BigData Scale AI Journey
- BigData Scale Deep Learning Production System Use Case
- Deep Learning, Cloud PaaS, Microservices, DevOps, etc.
- E-Commerce AI Production System Strategy