4. Requirements: Iterative computing
Optimization Methods in Machine Learning
Iterative methods
- SGD (Stochastic Gradient Descent) for LR
- Newton's methods
Big-data trade-off:
- sophisticated algorithm, but long computing time
- simple algorithm, but larger per-step error, so it needs iteration until convergence
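The "simple but iterative" side of the trade-off above can be sketched with a minimal SGD loop for one-parameter linear regression. The toy data, learning rate, and epoch count are illustrative assumptions, not from the source.

```python
# Minimal SGD sketch: fit y ~ w*x by stochastic gradient descent
# on squared error, iterating until the estimate converges.

def sgd_fit(data, lr=0.001, epochs=50):
    """One pass over the samples per epoch; per-sample gradient step."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = (w * x - y) * x   # d/dw of 0.5 * (w*x - y)^2
            w -= lr * grad
    return w

# Toy dataset generated from the true weight w = 2
data = [(x, 2.0 * x) for x in range(1, 21)]
w = sgd_fit(data)
```

Each step is cheap and simple, but many passes over the same dataset are needed, which is exactly why caching the data across iterations matters.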
E.g. Recommendation System (Sohu Shipin)
Possible requirements:
1. User CF (real-time session user log analysis)
2. Item CF (tag/label based, clustering?)
3. LFM (SGD-based latent factor model)
4. Graph-based (PersonalRank on the User-Item bipartite graph)
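The graph-based approach above can be sketched as a random walk with restart on a user-item bipartite graph. The tiny graph, damping factor, and iteration count are illustrative assumptions.

```python
# Minimal PersonalRank sketch on a user-item bipartite graph:
# a random walk that restarts at the root user with probability 1 - alpha.

def personal_rank(graph, root, alpha=0.8, iters=20):
    """graph maps each node to its neighbor list; returns node -> score."""
    rank = {node: 0.0 for node in graph}
    rank[root] = 1.0
    for _ in range(iters):
        new_rank = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            for nb in neighbors:
                # Spread this node's mass evenly over its neighbors
                new_rank[nb] += alpha * rank[node] / len(neighbors)
        new_rank[root] += 1 - alpha   # restart at the root user
        rank = new_rank
    return rank

# Toy bipartite graph: users A, B and items a, b, c (edges in both directions)
G = {
    "A": ["a", "b"],
    "B": ["b", "c"],
    "a": ["A"],
    "b": ["A", "B"],
    "c": ["B"],
}
scores = personal_rank(G, "A")
```

For user A, items reachable in fewer hops (a, b) end up ranked above the distant item c; this per-node propagation is also the pattern that maps naturally onto iterative graph frameworks.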
5. Before Spark
Issues in iterative computing
1. Many iterative passes over the data
2. Need to cache the dataset or intermediate results
Problems with Hadoop
Hadoop only supports a single pass per job, and every iteration must reload the data
Iterative MapReduce
Twister, HaLoop, etc.
Their optimizations are immature, and they remain constrained by the MapReduce computing model
Mahout originally supported only Hadoop; it now supports Spark and adds a DSL that interoperates with Scala
Interactive computing
Hive + Distributed cache
7. Other Requirements
1. Offline Batch Processing
Hadoop Ecosystem (MapReduce)
Dryad (Microsoft DAG workflow + SCOPE)
2. Stream Computing
Twitter Storm (Spout/Bolt model)
Yahoo S4 (Actors model)
Google Percolator, IBM SPC
Kafka, Flume (Pub-Sub pipes)
3. Iterative Computing
Spark Ecosystem (RDD, DAG)
Iterative MapReduce (HaLoop, Twister)
Petuum (SSP model)
4. Graph Computing (Iterative + Messaging, e.g. PageRank)
Google Pregel (BSP model)
Apache Giraph (BSP model)
Apache HAMA (BSP + Matrix)
32. Deployment issues: submit, deploy-mode, master
Your Program:
sc = new SparkContext
f = sc.textFile(“hdfs://…”)
f.filter(…)
.count()
...
./bin/spark-submit
--class <main-class>
--master <master-url> (local; local[K]; local[*]; spark://…; mesos://…; yarn-client; yarn-cluster)
--deploy-mode <deploy-mode> (client; cluster)
--conf <key>=<value>
... # other options
<application-jar>
[application-arguments]
Client mode (Standalone, Mesos, or YARN): the Driver Program runs on the client; spark-submit launches it locally. On YARN, it then relies on yarn-client mode and an ApplicationMaster for resource allocation.
Cluster mode on Standalone: spark.deploy.Client wraps your program; the Driver Program runs inside the cluster.
Cluster mode on Mesos: the Driver Program runs on a Mesos worker.
Cluster mode on YARN: yarn.Client wraps your program's main; YARN's ResourceManager allocates resources for the Spark ApplicationMaster, which starts on a cluster node and then launches the Driver Program there.
P.S. The driver program needs network connectivity with the workers.
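A concrete spark-submit invocation for YARN cluster mode might look as follows; the class name, jar, and input path are hypothetical placeholders, not from the source.

```
./bin/spark-submit \
  --class com.example.WordCount \
  --master yarn-cluster \
  --conf spark.executor.memory=2g \
  my-app.jar hdfs:///input/logs
```

Here yarn.Client wraps the application's main, and the Driver Program is started on a cluster node rather than on the submitting client.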
33. Deployment Issues: Security (TODO)
Spark Security
Spark currently supports authentication via a shared secret. Authentication is configured via the spark.authenticate configuration parameter.
- For Spark on YARN deployments, configuring spark.authenticate to true will automatically handle generating and distributing the shared secret.
- For other types of Spark deployments, the Spark parameter spark.authenticate.secret should be configured on each of the nodes.
- IMPORTANT NOTE: The experimental Netty shuffle path (spark.shuffle.use.netty) is not secured.
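The shared-secret setup above can be sketched as a spark-defaults.conf fragment; the secret value is a placeholder.

```
spark.authenticate         true
# Required on non-YARN deployments; the same value must be set on every node.
spark.authenticate.secret  <shared-secret>
```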
39. Case 2: Real-Time Log Aggregation & Analysis
http://spark-summit.org/wp-content/uploads/2013/10/Spark_Summit_2013_Jason_Dai.pdf
40. 1. Telefónica: real-time data processing
Kafka for data integration
Data preprocessing with Storm
Batch processing with Cassandra + Spark
2. Spark in Taobao
a) Spark on YARN
b) Spark Streaming
c) GraphX
Case 3: Other Cases
41. Tips & Tricks by Twitter (2014.4)
Problems encountered (TODO: updates after version 0.8.1):
1. Spilling data to disk is a work in progress
2. OOM: tasks too large
a) use a split size much smaller than the HDFS block size
b) if not using caching, reduce spark.storage.memoryFraction
c) increase reducer parallelism
3. YARN only supports static resource partitioning; how can resources be reclaimed?
4. Multi-tenancy: needs a clean, standard approach
5. Long-running SparkContext as a service: Spark Job Server
6. Failure mode requires recomputation: fine for small jobs, bad for large ones
7. Mostly task failures rather than executor failures
8. Use Spork (Pig on Spark)
9. Mainly Spark on YARN
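The memory tuning in item 2 can be sketched as a spark-defaults.conf fragment. The values are illustrative assumptions, not Twitter's settings, and these memoryFraction parameters belong to the pre-1.6 Spark memory model used in this era.

```
# With little or no RDD caching, shrink the storage pool (default 0.6)...
spark.storage.memoryFraction  0.2
# ...and give shuffle aggregation more room to reduce spilling (default 0.2)
spark.shuffle.memoryFraction  0.4
```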
42. Tips & Tricks by Sony (2014.4)
Problems encountered:
1. JVM issues: OOM; tune the executor memory and spark.storage/shuffle.memoryFraction
2. Full GC
3. Assembly jar size