10. SPARK+AI SUMMIT 2019
APACHE SPARK DESIGN PRINCIPLES
1 Unify Data + AI
2 Run Everywhere
3 Easy-to-use APIs
Spark MLlib
PyTorch
TensorFlow
MXNet
CNTK
Tracking,
Management
11. SPARK+AI SUMMIT 2019
APACHE SPARK DESIGN PRINCIPLES
1 Unify Data + AI
2 Run Everywhere
3 Easy-to-use APIs
Spark
Standalone Mode
12. SPARK+AI SUMMIT 2019
APACHE SPARK DESIGN PRINCIPLES
1 Unify Data + AI
2 Run Everywhere
3 Easy-to-use APIs
2013: APIs for data engineers
2015: APIs for data engineers & scientists
13. SPARK+AI SUMMIT 2019
APACHE SPARK DESIGN PRINCIPLES
1 Unify Data + AI
2 Run Everywhere
3 Easy-to-use APIs
Typical journey of a Data Scientist
Education, Analyze Small Datasets: pandas
Analyze Large Datasets: Spark
Koalas: pandas DataFrame API on Spark
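The point of Koalas is that analysis code written against the pandas DataFrame API can be reused on Spark. A minimal sketch of that idea follows; the function name `mean_fare_by_city`, the column names, and the sample values are made up for illustration, and the Koalas lines are left as comments because they assume a Spark environment with the `databricks.koalas` package installed.

```python
import pandas as pd


def mean_fare_by_city(df):
    """Average fare per city; relies only on the pandas DataFrame API."""
    return df.groupby("city")["fare"].mean().sort_index()


# Small dataset: plain pandas on a single machine (hypothetical sample data).
small = pd.DataFrame(
    {"city": ["A", "B", "A"], "fare": [10.0, 20.0, 30.0]}
)
print(mean_fare_by_city(small))

# Large dataset: with Koalas available, the same function could be applied
# to a Spark-backed frame (assumption: Spark and databricks.koalas installed):
#   import databricks.koalas as ks
#   big = ks.from_pandas(small)
#   print(mean_fare_by_city(big).to_pandas())
```

The function body does not change between the two cases; only the object passed in differs.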
21. SPARK+AI SUMMIT 2019
MLFLOW: OPEN SOURCE ML PLATFORM
ML Lifecycle
Raw Data → Data Preparation → Training → Deployment
22. SPARK+AI SUMMIT 2019
MLFLOW: OPEN SOURCE ML PLATFORM
ML Lifecycle
Raw Data → Data Preparation → Training → Deployment
AWS S3
Hadoop
Delta Lake
MongoDB
Kafka
Apache Spark
SQL
Python
Pandas
Scikit-learn
Apache Spark
PyTorch
XGBoost
TensorFlow
R
Docker
Apache Spark
AWS SageMaker
Mobile Phone
Model Exchange
23. SPARK+AI SUMMIT 2019
MLFLOW: OPEN SOURCE ML PLATFORM
Custom ML Platforms
Facebook FBLearner
Uber Michelangelo
Google TFX
Samsung Brightics AI
Dataiku
+ Standardize the ML cycle within the platform
- Limited to a few algorithms and frameworks
24. SPARK+AI SUMMIT 2019
MLFLOW: OPEN SOURCE ML PLATFORM
MLflow Tracking: experiment tracking
MLflow Projects: reproducible runs
MLflow Models: model packaging
25. SPARK+AI SUMMIT 2019
MLFLOW: OPEN SOURCE ML PLATFORM
mlflow.log_param("lambda", 0.5)
mlflow.log_metric("rmse", 0.2)
Managed by Databricks
Docs : mlflow.org
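The two tracking calls on this slide can be placed in context with a short sketch. It assumes the `mlflow` package is installed; the helper names `rmse` and `log_run` and the sample numbers are hypothetical, and `mlflow` is imported inside `log_run` so the metric helper can be used without it.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def log_run(params, metrics):
    """Record one experiment run with MLflow Tracking.

    Assumption: `pip install mlflow` has been done; by default the run is
    written to a local ./mlruns directory.
    """
    import mlflow

    with mlflow.start_run():
        for name, value in params.items():
            mlflow.log_param(name, value)
        for name, value in metrics.items():
            mlflow.log_metric(name, value)


# Example usage (hypothetical values):
#   score = rmse([3.0, 5.0, 7.0], [2.8, 5.1, 7.3])
#   log_run({"lambda": 0.5}, {"rmse": score})
```

Logged runs can then be browsed with the `mlflow ui` command.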
36. SPARK+AI SUMMIT 2019
CASE: COMCAST
3rd iteration (Delta Lake): Scale, Reliability, Performance
S3
Auto Optimize
Delta Lake
Single Job
64 Machines
Enable Random Prefix
= No more Key Management
S3
Delta Lake
Auto Optimize
Delta Lake
Enable Random Prefix
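"Enable Random Prefix" refers to spreading object keys across S3 key ranges so that no single prefix becomes a hotspot and nobody has to design key layouts by hand. The following is a minimal sketch of that general idea only, not Delta Lake's implementation; the function name `with_hash_prefix`, the example key, and the two-character prefix length are assumptions for illustration.

```python
import hashlib


def with_hash_prefix(key: str, length: int = 2) -> str:
    """Prepend a short, deterministic hash prefix to an object key.

    Keys that share a logical directory end up under different leading
    prefixes, which spreads reads and writes across the key space.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:length]}/{key}"


# Hypothetical file names under one logical partition.
keys = [f"events/date=2019-04-24/part-{i:05d}.parquet" for i in range(4)]
for key in keys:
    print(with_hash_prefix(key))
```

Because the prefix is derived from the key itself, the mapping is repeatable; a table format that records file paths in its own log could equally use a truly random prefix.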
49. SPARK+AI SUMMIT 2019
TensorFlow 2.0
TensorFlow 2.0 High Level API - Keras
Improved Debugger with Eager Execution
Distribution Strategy - Easy-to-use Training on Multiple GPUs
Deploy Anywhere
Server - TensorFlow Extended
Edge Devices (Mobile) - TensorFlow Lite
JavaScript - TensorFlow.js
50. SPARK+AI SUMMIT 2019
Geospatial Analytics at Scale with Deep Learning and Apache Spark
Databricks
Magellan
53. SPARK+AI SUMMIT 2019
ETC
Microsoft - Black in AI
KPMG - Overview of Recommender Systems
Apple - Nested Columns Support in Parquet
Netflix - Recommendation System Taste Cluster
Neptune: Extended DAG Scheduler @ Spark 2.4 extension
Dask - Distributed Parallel Computing in Python