SPARK+AI SUMMIT 2019
企觜ろ讀 蠍一螳覦
覦 (bigcastle@inavi.kr)
APRIL 23 - 25 | SAN FRANCISCO
Koalas Scheduling Policies
Nested Columns
Recommendation System
Design Structured Streaming Pipelines
Data Pipelines
Real-Time Analytics
Neptune

Experienced Things
OVERVIEW
  企語Ф
 What's Next for Apache Spark
 Databricks Platforms
 Streaming Data Pipelines
 Other Sessions
 企語Ф
WHAT'S NEXT FOR APACHE SPARK
SPARK 1.0: 2014
SPARK 2.0: 2016
SPARK 3.0: 2019 (expected)
1000螳讌 伎
蠍磯 覦 覯蠏 
APACHE SPARK 3.0
APACHE SPARK DESIGN PRINCIPLES
1 Unify Data + AI
2 Run Everywhere
3 Easy-to-use APIs
1 Unify Data + AI
Spark MLlib
PyTorch
TensorFlow
MXNet
CNTK
Tracking, Management
2 Run Everywhere
Spark Standalone Mode
3 Easy-to-use APIs
2013: APIs for data engineers
2015: APIs for data engineers & scientists
Typical journey of a Data Scientist
Education, Analyze Small Datasets: pandas
Analyze Large Datasets: Spark
Koalas - pandas DataFrame API on Spark
DATABRICKS PLATFORMS
DELTA LAKE
 一危磯 覿  譴觜 讌 
螻螳 一危 企Ν 企欧
殊 一危
企語Ф 一危
DELTA LAKE
 Delta Lake - Open Source Project
Delta Lake
ACID Transactions
Unified Streaming & Batch
Scalable Metadata handling
Time Travel
Schema enforcement
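The ACID Transactions and Time Travel bullets can be illustrated with a toy model. The sketch below is not Delta Lake's implementation; it only assumes the general idea of an ordered commit log, where each commit is a numbered JSON file created atomically and an older version is read by replaying the log up to that number. The class name ToyLog, the _toy_log directory, and the file names are hypothetical.

```python
import json
import os
import tempfile


class ToyLog:
    """Toy append-only commit log; NOT Delta Lake's actual implementation."""

    def __init__(self, root):
        self.log_dir = os.path.join(root, "_toy_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def _path(self, version):
        # Zero-padded names keep commits ordered when listed.
        return os.path.join(self.log_dir, f"{version:020d}.json")

    def latest_version(self):
        versions = [int(name.split(".")[0]) for name in os.listdir(self.log_dir)]
        return max(versions, default=-1)

    def commit(self, added_files):
        """Publish the next version; raises FileExistsError if another writer won."""
        version = self.latest_version() + 1
        # O_EXCL makes creation fail if the file already exists, so two
        # writers can never both claim the same version number.
        fd = os.open(self._path(version), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        with os.fdopen(fd, "w") as f:
            json.dump({"add": added_files}, f)
        return version

    def snapshot(self, version=None):
        """Replay commits up to `version` (inclusive) and list the visible files."""
        if version is None:
            version = self.latest_version()
        files = []
        for v in range(version + 1):
            with open(self._path(v)) as f:
                files.extend(json.load(f)["add"])
        return files


root = tempfile.mkdtemp()
log = ToyLog(root)
v0 = log.commit(["part-0.parquet"])
v1 = log.commit(["part-1.parquet"])
print(log.snapshot())    # files visible at the latest version
print(log.snapshot(v0))  # files visible at the first version ("time travel")
```

Readers never see a half-written commit in this model because a version only exists once its log file does, which is the property the ACID bullet refers to.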
DELTA LAKE
 覲旧″ -ろ豎襯  
Event
df.write.format("parquet").save("data")
df.write.format("delta").save("data")
DELTA LAKE
 Pros
 Full ACID Transactions 讌
 一危 覯
 覦一 & ろ碁Μ覦 牛
 蠍一ヾ Apache Spark API 100% 誤
 企 ろる 覲蟆 螳
 Cons
 Apache Spark 2.4.2 伎 讌
 所鍵 焔レ 願鍵  譯手鍵朱 Compaction(Merge)  .
 朱  蠍磯レ Managed 覯襷 讌螻 OSS覯   .
(ロ 讌螻)
MLFLOW: OPEN SOURCE ML PLATFORM
 ML Lifecycle
Raw Data -> Data Preparation -> Training -> Deployment
Raw Data: AWS S3, Hadoop, Delta Lake, MongoDB, Kafka
Data Preparation: Apache Spark, SQL, Python, Pandas, Scikit-learn
Training: Apache Spark, PyTorch, XGBoost, TensorFlow, R
Deployment: Docker, Apache Spark, AWS SageMaker, Mobile Phone
Model Exchange
MLFLOW: OPEN SOURCE ML PLATFORM
 Custom ML Platforms
Facebook FBLearner
Uber Michelangelo
Google TFX
Samsung Brightics AI
Dataiku
+ ML Cycle  朱 貅 蠏碁 螳碁 郁鍵襷 覃 .
- 螻襴讀/ 曙 蟇磯 朱 覃語襷
MLFLOW: OPEN SOURCE ML PLATFORM
 MLflow Tracking : experiment tracking
 MLflow Projects : reproducible runs
 MLflow Models : model packaging
MLFLOW: OPEN SOURCE ML PLATFORM
mlflow.log_param("lambda", 0.5)
mlflow.log_metric("rmse", 0.2)
 螳 貊襷 l伎朱
Managed by Databricks
Docs : mlflow.org
STREAMING DATA PIPELINES
谿瑚 語
 FIS - Life Is but a Stream
 SpotX - Spark Streaming
 COMCAST - Winning the Audience With AI
 Databricks - Productizing Structured Streaming Jobs
 Spark Committer - Designing Structured Streaming Pipelines
 Eventbrite - Near Real-Time Analytics With Apache Spark
 Sparkflows.io - Self-Service Apache Spark Structured Streaming Applications & Analytics
CASE: FIS GLOBAL
 1968 谿暑暑 讌 53,000覈 蠍 語企ゼ 襷 蠍一
 譯 螻螳 螻 蠍牛
BUSINESS INTELLIGENCE
HYBRID ETL
PURE STREAMING
CASE: FIS GLOBAL
豐谿所鍵 危殊
STREAMING EVOLUTION
CASE: FIS GLOBAL
 危 Streaming  Databricks Platform朱 覲伎譴
 https://github.com/KevenMellott91/spark-summit-2019-demo
CASE: SPOTX
 Spark Streaming, DStream, Structured Streaming襯 る
 螳覦 , Small Datasets 螳讌螻 Local Mode 螳覦 蟆 螳譟壱
 螳覦蟲 IntelliJ SBT襯 豢豌
 ろ .queueStream() 螻 螳  ろ碁ゼ 讌
 覈磯 襴る襯 る殊企 mysql, influxdb, grafana襯 牛 覈磯
 Kafka Offset 蟯襴 覦一襭 襴る襯 る殊企  MySQL offset 蠍磯
螻 曙伎  襦 
 覈螳讌  れ 螻旧
 kafka auto commit off, rdd.compress, spark.storage.memoryFraction
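The slide mentions keeping Kafka offsets in MySQL with auto commit turned off. A minimal sketch of that pattern follows, assuming the goal is to write the results and the next offset in one database transaction so that a replayed batch is not applied twice. sqlite3 stands in for MySQL, no Kafka client is involved, and the table and function names are hypothetical.

```python
import sqlite3

# sqlite3 is a stand-in for MySQL; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (topic TEXT, part INTEGER, offset_ INTEGER, value TEXT)"
)
conn.execute(
    "CREATE TABLE offsets (topic TEXT, part INTEGER, next_offset INTEGER, "
    "PRIMARY KEY (topic, part))"
)


def load_offset(topic, part):
    """Return the next offset to process, or 0 if nothing was stored yet."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE topic = ? AND part = ?",
        (topic, part),
    ).fetchone()
    return row[0] if row else 0


def process_batch(topic, part, records):
    """Apply (offset, value) records in ascending offset order, skipping seen ones."""
    start = load_offset(topic, part)
    fresh = [(o, v) for o, v in records if o >= start]
    if not fresh:
        return 0
    with conn:  # one transaction: results and offset are committed together
        conn.executemany(
            "INSERT INTO results VALUES (?, ?, ?, ?)",
            [(topic, part, o, v.upper()) for o, v in fresh],
        )
        conn.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
            (topic, part, fresh[-1][0] + 1),
        )
    return len(fresh)


applied = process_batch("ads", 0, [(0, "a"), (1, "b")])
replayed = process_batch("ads", 0, [(1, "b"), (2, "c")])  # offset 1 is a replay
print(applied, replayed, load_offset("ads", 0))
```

Because the offset only advances in the same transaction as the results, a job restarted after a failure resumes from the last committed position rather than from whatever the broker last recorded.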
CASE: COMCAST
 瑚 螳  貅企 觜 覦′ 覦 ISP
 煙朱 貉豸襯 谿場譯朱 襷 企れ 
 螻螳 襯 蠍 企れ
 豐 糾 語螻 覦焔 碁 覦
一危 讌 豌襴
(豌襴)
語
(覿)
豕
(/覦壱)
豐谿所鍵 危殊
CASE: COMCAST
 蠏碁蟆 覯螳 一給!
豐 1500襷 碁 覦
AWS S3 豐 3,500蟇伎
CASE: COMCAST
 2谿  (覿一襴)
S3
S3
S3
S3
る 覿 


伎 蟇一!
640 Machines
32 Jobs (2.5 PB)
key=1
key=2
key=3
CASE: COMCAST
 2谿  (覿一襴): 蠏碁Μ螻 覯螳 一蠍 .
S3
ERR
S3
S3
る 覿 


640 Machines
32 Jobs (2.5 PB)
key=1
key=2
key=3
COMPLEX!!
FREQUENT FAILURES!!
UNMANAGEABLE!!
CASE: COMCAST
 3谿  (Delta Lake): Scale, Reliability, Performance
S3
Auto Optimize
Delta Lake
Single Job
64 Machines
Enable Random Prefix
= No more Key Management
S3
Delta Lake
Auto Optimize
Delta Lake
Enable Random Prefix
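"Enable Random Prefix = No more Key Management" suggests that object keys get a randomized prefix instead of hand-assigned key=1, key=2, key=3 partitions. How Databricks implements the option is not shown on the slide; the sketch below only illustrates the general technique of deriving a short hash prefix so that writes spread across S3 key prefixes. The function name, the example paths, and the 4-character width are assumptions.

```python
import hashlib


def with_hash_prefix(key, width=4):
    """Prepend a short, deterministic hash prefix to an object key.

    Keys that share a long common path (for example a date) end up under
    different leading prefixes, so no one has to assign partition keys by hand.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:width]}/{key}"


for path in ["events/2019/04/23/part-0.parquet", "events/2019/04/23/part-1.parquet"]:
    print(with_hash_prefix(path))
```

The prefix is derived from the key itself, so the same path always maps to the same object and lookups do not need a separate index.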
CASE: COMCAST
 豢螳 覓語: Complex Development Environment of ML
れ 螳覦 蟆
PB  覦 一危
100螳讌 襷 覈
一危 螻狩螳 瑚 殊語
PYTORCH
XGBoost
Scikit-Learn
+
SLOW ITERATION
CASE: COMCAST
 SELF-SERVICE AI
PYTORCH
XGBoost
Scikit-Learn
Delta Lake
一危磯 Delta Lake襦
螳覦蟲 螳 Data Replication  豕
 覈語 Databricks Workspace襦
Notebook ク讌 覈襦 覈 /螳覦
覈語 mlflow襦
Tracking, Packaging
焔 覈語 Kubeflow襦
 觜る 覦壱, 
企Ν 1覯朱 ろ, 貊 , 豢, 覦壱
CASE: COMCAST
 襤一 覲: PB  一危磯ゼ 豌襴覃伎 螳 覦讌 
 語ろ伎 10覦 螳: 640 -> 64!
  一 レ: 瑚 一危 螻狩れ 
 觜襯 : 覈 譯 蟇碁Μ 覦壱螳 5覿襷 螳ロ伎
CASE: DATABRICKS
 Structured Streaming   覦 る 
(蟲豢/ろ/覈磯/覦壱)
 Data Pipelines @ Databricks
CASE: DATABRICKS
 Bronze Table
 一危磯ゼ 螳螻牛讌 螻, 譴覲旧蟇一 JSON朱 覲 Parquet Format朱 
 襷曙 蟆曙磯ゼ 觜 一危一 る螳 朱 蠍  2譯 螳 覲願
 Silver Table
 10/100 螳 讌 貎朱Μ襯 襴伎  企
 螳語覲 煙 襷ろ麹螻 朱 蟲 一危磯ゼ ロ
 Gold Table
 Silver Table襦 覿 一/讌螻 企
 一危 伎語 朱覿 襷れ伎
DESIGNING STRUCTURED STREAMING PIPELINES
 Tathagata Das (Spark Committer, PMC)
Spark 朱 batch-like 襦 ろ 螻 豕
DESIGNING STRUCTURED STREAMING PIPELINES
 Streaming Pipelines り  3螳讌 讌覓語 語狩
How?
What? Why?
一危磯 覓伎 一危一瑚?
覓伎 蟆郁骸螳  螳?
朱 觜襯 旧 蟲螳?
豌襴 朱 螳?
 ろ碁Μ覦朱 豌襴伎狩螳?
蟆郁骸 蟲(/貉危)襯  蟇願?
語  蟆郁骸瑚?
企至 一危磯ゼ 豌襴 蟆瑚?
企至 蟆郁骸襯 ロ 蟆瑚?
 覲企 襯
襷 豐襷 一危 蠍 癌
WHY?  企讌 朱 れ
 覈 覿/螳 
 豬る
襷 豐襷 一危  螳 
 襷 豐襷 レ襯
讌蠍 伎 襷り碓錫
(讌襷  一危磯 譯 レ螳 覦 )
 讌 一危一
蟆郁骸   譟一襯
 螳 
(一危磯 key-value ろ伎 螻 )
一危磯ゼ 襾語 旧
 蟇一錫
Key-value ろ企
一危 れ 朱 誤
 一危磯ゼ 豌襴 讌
DESIGNING STRUCTURED STREAMING PIPELINES
 Streaming Design Patterns
How?
What?
Why?
觜 一危磯ゼ
蟲譟一 企 一危磯 覲蟇一
Latency : few minutes
蟲譟壱 豕 一危磯ゼ
誤磯磯蟆 讌蟇磯,
覦一  伎 
Structured Streaming  一危磯ゼ 
レ  螳ロ 蟲譟一 ろ襴讌襯   蟆.
Data Skipping 讌 .
=> Parquet, ORC, Delta Lake, or even better
OTHER SESSIONS
TensorFlow 2.0
 TensorFlow 2.0 High Level API - Keras
 Improved Debugger with Eager Execution
 Distribution Strategy - Easy-to-use Training on Multiple GPUs
 Deploy Anywhere
 Server - TensorFlow Extended
 Edge Devices (Mobile) - TensorFlow Lite
 JavaScript - TensorFlow.js
Geospatial Analytics at Scale with Deep Learning and Apache Spark
 Databricks  覦
  煙讌( 企語Ф) ル 伎 谿襯 語螻
企ゼ 豌襴 讌 碁У 襦 訖れ朱 螻殊 伎手鍵
 Magellan  螳
Geospatial Analytics: Magellan
 Geospatial 覿  覿 ろ 讌 ろ 殊企襴
 れ 襷血 讌
 ESRI, GeoJSON, OSM-XML, WKT
 蠍磯蓋 讌る碁Ν 一一 螳ロ
 Polygon intersection, Joining
 Spark SQL 讌 牛  狩襾殊る 碁煙るゼ 燕
https://github.com/harsha2010/magellan
ETC
 Microsoft - Black in AI
 KPMG - Overview of the Recommend System
 Apple - Nested Columns Support in Parquet
 Netflix - Recommendation System Taste Cluster
 Neptune: Extended DAG Scheduler @ Spark 2.4 extension
 Dask - Distributed Parallel Computing in Python
SUMMARY
 語れ 企  ク
 蠍一 る螻 襦 襯 襷 れ伎が
 Apache Spark襯  Summit
 Spark  覲螳  蟆朱 
 Hive 3 覲豌  覲螳 螻 讌 譴
Spark 2.3螻 2.4 谿企 襷れ 貉れ
 朱 Spark 瑚鍵 讌 蟆朱
覲伎
 Structured Streaming 覓 螳ロ.
 Lambda ろ豎 蟲 螳.
螳矧.
