19. Basic Architecture: Introduction to Distributed Computing
[Diagram: the Client submits a job (e.g. Logistic Regression) to the Cluster Manager, which schedules an Executor onto each Worker Node; every Worker Node caches its portion of the Data in memory]
1. The Cluster Manager assigns the work to the worker nodes
2. Data is cached in memory
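As a rough sketch of this step in Scala with Spark's RDD API: the driver registers with a cluster manager, which launches executors, and cache() keeps the loaded partitions in executor memory. The master URL and the HDFS path below are placeholders, not values from the slides.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical master URL and input path, for illustration only.
val conf = new SparkConf()
  .setAppName("BasicArchitectureDemo")
  .setMaster("spark://cluster-manager:7077") // the cluster manager that schedules executors
val sc = new SparkContext(conf)

// The input is split into partitions across the worker nodes;
// cache() asks Spark to keep those partitions in executor memory.
val data = sc.textFile("hdfs:///path/to/input").cache()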
20. Basic Architecture: Introduction to Distributed Computing
[Diagram: the Executor on each Worker Node runs Tasks against its cached Data]
1. The worker nodes execute their assigned tasks
2. Data may be exchanged between nodes during this stage
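A minimal sketch of this stage, continuing from the data RDD in the previous snippet: the map-side work runs as tasks on each executor's local partitions, and reduceByKey regroups records by key, which is where inter-node data exchange (a shuffle) can occur.

// Map tasks run locally on each executor's cached partitions.
// reduceByKey repartitions the records by key, so data may move between nodes (a shuffle).
val wordCounts = data
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)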
21. Basic Architecture: Introduction to Distributed Computing
[Diagram: an Executor runs a Reduce task that gathers the Task outputs from the Worker Nodes]
1. The computation results from the worker nodes are combined
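To illustrate the combine step, an action such as reduce() merges the per-partition results computed on the worker nodes into one value on the driver (continuing the word-count sketch above).

// reduce() combines partial results from every worker node
// and returns the single combined value to the driver program.
val totalWords = wordCounts.map { case (_, count) => count }.reduce(_ + _)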
22. Basic Architecture: Introduction to Distributed Computing
[Diagram: the combined result is sent from the Executors back to the Client]
1. The final result is returned to the user
2. Depending on the job (e.g. iterative Logistic Regression), decide whether to run the next iteration
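The Logistic Regression label in these diagrams is the classic iterative example: each pass over the cached data is one distributed job, the driver receives the combined gradient, updates the weights, and decides whether to run another pass. The sketch below is hypothetical and uses a tiny hard-coded dataset; points, numFeatures, and numIterations are illustrative names, not code from the deck.

case class Point(features: Array[Double], label: Double)

// Toy dataset with labels in {-1, +1}, cached on the worker nodes (illustrative only).
val points = sc.parallelize(Seq(
  Point(Array(1.0, 0.5), 1.0),
  Point(Array(-0.3, 2.0), -1.0)
)).cache()
val numFeatures   = 2
val numIterations = 20

var w = Array.fill(numFeatures)(0.0)                        // model weights live on the driver
for (iter <- 1 to numIterations) {                          // each pass is one distributed job
  val gradient = points.map { p =>
    val margin = p.label * w.zip(p.features).map { case (wi, xi) => wi * xi }.sum
    val scale  = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * p.label
    p.features.map(_ * scale)
  }.reduce { (a, b) => a.zip(b).map { case (ai, bi) => ai + bi } }  // combined across workers

  w = w.zip(gradient).map { case (wi, gi) => wi - gi }      // driver updates w, then decides on the next pass
}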
23. Basic Architecture: RDD (Resilient Distributed Dataset)
[Diagram: the client program manipulates RDD objects whose partitions are distributed and cached across the Worker Nodes]
1. Create
2. Transformation
3. Action
4. Cache
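The four steps above map directly onto the RDD API. A minimal, self-contained sketch, assuming a local master and a placeholder input path:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("RddLifecycle").setMaster("local[*]"))

val lines   = sc.textFile("hdfs:///path/to/input")   // 1. Create an RDD from a data source
val lengths = lines.map(_.length).filter(_ > 0)      // 2. Transformations are lazy; nothing runs yet
lengths.cache()                                      // 4. Cache: keep the partitions in executor memory
val total   = lengths.reduce(_ + _)                  // 3. Action: triggers the job and returns a result to the client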