DataWorks Summit 2017
SAN JOSE, USA JUNE 13-15
 蠍一螳覦
覦
KEYPOINT
譯殊 
Streaming
Processing
Machine
Learning
 Apache NiFi & MiNiFi
 Tensorflow On Spark
 Caffe On Spark
 Anomaly Detection
 Fraud Prevention
 Financial Crime Detection
 Traffic Prediction
 Apache SPARK
 Deep Learning
 Apache Kafka
Interactive
Processing
& Analysis
 Apache Zeppelin
 IBM (Data Science Experience)
 CASK
 Infoworks
CONTENTS TECHNICAL SESSION
Druid
TensorflowOnSpark
Apache Kudu
Hadoop Query Performance Smackdown
Apache Calcite
Apache Beam
The Future Of Apache Ambari
INTERACTIVE PROCESSING & ANALYSIS
DRUID
 A data store engine specialized for processing time-series data.
 Apache Open Source License伎, 讌 Apache 襦碁 .
 殊 讌 れ螳 螻 一危磯ゼ 企至 れ螳朱 觜襯
蟆 譟壱   蟾 朱 螻覩殊朱 .
 螻  一危磯れ 譯  Queryれ 危エ覲企 れ螻
螳 襦螳 襷 覦.
 蠏碁 螳 蠍一朱 一 .
SELECT year, month, city, SUM(sales)
FROM table
WHERE month IN (12, 11, 10) AND year=2017 AND state='CA'
GROUP BY year, month, city
timestamp publisher advertiser click price
2017-01-01T00:00:35Z thinkware.com google.com 0 0.65
2017-01-01T00:07:35Z thinkware.com google.com 1 0.45
2017-01-01T01:02:00Z facebook.com google.com 0 0.82
2017-01-01T02:36:00Z facebook.com google.com 1 1.53
 朱朱 瑚係襾狩  5覦焔 朱 
 襭企 螳 瑚係襾狩碁 貉  蠍磯朱
 貉殊 一危一 id襯 覿 一危磯ゼ 豢 (String Type襷 讌)
 thinkware.com -> 0, facebook.com -> 1
 google.com -> 0
timestamp publisher advertiser click price
2017-01-01T00:00:35Z thinkware.com google.com 0 0.65
2017-01-01T00:03:11Z thinkware.com google.com 0 0.62
2017-01-01T00:07:35Z thinkware.com google.com 1 0.45
2017-01-01T01:02:00Z facebook.com google.com 0 0.82
2017-01-01T02:01:00Z facebook.com google.com 0 0.91
2017-01-01T02:36:00Z facebook.com google.com 1 1.53
 ル 
{publisher} -> [0, 0, 0, 1, 1, 1]
{advertiser} -> [0, 0, 0, 0, 0, 0]
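The dictionary encoding shown on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not Druid's implementation: the function name `dictionary_encode` is hypothetical, and ids are assigned in first-seen order, which is an assumption that happens to match the ids on the slide.

```python
def dictionary_encode(values):
    """Map each distinct string to a small integer id (first-seen order, an assumption)."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids)  # next unused id
        encoded.append(ids[value])
    return ids, encoded


# The publisher and advertiser columns from the six sample rows above.
publisher = ["thinkware.com"] * 3 + ["facebook.com"] * 3
advertiser = ["google.com"] * 6

publisher_ids, publisher_column = dictionary_encode(publisher)
advertiser_ids, advertiser_column = dictionary_encode(advertiser)

print(publisher_ids, publisher_column)
print(advertiser_ids, advertiser_column)
```

Storing the integer column plus one small dictionary per column is what makes the string dimensions compact.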
 一危磯ゼ 譟壱蠍一 INDEX 螻襴讀 BITMAP INDEX襯 
 煙ロ 豺襯 蠍磯
timestamp publisher advertiser click price
2017-01-01T00:00:35Z thinkware.com google.com 0 0.65
2017-01-01T00:03:11Z thinkware.com google.com 0 0.62
2017-01-01T00:07:35Z thinkware.com google.com 1 0.45
2017-01-01T01:02:00Z facebook.com google.com 0 0.82
2017-01-01T02:01:00Z facebook.com google.com 0 0.91
2017-01-01T02:36:00Z facebook.com google.com 1 1.53
thinkware.com -> [111000]
facebook.com -> [000111]
 BITMAP INDEX OR 一 
thinkware.com -> [111000]
facebook.com -> [000111]
SELECT * FROM table
WHERE publisher= 'thinkware.com' OR publisher='facebook.com'
thinkware.com
OR
facebook.com
[111000]
OR
[000111]
[111111]
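A minimal Python sketch of the bitmap OR above, using plain lists of 0/1 in place of compressed bitmaps. The helper names `build_bitmap_index` and `bitmap_or` are hypothetical and only illustrate the technique; a real engine would use a compressed bitmap format.

```python
def build_bitmap_index(column):
    """Build one bitmap per distinct value; bit i is 1 when row i holds that value."""
    index = {}
    for row, value in enumerate(column):
        bits = index.setdefault(value, [0] * len(column))
        bits[row] = 1
    return index


def bitmap_or(left, right):
    """Bitwise OR of two equal-length bitmaps."""
    return [a | b for a, b in zip(left, right)]


publisher = ["thinkware.com"] * 3 + ["facebook.com"] * 3
index = build_bitmap_index(publisher)

# WHERE publisher = 'thinkware.com' OR publisher = 'facebook.com'
matches = bitmap_or(index["thinkware.com"], index["facebook.com"])
matching_rows = [row + 1 for row, bit in enumerate(matches) if bit]  # 1-based row numbers

print(matches)
print(matching_rows)
```

The filter is answered from the index alone; the rows themselves are read only after the matching positions are known.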
"1~6螻 襷れ広"
"Druid is NOT time series DB"
Druid 覲 一危磯ゼ ロ讌 螻 蠍一ヾ 一危一 indexing 覲企 .
BROKER
REALTIME
HISTORICAL HDFS
CLIENT
DATA STREAM
HAND OFF
INDEXING
INDEXING
 BENCHMARK
 Dataset : TPC-H 100G lineitem (600M Rows)
 AWS EMR 5.0.0, r3.4xlarge (2.5GHz * 16, 122G, 320G SSD) * 6 workers
AGGREGATION QUERY
(SUM, COUNT, LIKE, GROUP BY)
TOP-N QUERY
(ORDER BY)
http://www.popit.kr/druid-spark-performance
 BENCHMARK
 Druid VS Apache SPARK
http://www.popit.kr/druid-spark-performance
GROUP BY
timestamp襯 蠍一朱 rollup  轟 覓
=> 一危  , timestamp segment 蠍一朱 aggregation 螳れ 覩碁Μ 一
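The rollup mentioned here can be illustrated with the six sample rows from the earlier slides. The sketch below truncates each timestamp to the hour and pre-aggregates the metrics per (hour, publisher, advertiser). The hourly granularity and the function name `rollup_by_hour` are assumptions made for the example, not settings taken from the benchmark.

```python
from collections import defaultdict

rows = [
    ("2017-01-01T00:00:35Z", "thinkware.com", "google.com", 0, 0.65),
    ("2017-01-01T00:03:11Z", "thinkware.com", "google.com", 0, 0.62),
    ("2017-01-01T00:07:35Z", "thinkware.com", "google.com", 1, 0.45),
    ("2017-01-01T01:02:00Z", "facebook.com", "google.com", 0, 0.82),
    ("2017-01-01T02:01:00Z", "facebook.com", "google.com", 0, 0.91),
    ("2017-01-01T02:36:00Z", "facebook.com", "google.com", 1, 1.53),
]


def rollup_by_hour(rows):
    """Pre-aggregate metrics per (hour, publisher, advertiser) at ingestion time."""
    totals = defaultdict(lambda: {"count": 0, "click": 0, "price": 0.0})
    for timestamp, publisher, advertiser, click, price in rows:
        hour = timestamp[:13] + ":00:00Z"  # keep "YYYY-MM-DDTHH", drop minutes and seconds
        bucket = totals[(hour, publisher, advertiser)]
        bucket["count"] += 1
        bucket["click"] += click
        bucket["price"] += price
    return dict(totals)


rolled = rollup_by_hour(rows)
for key in sorted(rolled):
    print(key, rolled[key])
```

Six input rows collapse into three stored rows, so a later GROUP BY over time reads pre-computed sums instead of raw events.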
curl -X POST '<queryable_host>:<port>/druid/v2/?pretty' -H 'Content-Type:application/json' -d @<query_json_file>
 DRUID QUERY
 HTTP REST API
 JSON Format 朱 一危磯ゼ 譟壱.
 Aggregation Queries
(一危磯ゼ 轟 螳朱 磯 螻
螳覲襦 users蠏 讌螻 貎朱Μ 朱)
() Query襦 覃 伎
SELECT hour, AVG(users) FROM tbl
WHERE country='US' AND gender='M' GROUP BY hour
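As a rough sketch, the SQL above could be expressed as a Druid native timeseries query along the following lines. The data source name `tbl`, the column names, the interval, and the output names `users_sum`, `row_count`, and `avg_users` are all assumptions made for illustration. The average is computed as a sum divided by a count through a post-aggregation; with rollup enabled the `count` aggregator counts stored rows rather than raw events, so a real setup may need an ingested count metric instead.

```python
import json

# Hypothetical native query; names and interval are placeholders, not from the slides.
query = {
    "queryType": "timeseries",
    "dataSource": "tbl",
    "granularity": "hour",  # GROUP BY hour
    "intervals": ["2017-01-01T00:00:00.000/2017-01-02T00:00:00.000"],
    "filter": {  # WHERE country='US' AND gender='M'
        "type": "and",
        "fields": [
            {"type": "selector", "dimension": "country", "value": "US"},
            {"type": "selector", "dimension": "gender", "value": "M"},
        ],
    },
    "aggregations": [
        {"type": "doubleSum", "name": "users_sum", "fieldName": "users"},
        {"type": "count", "name": "row_count"},
    ],
    "postAggregations": [  # AVG(users) = users_sum / row_count
        {
            "type": "arithmetic",
            "name": "avg_users",
            "fn": "/",
            "fields": [
                {"type": "fieldAccess", "name": "users_sum", "fieldName": "users_sum"},
                {"type": "fieldAccess", "name": "row_count", "fieldName": "row_count"},
            ],
        }
    ],
}

query_json = json.dumps(query, indent=2)
print(query_json)
```

The printed JSON could be saved to a file and posted with the curl command shown on this slide.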
"Druid is NOT time series DB"
Druid 覲 一危磯ゼ ロ讌 螻 蠍一ヾ 一危一 indexing 覲企 .
Table JOIN Syntax
 蠍郁讌螳 螻手碓 DRUID 伎手鍵
Druid 螻手碓 る
Dataworks2017 覦
 Druid 0.10.0  伎手鍵
Built-in SQL (Powered by Apache Calcite)
- REST API 訖襷  伎 JDBC Driver 螻.
- HIVE StorageHandler襯  Druid Input format 蟲
- Druid 蠍磯 Hive Table  襷  
Druid Input Format for Hive
 Druid Query Recognition (Powered by Apache Calcite)
SELECT user, SUM(sales) AS s
FROM druid_table
WHERE month IN (12, 11, 10)
AND year=2017 AND state='CA'
GROUP BY user ORDER BY s DESC
LIMIT 10;
Apache Hive Query
{
  "queryType": "topN",
  "dataSource": "sample_data",
  "dimension": "sample_dim",
  "threshold": 5,
  "metric": "count",
  "granularity": "all",
  "filter": {
    "type": "and",
    "fields": [
      { "type": "selector", "dimension": "dim1", "value": "some_value" },
      { "type": "selector", "dimension": "dim2", "value": "some_other_val" }
    ]
  },
  "aggregations": [
    { "type": "longSum", "name": "count", "fieldName": "count" },
    { "type": "doubleSum", "name": "some_metric", "fieldName": "some_metric" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "sample_divide",
      "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "some_metric", "fieldName": "some_metric" },
        { "type": "fieldAccess", "name": "count", "fieldName": "count" }
      ]
    }
  ],
  "intervals": [ "2013-08-31T00:00:00.000/2013-09-03T00:00:00.000" ]
}
Druid JSON Query
Table Scan
File Sink
Druid Input Format
 Registering Druid Data Sources
 Point hive to the broker
- SET hive.druid.broker.address.default=druid.broker.hostname:8082;
 Create external table Statement
CREATE EXTERNAL TABLE druid_table
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "druid_source");
 Push Data to Druid without Hive
 Push Data to Druid with Hive
CREATE TABLE druid_table
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource"="druid_source",
"druid.segment.granularity"="HOUR")
AS SELECT time, page, user, c_added FROM src;
 Benefits both to Druid and Apache Hive
Druid
SQL Query襯  螳ロ
Hive襯 牛 Join螻 螳 覲旧″ 一一 螳ロ伎
Hive
れ螳 一危一  豌襴螳 螳ロ伎
 蟆磯
 time dimension 一危一  焔レ .
 apache calcite & hive 磯朱 碁 伎 蟆 .
 讌襷
Order by 煙 れ願 top-n Query 
time dimension   貎朱Μ 一危一 伎 
 蟲 覈 蠍磯 , 蟯 讌螻 朱
讌蟾讌  覯 0.10.0 覯
TensorFlowOnSpark
- 2016 04 覿, 覲豌襴 蠍磯レ 豢螳.
- Hadoop Eco-System 蟆曙  煙 .
- Apache Spark襯 蠍磯朱 Tensorflow 襯    螳 螳覦.
(Yahoo  谿語)
譯殊 轟
 Pyspark 蠍磯
 Tensorboard 襯  螳.
 Python 2.7 - 3.x, Spark 1.6-2.x, TensorFlow 0.12-1.x, Hadoop 2.x 讌
APACHE KUDU
CRUD襯 讌 Column 蠍磯 ろ襴讌 讌. ( NoSQL DB)
Kudu?
覿 襴豺伎  譴  殊.
 企語 豢豌 :
 Traditional Hadoop Storage Leaves a Gap
 Hive HBase  譴願 レ 覈  蟾?
 RDB Table 一危 蟲譟
 High-latency
 CRUD 覿螳
 Low-latency Random Access
 CRUD 讌
 row-key り螳 企れ
(Monotonically Increasing keys 覓語)
 Cloudera 螳覦. 譯朱朱 谿語.
 Key-Value 蠍磯企, RDB Table  一危磯語 螳讌
(HBase Row-key り覓語襯 螻覩狩讌  螻, JSON朱 螳谿 一危磯ゼ れる慨讌  )
 API襯 牛 讌 蠏殊 螳ロ,
SQL 伎 RDB豌 郁鍵 Impala襯 轟伎 磯 蟆 朱. (Spark 讌)
れ襦 Impala  誤.
(企る慨 Impala襯 襾濠鍵  Kudu襯 襷蟇 蟾 螳..)
Kudu? Druid?
DB Engine
Not DB!!
HADOOP QUERY PERFORMANCE
SMACKDOWN
 讌
貉伎ろ碁 瑚 螳  貅企 觜螻 覦 企 覩瑚記 螳  誤磯 觜 螻旧豌
<る葦螻>
讌 覲 一危  豢 蠍一  狩襾殊るれ 觜蟲
MapReduce
LLAP
Presto
Tez
 ORC
 None
 Zlib
 Snappy
 Parquet
 None
 Snappy
 Gzip
 Sequence
 None
 Snappy
 Gzip
 Bzip
 Text
 None
 Snappy
 Gzip
 Bzip
ろ碁ゼ  QuerySet 66螳襯 譴觜
ろ 蟆郁骸, 螳 蟆曙 れ 貎朱Μれ ろ
- MapReduce : 8 Queries (80 Min)
- LLAP : 1 Query (10 Min)
- Spark : 2 Queries (20 Min)
- Presto : 5 Queries (50 Min)
- Tez : 2 Queries (20 Min)
MapReduce CUMULATIVE QUERY TIMES 1TB TPC_DS
(66螳 貎朱Μ襯  36螳企 伎 蠍)
蠏 豌襴  觜蟲 (豐) 螳 觜襯願  讌覲 貎朱Μ  觜蟲 (蟇)
LLAP螳 !
APACHE CALCITE
Query Planning Framework
Calcite?
覦伎. 覲糾鬼襯  蟯覓
- Cost-based Query Optimization
- 2013 豺 襦語 焔
- October 2015: graduated to an Apache top-level project.
Julian Hyde螳 襦 譯朱
( Hortonworks ,
貉朱一 轟 覦襦 谿語)
Planning Queries
SELECT p.productName, COUNT(*) as cnt
FROM splunk.splunk AS s
JOIN mysql.products AS p
ON s.productID = p.productID
WHERE s.action = 'purchase'
GROUP BY p.productName
ORDER BY cnt DESC
SCAN SCAN
JOIN
FILTER
GROUP
BY
ORDER
BY
splunk mysql
KEY : productID
action="purchase"
Optimized Queries
SELECT p.productName, COUNT(*) as cnt
FROM splunk.splunk AS s
JOIN mysql.products AS p
ON s.productID = p.productID
WHERE s.action = 'purchase'
GROUP BY p.productName
ORDER BY cnt DESC
SCAN SCAN
JOIN
GROUP
BY
ORDER
BY
splunk mysql
KEY : productID
action="purchase"
FILTER
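The difference between the two plans, with the FILTER moved below the JOIN onto the splunk scan, can be illustrated with a small Python sketch. This is not Calcite's API: the sample rows, product names, and function names are hypothetical and exist only to show that both plans return the same answer while the optimized one joins fewer rows.

```python
from collections import Counter

# Hypothetical sample data standing in for splunk.splunk and mysql.products.
splunk_events = [
    {"productID": 1, "action": "purchase"},
    {"productID": 1, "action": "view"},
    {"productID": 2, "action": "purchase"},
    {"productID": 1, "action": "purchase"},
    {"productID": 3, "action": "view"},
]
products = [
    {"productID": 1, "productName": "alpha"},
    {"productID": 2, "productName": "beta"},
    {"productID": 3, "productName": "gamma"},
]


def join(events, products):
    """Inner join on productID."""
    by_id = {p["productID"]: p for p in products}
    return [{**e, **by_id[e["productID"]]} for e in events if e["productID"] in by_id]


def count_by_name(rows):
    """GROUP BY productName, ORDER BY cnt DESC."""
    counts = Counter(r["productName"] for r in rows)
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


def filter_after_join(events, products):
    joined = join(events, products)  # joins every event
    kept = [r for r in joined if r["action"] == "purchase"]
    return count_by_name(kept), len(joined)


def filter_before_join(events, products):
    kept = [e for e in events if e["action"] == "purchase"]  # filter pushed to the scan
    joined = join(kept, products)  # joins only purchases
    return count_by_name(joined), len(joined)


naive_result, naive_joined = filter_after_join(splunk_events, products)
pushed_result, pushed_joined = filter_before_join(splunk_events, products)
print(naive_result, naive_joined)
print(pushed_result, pushed_joined)
```

Because the filter only touches columns of the splunk side, applying it before the join cannot change the result; it only shrinks the join input.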
Using AdaptiveMonteCarlo Algorithm
 Harinarayan, Rajaraman, Ullman(1996), "Implementing data cubes efficiently"
 org.pentaho.aggdes.algorithm.impl.AdaptiveMonteCarloAlgorithm
THE FUTURE OF APACHE AMBARI
蠍一ヾ AMBARI 螻
One-One Relationships
Ambari Ambari
Ambari
Cluster Cluster Cluster
Cluster
Ambari
Multi-Cluster 覿螳
HDP 2.6
Druid 0.9.2
Falcon 0.10.0
Spark2 2.1.1
.
HDP 2.6
Druid 0.9.2
Falcon 0.10.0
Spark2 2.1.1
.
Spark2 2.2.0
螳螳 Stack 覯る hdp 1:1 譬
 伎語 觜る  覦 伎 覈詩 覓語螳 .
(ex : Hive 觜るゼ 覦 伎 覈詩) 蠍一ヾ 蟇一 ろ Hive1.2
磯  2.0 螳 郁 苦..
Ambari襯 Modular  
- Multiple Clusters
- Multiple Stack
- Multiple Service
- Multiple Service Versions
- Multiple Host Components
- Multiple Hosting Platforms
Packlets
Mpacks
螳覲願 襴曙 觜るれ れ
ex) HDFS-3.0.0-packlet, SPARK-2.0.0-packlet
Packlets襯 覈 覦壱 れ
ex) HDP-3.0.0-mpack, HDF-3.1.0-mpack
= Modular Upgrades 螳
INTERACTIVE PROCESSING & ANALYSIS
Visualization
企 Dataworks2017 谿語 蠍一れ
螳 蟲れ 襷 螻糾.
企至 一危磯ゼ  觜襯願 豌襴 蟆瑚
企至 一危磯ゼ 覲伎譯手 覿 蟆瑚
ろ 觜 覓瑚螳 一危 覿 る?
Web-Based Application
蟇一 覈 蠍一れ 企 れ
Web-Application .
Interactive User-Interface
覿覿 襭れ  UI 蟆 磯狩
- Jupyter Notebook
- Apache NiFi
- Apache Zeppelin
- Tableau
IBM Data Science Experience ( https://dataplatform.ibm.com/ )
Jupyter Notebook 蠍磯, Spark RStudio 襯 伎.
蠏碁 Machine Learning (scikit-learn, Spark MLlib) , Deep Learning(Tensorflow, Caffe..) 螳 り 覲.
Cloud, Local, Desktop 3螳讌   螻.
Cask Data Application Platform (CDAP) http://cask.co/products/cdap/
OpenSource, Apache 2.0 License.
Apache Nifi   Nifi 一危 襴 豺譴る, CDAP 一危 豌襴 覿 譴
(59豐覿 )
Infoworks Cloud Data Warehouse http://www.infoworks.io/cloud
れ 企殊磯 觜れ   一危磯れ 牛, 螳螻牛 譯朱 襭
Pentaho http://www.pentaho.com/product
豺 蠏碁9 Pentaho  る/螳覦. Spark 誤 .
Nifi 誤壱伎 豌 ろ 覲 一危 豌襴 螻殊 煙 所 螻,
Tableu 螳 誤壱伎る  ろ覲 豢ル 蟆郁骸 一危磯れ 螳襯  .
TERADATA ASTER http://www.teradata.com/products-and-services/analytics-from-aster-overview
TERADATA 一危 蟯襴 覿 蟲.
ろ 豢 螳覦 3覈 襷 aster data 螳 . Teradata螳 2011 語.
IMPETUS STREAM ANALYTIX https://streamanalytix.com
れ螳 豌襴 覦 覿 螳ロ 
