6. Hadoop-like parallel data processing
systems from 1,000,000 feet
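• Parallel data processing systems such as Hadoop
– Hadoop, Hive, Impala, Presto, Spark, Tez
• Introducing the essence of these systems, which are a mix of good and bad
– Why are Impala and Presto fast?
– What are Spark and Tez?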
7. MapReduce: A major step backwards
• By D. DeWitt and M. Stonebraker
• MapReduce is not novel
http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
The MapReduce community seems to feel that they have discovered an entirely new paradigm for
processing large data sets. In actuality, the techniques employed by MapReduce are more than 20
years old. The idea of partitioning a large data set into smaller partitions was first proposed in
"Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join
algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7], Gerber demonstrated how
Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8]
cluster using a combination of partitioned tables, partitioned execution, and hash based splitting.
DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without
group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they
process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in
parallel.
Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20
years; exactly the techniques that the MapReduce crowd claims to have invented.
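The partitioning idea described in the quote can be sketched roughly as follows. This is a minimal single-process illustration, not code from any of the cited papers: the function names (`hash_partition`, `hash_join`, `partitioned_hash_join`) and the sample data are assumptions made for the example, and the loop over partition pairs stands in for what a shared-nothing cluster would run on separate nodes.

```python
from collections import defaultdict

def hash_partition(rows, key, n_parts):
    """Hash-based splitting: route each row to a bucket by its join key."""
    parts = [[] for _ in range(n_parts)]
    for row in rows:
        parts[hash(row[key]) % n_parts].append(row)
    return parts

def hash_join(build_rows, probe_rows, key):
    """Join one pair of partitions: build a hash table, then probe it."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    out = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out

def partitioned_hash_join(r, s, key, n_parts=4):
    """Partition both inputs with the same hash function, then join
    matching partitions. Rows with equal keys always land in the same
    partition pair, so each pair is independent of the others
    (hypothetically one node per pair; here a plain loop)."""
    r_parts = hash_partition(r, key, n_parts)
    s_parts = hash_partition(s, key, n_parts)
    result = []
    for rp, sp in zip(r_parts, s_parts):
        result.extend(hash_join(rp, sp, key))
    return result

# Illustrative data only.
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"},
             {"cust": 3, "name": "Cy"}]
orders = [{"cust": 1, "item": "disk"}, {"cust": 2, "item": "cpu"},
          {"cust": 1, "item": "ram"}]
joined = partitioned_hash_join(customers, orders, "cust")
```

The same split-by-hash step is what a MapReduce shuffle does between the map and reduce phases, which is the overlap the quote points at.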
12. [Supplement] Hash Join w/ Left/Right/Bushy tree
[Figure: hash-join trees over relations R1 to R4 with joins J1 to J3, drawn in left-deep (LD), right-deep (RD), and bushy (BS) shapes. Each join is annotated with B and P phases (presumably build and probe) and split into parts a, b, c (J1a, J1b, J1c, ...). Two schedules are contrasted: Sequential Processing and Pipelined Processing.]
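As a rough sketch of why tree shape matters for hash joins, the code below contrasts a left-deep and a right-deep plan over three relations. It assumes the common convention that the left input of each join is the build side and the right input is the probe side; that convention, the function names, and the sample relations are assumptions for illustration and are not taken from the figure.

```python
from collections import defaultdict

def build(rows, key):
    """Build phase (B): load one input into an in-memory hash table."""
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def probe(table, rows, key):
    """Probe phase (P): stream the other input and yield matches lazily."""
    for row in rows:
        for match in table.get(row[key], []):
            yield {**match, **row}

def left_deep(r1, r2, r3, k1, k2):
    """(R1 join R2) join R3. The result of J1 is the build input of J2,
    so J1 must finish completely before J2 can start probing."""
    j1 = list(probe(build(r1, k1), r2, k1))  # J1 output is materialized
    return list(probe(build(j1, k2), r3, k2))

def right_deep(r1, r2, r3, k1, k2):
    """R1 join (R2 join R3). Only base relations are build inputs, so both
    hash tables exist up front and R3 flows through both probes as a pipeline."""
    t1, t2 = build(r1, k1), build(r2, k2)
    return list(probe(t1, probe(t2, r3, k2), k1))

# Illustrative relations: R1(a, x), R2(a, b), R3(b, y).
r1 = [{"a": 1, "x": "p"}, {"a": 2, "x": "q"}]
r2 = [{"a": 1, "b": 10}, {"a": 2, "b": 20}, {"a": 3, "b": 30}]
r3 = [{"b": 10, "y": "u"}, {"b": 20, "y": "v"}, {"b": 40, "y": "w"}]

ld = left_deep(r1, r2, r3, "a", "b")
rd = right_deep(r1, r2, r3, "a", "b")
```

Both plans return the same rows. Under the stated convention, the right-deep plan trades memory (all build tables resident at once) for pipelining, while the left-deep plan holds fewer tables at a time but processes the joins one after another.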
13. [Supplement] Hash Join w/ other trees
[Figure: Segmented RD (right-deep) and Zig Zag join trees.]
Figures by courtesy of Dr. Nakano.
• Actively studied from the late 1980s through the early 1990s
• (Probably) never became widespread in commercial DBMSs
15. Spark SQL, Hive on Tez
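• SQL on top of DAG
– Optimizing a DAG is difficult because the search space explodes
– Even for trees, the search is restricted to particular tree shapes
• In the end, optimization is done over trees
– Having a DAG engine underneath no longer means anything
– Left-deep, Bushy in Hive on Tez (H2 2014)
– Logically the same as Impala and Presto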
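To give a feel for the search-space explosion mentioned above, the sketch below counts join trees for n relations using standard combinatorics: n! left-deep orderings, and (2n-2)!/(n-1)! bushy trees when operand order matters and cross products are allowed. The function names are made up for this example, and the counts cover trees only; general DAG plans would form a larger space still.

```python
from math import comb, factorial

def left_deep_orders(n):
    """Left-deep trees over n relations: one per ordering of the leaves."""
    return factorial(n)

def bushy_trees(n):
    """All binary join trees over n relations, counting operand order:
    Catalan(n-1) tree shapes times n! leaf orderings = (2n-2)! / (n-1)!."""
    catalan = comb(2 * (n - 1), n - 1) // n
    return catalan * factorial(n)

# Print the growth of both spaces for a few query sizes.
for n in (2, 4, 8, 12):
    print(n, left_deep_orders(n), bushy_trees(n))
```

For four relations this gives 24 left-deep orderings against 120 bushy trees, and the gap widens quickly with n, which is one reason optimizers often restrict the shapes they enumerate.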