The BI for Hadoop Benchmark
Q1 2016
atscale.com/benchmark
© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY
Hadoop Use Cases have evolved
[Chart: two panels, labeled Yesterday and Today]
ETL 74%, Data Science 62%, Business Intelligence 65%
ETL 51%, Data Science 56%, Business Intelligence 69%
atscale.com/survey
Self-Service leads to Business Value
atscale.com/survey
[Chart: No Access vs. Self Service; values 41%, 61%, 59%, 39%]
Companies that provide self-service access to business units are 50% more likely to gain value out of Hadoop
Most Don't Have Self-Service on Hadoop
atscale.com/survey
Close to 60% have not provided self-service access to Hadoop yet
Yes 41%, No 59%
Why Self-Service is so Hard
1. Current BI Tools are limited
2. Hadoop is not optimized for performance
3. Governance and security are an issue
4. Current approaches are unnatural
Benchmark Framework
Three key concepts need to be inspected when evaluating SQL-on-Hadoop engines and their fitness to
satisfy Business Intelligence workloads:
• Performs on Big Data: the SQL-on-Hadoop engine must be able to consistently analyze billions or
trillions of rows of data without generating errors and with response times on the order of tens or
hundreds of seconds.
• Fast on Small Data: the engine needs to deliver interactive performance on known query patterns,
and as such it is important that the SQL-on-Hadoop engine return results in no greater than a few
seconds on small data sets (on the order of thousands or millions of rows).
• Stable for Many Users: Enterprise BI user bases consist of hundreds or thousands of data workers,
and as a result the underlying SQL-on-Hadoop engine must perform reliably under highly
concurrent analysis workloads.
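The "Stable for Many Users" criterion can be made concrete with a small concurrency harness. The deck does not describe the harness that was actually used, so the following is only a minimal sketch: `run_query` is a hypothetical stand-in for submitting one query to an engine, and the user counts are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_query(query_id: str) -> float:
    """Hypothetical stand-in for one engine round trip; returns elapsed seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder: a real harness would submit SQL to the engine here
    return time.perf_counter() - start


def concurrency_test(query_ids, concurrent_users):
    """Run each query once per simulated user, with up to `concurrent_users` in flight."""
    workload = [q for _ in range(concurrent_users) for q in query_ids]
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        latencies = list(pool.map(run_query, workload))
    return {
        "queries_run": len(latencies),
        "median_s": median(latencies),
        "max_s": max(latencies),
    }


if __name__ == "__main__":
    # Illustrative user counts only; the deck does not state the levels it tested.
    for users in (1, 5, 25):
        print(users, concurrency_test(["Q1.1", "Q2.1", "Q3.1"], users))
```

Comparing the median and maximum latency as the user count grows is one simple way to see whether an engine degrades gracefully under concurrent load.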
Benchmark Queries
Data Set: Star Schema Benchmark (SSB) data set
6B rows, 13 queries, 3 patterns
1. Quick Metric queries: Compute a particular metric value for a period of time. These
queries have a small number of joins and minimal or no group-bys (Q1.1 - Q1.3)
2. Product Insight queries: Compute a metric (or several metrics) aggregated against a
set of product and date-based dimensions. These queries include medium-sized joins
and a small number of group-bys (Q2.1 - Q2.3)
3. Customer Insight queries: Compute a metric (or several metrics) aggregated against a set of
product, customer, and date-based dimensions. These queries include both medium
and very large joins as well as a number of group-bys (Q3.1 - Q4.3)
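To illustrate the Quick Metric pattern, here is a minimal sketch that runs the standard SSB Q1.1 query shape (one join to the date dimension, no group-by) against a toy in-memory SQLite database. The rows are invented for illustration, the date dimension is named `date_dim` here as an assumption, and SQLite stands in for a SQL-on-Hadoop engine only to keep the example self-contained.

```python
import sqlite3


def quick_metric_revenue() -> int:
    """Run an SSB Q1.1-style query on a few invented rows and return the revenue metric."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE date_dim (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
        CREATE TABLE lineorder (
            lo_orderdate INTEGER, lo_quantity INTEGER,
            lo_discount INTEGER, lo_extendedprice INTEGER
        );
        INSERT INTO date_dim VALUES (19930101, 1993), (19940101, 1994);
        INSERT INTO lineorder VALUES
            (19930101, 10, 2, 1000),  -- qualifies: 1000 * 2
            (19930101, 30, 2, 5000),  -- rejected: quantity not below 25
            (19930101, 10, 5, 7000),  -- rejected: discount outside 1..3
            (19940101, 10, 2, 9000);  -- rejected: wrong year
    """)
    row = conn.execute("""
        SELECT SUM(lo_extendedprice * lo_discount) AS revenue
        FROM lineorder
        JOIN date_dim ON lo_orderdate = d_datekey
        WHERE d_year = 1993                 -- filter on the joined date table
          AND lo_discount BETWEEN 1 AND 3   -- range condition on LINEORDER
          AND lo_quantity < 25              -- comparative condition on LINEORDER
    """).fetchone()
    conn.close()
    return row[0]


if __name__ == "__main__":
    print(quick_metric_revenue())
```

Only the first toy row passes all three filters, so the query returns a single aggregated value with no grouping, which is what distinguishes this pattern from the two insight patterns.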
Un-Aggregated Results
Benchmark Key Findings
• One engine does not fit all: Depending on raw data size, query complexity, and the target number of
end-users, enterprises will find that one engine can't accomplish it all. Each engine has its own
sweet spot, and enterprises may find that a blended usage of SQL-on-Hadoop engines might fit their
company's goals better.
• Small vs. Big Data: While all query engines successfully completed the Large Data query tests,
Spark SQL and Impala performed better on smaller data sets: tables with thousands or several
million rows of data.
• Few vs. Many Users: Impala has shown the best concurrency test results, over Hive and Spark SQL.
Companies that anticipate connecting large numbers of business users to Hadoop may want to
consider Impala.
• Constant Innovation: Open source contribution, as seen by Spark SQL improvements, provides
constant innovation. We expect the industry to continue innovating here: for example, Cloudera
donated the Impala project to the Apache Software Foundation this past November. There is no
doubt more innovation will come out from this new development.
Environment Details
Benchmarks: Environment
RAM per node: 128G
CPU specs for data (worker) nodes: 32 CPU cores
Storage specs for data (worker) nodes: 2x 512mb SSD
For our test environment we used a 12-node cluster with:
• 1 master node
• 1 gateway node
• 10 data nodes
Benchmarks: Data Set
Table Name      Number of Rows
CUSTOMER_SMALL  30M
CUSTOMER        1B
LINEORDER       6B
SUPPLIER        2M
PART            2M
DATE            16K
Benchmarks: Queries
Query ID  Number of Joins  Largest Join Table  Number of Group Bys  Number of Filters  Comments
Q1.1      1                16,799              0                    3                  1 range condition, 1 comparative filter condition directly on LINEORDER table
Q1.2      1                16,799              0                    3                  2 range filter conditions directly on LINEORDER table
Q1.3      1                16,799              0                    4                  2 range filter conditions directly on LINEORDER table, 2 conditions on joined table
Q2.1      3                2,000,000           2                    2                  filter on p_category (less selective)
Q2.2      3                2,000,000           2                    2                  filter on p_brand, 2 values (more selective)
Q2.3      3                2,000,000           2                    2                  filter on p_brand, 1 value (most selective)
Q3.1      3                1,050,000,000       3                    3                  filter on region (less selective)
Q3.2      3                1,050,000,000       3                    3                  filter on nation (more selective)
Q3.3      3                1,050,000,000       3                    3                  filter on city (most selective)
Q3.4      3                1,050,000,000       3                    3                  filter on city (most selective) and month (vs. year)
Q4.1      4                1,050,000,000       2                    2
Q4.2      4                1,050,000,000       3                    3                  includes filter on year (more selective)
Q4.3      4                1,050,000,000       3                    3                  includes filter on year and nation (most selective)
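As a rough illustration of how the join, group-by, and filter counts in this table map onto SQL, the sketch below runs the standard SSB Q2.1 query shape (3 joins, 2 group-bys, 2 filters) on a toy in-memory SQLite database. All rows are invented, the date dimension is named `date_dim` as an assumption, and SQLite is used only so the example is self-contained; it says nothing about how the benchmarked engines behave.

```python
import sqlite3


def product_insight_rows():
    """Run an SSB Q2.1-style query on invented rows; returns (revenue, year, brand) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE date_dim (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
        CREATE TABLE part (p_partkey INTEGER PRIMARY KEY, p_category TEXT, p_brand1 TEXT);
        CREATE TABLE supplier (s_suppkey INTEGER PRIMARY KEY, s_region TEXT);
        CREATE TABLE lineorder (
            lo_orderdate INTEGER, lo_partkey INTEGER,
            lo_suppkey INTEGER, lo_revenue INTEGER
        );
        INSERT INTO date_dim VALUES (19920101, 1992), (19930101, 1993);
        INSERT INTO part VALUES
            (1, 'MFGR#12', 'MFGR#1201'),
            (2, 'MFGR#12', 'MFGR#1202'),
            (3, 'MFGR#13', 'MFGR#1301');
        INSERT INTO supplier VALUES (1, 'AMERICA'), (2, 'ASIA');
        INSERT INTO lineorder VALUES
            (19920101, 1, 1, 100),
            (19920101, 1, 1, 50),
            (19930101, 2, 1, 70),
            (19920101, 3, 1, 999),  -- rejected: part is in another category
            (19920101, 1, 2, 888);  -- rejected: supplier is in another region
    """)
    rows = conn.execute("""
        SELECT SUM(lo_revenue) AS revenue, d_year, p_brand1
        FROM lineorder
        JOIN date_dim ON lo_orderdate = d_datekey   -- join 1
        JOIN part     ON lo_partkey   = p_partkey   -- join 2
        JOIN supplier ON lo_suppkey   = s_suppkey   -- join 3
        WHERE p_category = 'MFGR#12'                -- filter 1
          AND s_region   = 'AMERICA'                -- filter 2
        GROUP BY d_year, p_brand1                   -- 2 group-bys
        ORDER BY d_year, p_brand1
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in product_insight_rows():
        print(row)
```

Moving from Q2.1 to Q2.3 in the table keeps this shape and only tightens the part filter, which is why the comments column describes those queries by selectivity.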
About AtScale
AtScale Intelligence Platform
I.T. needs Control & Consistency
The Business needs Freedom & Self-Service
The Business Interface for Hadoop
Superior Architecture
• Any BI tool
• Industry standards
• Schema on demand
• Write once
