
The BI for Hadoop Benchmark
Q1 2016
atscale.com/benchmark

© 2015 ATSCALE, INC. ALL RIGHTS RESERVED. CONFIDENTIAL & PROPRIETARY
Hadoop Use Cases have evolved
[Chart: share of respondents using Hadoop for ETL, Data Science, and Business Intelligence, Yesterday vs. Today. Extracted values: 74%, 62%, 65% and 51%, 56%, 69%.]
atscale.com/survey

Self-Service leads to Business Value
atscale.com/survey
[Chart: No Access vs. Self Service. Extracted values: 41%, 61%, 59%, 39%.]
Companies that provide self-service access to business units are 50% more likely to gain value out of Hadoop.

Most Don't Have Self-Service on Hadoop
atscale.com/survey
Close to 60% have not provided self-service access to Hadoop yet.
Yes: 41%
No: 59%

Why Self-Service is so Hard
1. Current BI Tools are limited
2. Hadoop is not optimized for performance
3. Governance and security are an issue
4. Current approaches are unnatural

Benchmark Framework
Three key concepts need to be inspected when evaluating SQL-on-Hadoop engines and their fitness to satisfy Business Intelligence workloads:
- Performs on Big Data: the SQL-on-Hadoop engine must be able to consistently analyze billions or trillions of rows of data without generating errors and with response times on the order of tens or hundreds of seconds.
- Fast on Small Data: the engine needs to deliver interactive performance on known query patterns, so it is important that the SQL-on-Hadoop engine return results in no more than a few seconds on small data sets (on the order of thousands or millions of rows).
- Stable for Many Users: Enterprise BI user bases consist of hundreds or thousands of data workers, and as a result the underlying SQL-on-Hadoop engine must perform reliably under highly concurrent analysis workloads.
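
The three criteria above can be sketched as a small test harness. This is a minimal illustration, not the harness used in the benchmark: the function names `run_concurrently` and `meets_criterion` are hypothetical, and the latency limit is an assumed stand-in for "a few seconds" or "tens or hundreds of seconds". `query_fn` is any callable that submits one query to the engine under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(query_fn, users):
    """Run query_fn once per simulated user; return (latencies, error_count)."""
    def one_user(_):
        start = time.perf_counter()
        try:
            query_fn()
            ok = True
        except Exception:
            ok = False
        return time.perf_counter() - start, ok

    # One thread per simulated user, so all queries are in flight together.
    with ThreadPoolExecutor(max_workers=users) as pool:
        results = list(pool.map(one_user, range(users)))

    latencies = [elapsed for elapsed, ok in results if ok]
    errors = sum(1 for _, ok in results if not ok)
    return latencies, errors


def meets_criterion(latencies, errors, max_seconds):
    """True if every run succeeded and the slowest run stayed within max_seconds.

    max_seconds is an assumption chosen by the tester, e.g. a few seconds for
    small data or a few hundred seconds for big data.
    """
    return errors == 0 and bool(latencies) and max(latencies) <= max_seconds
```

Running the same query with `users=1` and then with a larger `users` value gives a rough view of how an engine's latency and error rate change under concurrency.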

Benchmark Queries
Data Set: Star Schema Benchmark (SSB) data set
6B rows, 13 queries, 3 patterns
1. "Quick Metric" queries: Compute a particular metric value for a period of time. These queries have a small number of joins and minimal or no group-bys (Q1.1 - Q1.3)
2. "Product Insight" queries: Compute a metric (or several metrics) aggregated against a set of product and date based dimensions. These queries include "medium" sized joins and a small number of group-bys (Q2.1 - Q2.3)
3. "Customer Insight" queries: Compute a metric (or several metrics) aggregated against a set of product, customer, and date-based dimensions. These queries include both "medium" and "very large" sized joins as well as a number of group-bys (Q3.1 - Q4.3)
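
To make the "Quick Metric" pattern concrete, here is a minimal sketch that runs a query shaped like Q1.1 from the public Star Schema Benchmark specification against a tiny in-memory SQLite database. The exact SQL used in this benchmark is not shown in the slides, so treat the query text as an assumption. The table name `dates` (standing in for the DATE table), the four sample rows, and the function name `run_quick_metric` are illustrative only.

```python
import sqlite3

# Q1.1-style query: one join to the date dimension, no group-bys, three filters
# (one on the joined table, one range and one comparison on LINEORDER).
Q1_1 = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
JOIN dates ON lo_orderdate = d_datekey
WHERE d_year = 1993
  AND lo_discount BETWEEN 1 AND 3
  AND lo_quantity < 25
"""


def run_quick_metric():
    """Build a toy schema, load four rows, and return the Q1.1-style revenue."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dates (d_datekey INTEGER PRIMARY KEY, d_year INTEGER);
        CREATE TABLE lineorder (
            lo_orderdate INTEGER,
            lo_extendedprice INTEGER,
            lo_discount INTEGER,
            lo_quantity INTEGER
        );
        INSERT INTO dates VALUES (19930101, 1993), (19940101, 1994);
        INSERT INTO lineorder VALUES
            (19930101, 100, 2, 10),  -- passes all filters: 100 * 2
            (19930101, 50, 5, 10),   -- discount outside 1..3
            (19930101, 70, 1, 30),   -- quantity not below 25
            (19940101, 100, 2, 10);  -- wrong year
    """)
    (revenue,) = conn.execute(Q1_1).fetchone()
    conn.close()
    return revenue
```

Only the first LINEORDER row passes all three filters, so the function returns 200. The "Product Insight" and "Customer Insight" patterns add more dimension joins and GROUP BY columns to the same shape.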

Un-Aggregated Results

Benchmark Key Findings
- One engine does not fit all: Depending on raw data size, query complexity, and the target number of end-users, enterprises will find that one engine can't accomplish it all. Each engine has its own "sweet spot," and enterprises may find that a blended usage of SQL-on-Hadoop engines fits their company's goals better.
- Small vs. Big Data: While all query engines successfully completed the "Large Data" query tests, Spark SQL and Impala performed better on smaller data sets (tables with thousands or several million rows of data).
- Few vs. Many Users: Impala showed the best concurrency test results, ahead of Hive and Spark SQL. Companies that anticipate connecting large numbers of business users to Hadoop may want to consider Impala.
- Constant Innovation: Open source contribution, as seen in the Spark SQL improvements, provides constant innovation. We expect the industry to continue innovating here: for example, Cloudera donated the Impala project to the Apache Software Foundation this past November. There is no doubt more innovation will come out of this new development.

Environment Details

Benchmarks: Environment
RAM per node: 128G
CPU specs for data (worker) nodes: 32 CPU cores
Storage specs for data (worker) nodes: 2x 512mb SSD
For our test environment we used a 12-node cluster with:
- 1 master node
- 1 gateway node
- 10 data nodes

Benchmarks: Data Set
Table Name      Number of Rows
CUSTOMER_SMALL  30M
CUSTOMER        1B
LINEORDER       6B
SUPPLIER        2M
PART            2M
DATE            16K

Benchmarks: Queries
Query ID  Number of Joins  Largest Join Table (rows)  Number of Group Bys  Number of Filters  Comments
Q1.1      1                16,799                     0                    3                  1 range condition, 1 comparative filter condition directly on LINEORDER table
Q1.2      1                16,799                     0                    3                  2 range filter conditions directly on LINEORDER table
Q1.3      1                16,799                     0                    4                  2 range filter conditions directly on LINEORDER table, 2 conditions on joined table
Q2.1      3                2,000,000                  2                    2                  filter on p_category (less selective)
Q2.2      3                2,000,000                  2                    2                  filter on p_brand, 2 values (more selective)
Q2.3      3                2,000,000                  2                    2                  filter on p_brand, 1 value (most selective)
Q3.1      3                1,050,000,000              3                    3                  filter on region (less selective)
Q3.2      3                1,050,000,000              3                    3                  filter on nation (more selective)
Q3.3      3                1,050,000,000              3                    3                  filter on city (most selective)
Q3.4      3                1,050,000,000              3                    3                  filter on city (most selective) and month (vs. year)
Q4.1      4                1,050,000,000              2                    2
Q4.2      4                1,050,000,000              3                    3                  includes filter on year (more selective)
Q4.3      4                1,050,000,000              3                    3                  includes filter on year and nation (most selective)

About AtScale

AtScale Intelligence Platform
I.T. needs
Control & Consistency
The Business needs
Freedom & Self-Service
The Business Interface
for Hadoop

Superior Architecture
- Any BI tool
- Industry standards
- Schema on demand
- Write once

