Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by Delivering Enterprise Apache Hadoop
March 2014
Our Mission:
Our Commitment
Open Leadership
Drive innovation in the open exclusively via the
Apache community-driven open source process
Enterprise Rigor
Engineer, test and certify Apache Hadoop with
the enterprise in mind
Ecosystem Endorsement
Focus on deep integration with existing data
center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners
Enable your Modern Data Architecture by
Delivering Enterprise Apache Hadoop
Requirements for
Enterprise Hadoop in the
Modern Data Architecture
Requirements for Enterprise Hadoop
1. Key Services
Platform, operational and data services essential for the enterprise
2. Skills
Leverage your existing skills: development, analytics, operations
3. Integration
Interoperable with existing data center investments
[Diagram: enterprise Hadoop services, grouped into Core, Data and Operational services]
Storage: HDFS
Resource Management: YARN
Process: MapReduce, Tez
Data Access: Hive & HCatalog, Pig, HBase
Data Movement: Sqoop, Flume, NFS, WebHDFS
Schedule: Oozie
Cluster Management: Ambari
Dataset Management: Falcon*
Data Security: Knox*
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
HDP: A Complete Hadoop Distribution
1. Key Services
Platform, operational and data services essential for the enterprise
2. Skills
Leverage your existing skills: development, analytics, operations
3. Integration
Interoperable with existing data center investments
[Diagram: Hortonworks Data Platform (HDP), grouped into Core, Data and Operational services]
Deployment: OS/VM, Cloud, Appliance
Storage: HDFS
Resource Management: YARN
Process: MapReduce, Tez
Data Access: Hive & HCatalog, Pig, HBase
Data Movement (Load & Extract): Sqoop, Flume, NFS, WebHDFS
Schedule: Oozie
Cluster Management: Ambari
Dataset Management: Falcon*
Security: Knox*
Enterprise Readiness: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots
Store all data in a single place, interact in multiple ways
Hadoop 2: The Introduction of YARN
1st Gen of
Hadoop
HDFS
(redundant, reliable storage)
MapReduce
(cluster resource management
& data processing)
HADOOP 2
Single Use System
Batch Apps
Multi Use Data Platform
Batch, Interactive, Online, Streaming, 
Redundant, Reliable Storage
(HDFS)
Efficient Cluster Resource
Management & Shared Services
(YARN)
Standard Query
Processing
Hive, Pig
Batch
MapReduce
Interactive
Tez
Online Data
Processing
HBase, Accumulo
Real Time Stream
Processing
Storm
others
Apache Hadoop YARN
Flexible
Enables other purpose-built data
processing models beyond
MapReduce (batch), such as
interactive and streaming
Efficient
Double processing IN Hadoop on
the same hardware while
providing predictable
performance & quality of service
Shared
Provides a stable, reliable,
secure foundation and
shared operational services
across multiple workloads
The data operating system for Hadoop 2.0
Data Processing Engines Run Natively IN Hadoop
BATCH
MapReduce
INTERACTIVE
Tez
STREAMING
Storm
IN-MEMORY
Spark
GRAPH
Giraph
SAS
LASR, HPA
ONLINE
HBase, Accumulo
OTHERS
HDFS: Redundant, Reliable Storage
YARN: Cluster Resource Management
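The idea of one shared resource pool serving several processing engines can be sketched in Python. This is a toy model only, not the YARN API: the class `ToyResourceManager`, its methods and the application names are hypothetical, and memory is the only resource tracked.

```python
from typing import Dict


class ToyResourceManager:
    """Hands out memory from one shared pool to any framework that asks."""

    def __init__(self, total_mb: int) -> None:
        self.free_mb = total_mb
        self.granted: Dict[str, int] = {}

    def request(self, app: str, mb: int) -> bool:
        """Grant a container of `mb` megabytes if the pool has room."""
        if mb <= 0 or mb > self.free_mb:
            return False
        self.free_mb -= mb
        self.granted[app] = self.granted.get(app, 0) + mb
        return True

    def release(self, app: str, mb: int) -> None:
        """Return memory to the pool when a container finishes."""
        mb = min(mb, self.granted.get(app, 0))
        self.granted[app] = self.granted.get(app, 0) - mb
        self.free_mb += mb


rm = ToyResourceManager(total_mb=8192)
rm.request("mapreduce-batch", 4096)       # granted
rm.request("tez-interactive", 2048)       # granted
rm.request("storm-streaming", 4096)       # refused: only 2048 MB left
rm.release("mapreduce-batch", 4096)       # batch job finishes
rm.request("storm-streaming", 4096)       # now granted from the same pool
```

The point of the sketch is that batch, interactive and streaming workloads draw on the same cluster rather than on separate, stovepiped clusters.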
Driving Our Innovation Through Apache
147,933 lines
614,041 lines
End Users
449,768 lines
Total Net Lines Contributed
to Apache Hadoop
Yahoo: 10
Cloudera: 7
IBM: 3
10 Others
21
Facebook: 5
LinkedIn: 3
Total Number of Committers
to Apache Hadoop
63
total
Hortonworks' mission is
to power your modern data architecture by enabling
Hadoop to be an enterprise data platform that
deeply integrates with your data center technologies
Apache Project   Committers   PMC Members
Hadoop           21           13
Tez              10           4
Hive             11           3
HBase            8            3
Pig              6            5
Sqoop            1            0
Ambari           20           12
Knox             6            2
Falcon           2            2
Oozie            2            2
ZooKeeper        2            1
Flume            1            0
Accumulo         2            2
Storm            1            0
Drill            1            0
TOTAL            95           48
Patterns for Hadoop Applications
1. Key Services
Platform, operational and data services essential for the enterprise
2. Skills
Leverage your existing skills: development, analytics, operations
3. Integration
Interoperable with existing data center investments
DEVELOP: COLLECT, PROCESS, BUILD
ANALYZE: EXPLORE, QUERY, DELIVER
OPERATE: PROVISION, MANAGE, MONITOR
Familiar and Existing Tools
1. Key Services
Platform, operational and data services essential for the enterprise
2. Skills
Leverage your existing skills: development, analytics, operations
3. Integration
Interoperable with existing data center investments
DEVELOP: COLLECT, PROCESS, BUILD
ANALYZE: EXPLORE, QUERY, DELIVER
OPERATE: PROVISION, MANAGE, MONITOR
BusinessObjects BI
SQL Interactive Query & Apache Hive
1. Key Services
Platform, operational and data services essential for the enterprise
2. Skills
Leverage your existing skills: development, analytics, operations
3. Integration
Interoperable with existing data center investments
Stinger Initiative
Broad, community based effort to deliver the
next generation of Apache Hive
Scale
The only SQL interface
to Hadoop designed for
queries that scale from
TB to PB
SQL
Support broadest range
of SQL semantics for
analytic applications
against Hadoop
Speed
Improve Hive query
performance by 100X to
allow for interactive
query times (seconds)
SQL
Apache Hive
 The de facto standard for Hadoop SQL access
 Used by your current data center partners
 Built for batch AND interactive query
APPLICATIONS
DATA SYSTEM
REPOSITORIES
SOURCES
Existing Sources
(CRM, ERP, Clickstream, Logs)
RDBMS EDW MPP
Emerging Sources
(Sensor, Sentiment, Geo, Unstructured)
OPERATIONAL
TOOLS
MANAGE &
MONITOR
DEV & DATA
TOOLS
BUILD &
TEST
Business
Analytics
Custom
Applications
Packaged
Applications
Requirements for Enterprise Hadoop
3. Integration
Interoperable with existing data center investments
Integrate with
Applications
Business Intelligence,
Developer IDEs,
Data Integration
Systems
Data Systems & Storage,
Systems Management
Platforms
Operating Systems,
Virtualization, Cloud,
Appliances
Broad Ecosystem Integration
APPLICATIONS
DATA SYSTEM
SOURCES
RDBMS EDW MPP
Emerging Sources
(Sensor, Sentiment, Geo, Unstructured)
HANA
BusinessObjects BI
OPERATIONAL TOOLS
DEV & DATA TOOLS
Existing Sources
(CRM, ERP, Clickstream, Logs)
INFRASTRUCTURE
Apache Hive and Stinger:
SQL in Hadoop
Arun Murthy (@acmurthy)
Alan Gates (@alanfgates)
Owen O'Malley (@owen_omalley)
@hortonworks
Stinger Project
(announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative
A broad, community-based effort to
drive the next generation of HIVE
Coming Soon:
 Hive on Apache Tez
 Query Service
 Buffer Cache
 Cost Based Optimizer (Optiq)
 Vectorized Processing
Hive 0.11, May 2013:
 Base Optimizations
 SQL Analytic Functions
 ORCFile, Modern File Format
Hive 0.12, October 2013:
 VARCHAR, DATE Types
 ORCFile predicate pushdown
 Advanced Optimizations
 Performance Boosts via YARN
Goals:
Speed
Improve Hive query performance by 100X to
allow for interactive query times (seconds)
Scale
The only SQL interface to Hadoop designed
for queries that scale from TB to PB
SQL
Support broadest range of SQL semantics for
analytic applications running against Hadoop
all IN Hadoop
Hive 0.12
Release Theme: Speed, Scale and SQL
Specific Features:
 10x faster query launch when using a large number (500+) of partitions
 ORCFile predicate pushdown speeds queries
 Evaluate LIMIT on the map side
 Parallel ORDER BY
 New query optimizer
 Introduces VARCHAR and DATE datatypes
 GROUP BY on structs or unions
Included Components: Apache Hive 0.12
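To make "predicate pushdown speeds queries" concrete, here is a minimal Python sketch of the general technique: keep min/max statistics per block of a column and skip any block that cannot contain a matching value. The `Block` class and `scan_greater_than` function are hypothetical names for illustration and do not reflect ORCFile's actual on-disk layout or Hive's reader code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """A hypothetical chunk of a columnar file with min/max stats for one column."""
    min_value: int
    max_value: int
    values: List[int]


def scan_greater_than(blocks: List[Block], bound: int) -> Tuple[List[int], int]:
    """Return matching values and the number of blocks actually read."""
    matches: List[int] = []
    blocks_read = 0
    for block in blocks:
        # Pushdown: if even the block maximum fails the predicate, skip it unread.
        if block.max_value <= bound:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v > bound)
    return matches, blocks_read


blocks = [
    Block(1, 10, [1, 5, 10]),
    Block(11, 20, [11, 15, 20]),
    Block(21, 30, [21, 25, 30]),
]
matches, blocks_read = scan_greater_than(blocks, 18)
print(matches, blocks_read)  # [20, 21, 25, 30] 2
```

Only two of the three blocks are read for `value > 18`; the saving grows with how selective the predicate is and how well the data is clustered.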
SPEED: Increasing Hive Performance
Performance Improvements
included in Hive 0.12
 Base & advanced query optimization
 Startup time improvement
 Join optimizations
Interactive Query Times across ALL use cases
 Simple and advanced queries in seconds
 Integrates seamlessly with existing tools
 Currently a >100x improvement in just nine months
Stinger Phase 3: Unlocking Interactive Query
Stinger Phase 3: Features and Benefits
Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
Container Re-Use: Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning
Tez Integration: Tez Broadcast Edge and Intermediate Reduce pattern improve query scale and throughput
In-Memory Cache: Hot data kept in RAM for fast access
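The effect of container pre-launch and re-use on latency can be illustrated with a small Python model. It is a sketch under simple assumptions, not a description of Tez internals: tasks are assigned in order to whichever slot frees up first, `startup` stands in for JVM/container launch time, and the function name `makespan` is hypothetical.

```python
import heapq
from typing import List


def makespan(tasks: List[float], slots: int, startup: float, reuse: bool) -> float:
    """Finish time when tasks go, in order, to whichever slot frees up first."""
    # Each heap entry is the time at which a slot becomes free.
    # With re-use, the startup cost is paid once up front per slot.
    free_at = [startup if reuse else 0.0] * slots
    heapq.heapify(free_at)
    finish = 0.0
    for duration in tasks:
        start = heapq.heappop(free_at)
        # Without re-use, a fresh container is launched for every task.
        end = start + duration + (0.0 if reuse else startup)
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish


tasks = [1.0] * 8
print(makespan(tasks, slots=2, startup=2.0, reuse=False))  # 12.0
print(makespan(tasks, slots=2, startup=2.0, reuse=True))   # 6.0
```

In this toy setting the startup cost dominates short tasks, which is why re-use matters most for interactive queries made of many small tasks.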
Stinger Phase 3: Speed, Scale, and SQL
Release Theme: Prove Hive for both large-scale and interactive SQL / analytics
Specific Features:
 < 10s SQL queries over 200GB datasets through Hive
 Tez container pre-launch
 Tez container re-use
 Use of Tez Intermediate Reduce pattern
 In-memory HDFS caching
Made available as part of the Tech Preview for Stinger Phase 3
Stinger Phase 3: Beyond Tech Preview
Release Theme: Speed, SQL, and Security
Specific Features:
 Hive-on-Tez: Interactive query on Hive
 SQL Improvements:
 Sub-query for WHERE
 Standard JOIN semantics
 Support for Common Table Expressions (CTE)
 Phase 1 of ACID Semantics support
 Automatic JOIN order optimization
 CHAR datatype
 PAM authentication support
 SSL encryption
SQL: Enhancing SQL Semantics
Hive SQL Datatypes:
INT
TINYINT/SMALLINT/BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
TIMESTAMP
BINARY
DECIMAL
ARRAY, MAP, STRUCT, UNION
DATE
VARCHAR
CHAR
Hive SQL Semantics:
SELECT, INSERT
GROUP BY, ORDER BY, SORT BY
JOIN on explicit join key
Inner, outer, cross and semi joins
Sub-queries in FROM clause
ROLLUP and CUBE
UNION
Windowing Functions (OVER, RANK, etc.)
Custom Java UDFs
Standard Aggregation (SUM, AVG, etc.)
Advanced UDFs (ngram, Xpath, URL)
Sub-queries in WHERE, HAVING
Expanded JOIN Syntax
SQL Compliant Security (GRANT, etc.)
INSERT/UPDATE/DELETE (ACID)
Legend: Hive 0.12, Available, Roadmap
SQL Compliance
Hive 0.12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop
Vectorized Query Execution
Designed for Modern Processor Architectures
Avoid branching in the inner loop.
Make the most use of L1 and L2 cache.
How It Works
Process records in batches of 1,000 rows
Generate code from templates to minimize branching.
What It Gives
30x improvement in rows processed per second.
Initial prototype: 100M rows/sec on laptop
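The batch-at-a-time idea can be sketched in a few lines of Python. This is a toy illustration, not Hive's implementation: only the batch size of 1,000 comes from the slide, while the column layout, the filter-and-sum query and the function names are assumptions made for the example. It contrasts row-at-a-time evaluation with a version that walks column arrays in batches and replaces the per-row branch with arithmetic on a 0/1 mask.

```python
from typing import List, Tuple

BATCH_SIZE = 1000  # batch size quoted on the slide; everything else is hypothetical


def sum_row_at_a_time(rows: List[Tuple[int, int]], threshold: int) -> int:
    """Baseline: one row per iteration, with a branch inside the loop."""
    total = 0
    for price, qty in rows:
        if price > threshold:
            total += price * qty
    return total


def sum_vectorized(prices: List[int], qtys: List[int], threshold: int) -> int:
    """Columnar version: process fixed-size batches, no `if` in the inner loop."""
    total = 0
    for start in range(0, len(prices), BATCH_SIZE):
        p = prices[start:start + BATCH_SIZE]
        q = qtys[start:start + BATCH_SIZE]
        # (pi > threshold) is True/False, which Python treats as 1/0,
        # so the filter becomes a multiplication instead of a branch.
        total += sum(pi * qi * (pi > threshold) for pi, qi in zip(p, q))
    return total


rows = [(i % 50, i % 7) for i in range(2500)]
prices = [r[0] for r in rows]
qtys = [r[1] for r in rows]
assert sum_row_at_a_time(rows, 25) == sum_vectorized(prices, qtys, 25)
```

Pure Python will not show the cache and branch-prediction gains the slide describes; the sketch only shows the shape of the loop that makes those gains possible in a compiled engine.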
Hive-on-MR vs. Hive-on-Tez
SELECT a.x, AVG(b.y) AS avg_y
FROM a JOIN b ON (a.id = b.id) GROUP BY a.x
UNION SELECT x, AVG(y) AS avg_y
FROM c GROUP BY x
ORDER BY avg_y;
[Diagram: an example plan (joins of a with b and c, GROUP BY a.state, COUNT(*), average of c.price) on both engines. Hive on MapReduce runs it as a chain of separate map (M) and reduce (R) jobs, with each job's output written to HDFS before the next job reads it. Hive on Tez runs the same operators as a single DAG of map and reduce tasks.]
Tez avoids unneeded writes to HDFS
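The difference in HDFS traffic can be counted with a small Python sketch. It assumes a simplified model in which every plan stage is its own MapReduce job and every hand-off between jobs is one HDFS write; the `PLAN` dictionary and the function name are hypothetical, and real plans may fuse or split stages differently.

```python
from typing import Dict, List

# A hypothetical plan: each stage lists the stages whose output it consumes.
PLAN: Dict[str, List[str]] = {
    "join_a_b": [],
    "join_a_c": [],
    "group_by": ["join_a_b", "join_a_c"],
    "order_by": ["group_by"],
}


def intermediate_hdfs_writes(plan: Dict[str, List[str]], engine: str) -> int:
    """Count stage outputs materialized in HDFS before the final result."""
    consumed = {dep for deps in plan.values() for dep in deps}
    if engine == "mr":
        # Chained MapReduce jobs hand data to the next job through HDFS files.
        return len(consumed)
    if engine == "tez":
        # A single DAG passes data along its edges; only the final output lands in HDFS.
        return 0
    raise ValueError(f"unknown engine: {engine}")


print(intermediate_hdfs_writes(PLAN, "mr"))   # 3
print(intermediate_hdfs_writes(PLAN, "tez"))  # 0
```

Each avoided write also avoids the matching read and HDFS replication, which is where much of the latency saving comes from in this model.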
Tez Delivers Interactive Query - Out of the Box!
Feature: Description (Benefit)
Tez Session: Overcomes MapReduce job-launch latency by pre-launching the Tez AppMaster (Latency)
Tez Container Pre-Launch: Overcomes MapReduce latency by pre-launching hot containers ready to serve queries (Latency)
Tez Container Re-Use: Finished maps and reduces pick up more work rather than exiting. Reduces latency and eliminates difficult split-size tuning. Out-of-box performance! (Latency)
Runtime re-configuration of DAG: Runtime query tuning by picking aggregation parallelism using online query statistics (Throughput)
Tez In-Memory Cache: Hot data kept in RAM for fast access (Latency)
Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput (Throughput)
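As an illustration of "picking aggregation parallelism using online query statistics", the sketch below chooses a reducer count from the amount of data the upstream tasks actually produced rather than from a compile-time guess. The function `pick_reducers` and its parameters are hypothetical and are not Tez or Hive configuration names.

```python
import math


def pick_reducers(observed_bytes: int, bytes_per_reducer: int, max_reducers: int) -> int:
    """Choose a reducer count from the data volume observed at runtime."""
    needed = math.ceil(observed_bytes / bytes_per_reducer)
    # Always run at least one reducer, and never exceed the configured cap.
    return max(1, min(max_reducers, needed))


GIB = 2 ** 30
MIB = 2 ** 20
print(pick_reducers(10 * GIB, 256 * MIB, max_reducers=100))  # 40
print(pick_reducers(5 * MIB, 256 * MIB, max_reducers=100))   # 1
```

A highly selective filter upstream therefore leads to fewer, fuller reducers instead of many nearly empty ones.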
How Stinger Phase 3 Delivers Interactive Query
Feature: Description (Benefit)
Tez Integration: Tez is a significantly better engine than MapReduce (Latency)
Vectorized Query: Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time (Throughput)
Query Planner: Uses extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input (beyond partition pruning) (Latency)
Cost Based Optimizer (Optiq): Join re-ordering and other optimizations based on column statistics, including histograms (Latency)
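Join re-ordering from statistics can be sketched as follows. This is a brute-force stand-in, not Optiq's algorithm: the table names, row counts and selectivities are invented for the example, and the cost of a plan is taken to be the sum of estimated intermediate row counts for a left-deep join order.

```python
from itertools import permutations
from typing import Dict, FrozenSet, Sequence, Tuple

# Hypothetical statistics; a real optimizer would read these from the Metastore.
ROWS: Dict[str, int] = {"sales": 1_000_000, "items": 10_000, "stores": 100}
# Assumed selectivity per table pair; unlisted pairs are treated as cross joins.
SELECTIVITY: Dict[FrozenSet[str], float] = {
    frozenset(["sales", "items"]): 1 / 10_000,
    frozenset(["sales", "stores"]): 1 / 1_000,
}


def plan_cost(order: Sequence[str]) -> float:
    """Sum of estimated intermediate row counts for a left-deep join order."""
    joined = [order[0]]
    rows = float(ROWS[order[0]])
    cost = 0.0
    for table in order[1:]:
        sel = 1.0
        for prev in joined:
            sel *= SELECTIVITY.get(frozenset([prev, table]), 1.0)
        rows = rows * ROWS[table] * sel
        cost += rows
        joined.append(table)
    return cost


def best_order(tables: Sequence[str]) -> Tuple[str, ...]:
    """Try every join order and keep the cheapest one."""
    return min(permutations(tables), key=plan_cost)


best = best_order(list(ROWS))
print(best, plan_cost(best))
```

With these made-up numbers the cheapest orders join `sales` with `stores` first, because that join shrinks the intermediate result before `items` is brought in. Exhaustive search is only workable for a handful of tables; production optimizers prune the search space.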
Next Steps
 Blog
http://hortonworks.com/blog/delivering-on-stinger-a-phase-3-progress-update/
 Stinger Initiative
http://hortonworks.com/labs/stinger/
 Stinger Phase 3 Tech preview
 http://hortonworks.com/blog/announcing-stinger-phase-3-technical-preview/
 http://hadoopwrangler.com
Hortonworks: The Value of Open for You
Validate & Try
1. Download the
Hortonworks Sandbox
2. Learn Hadoop using the
technical tutorials
3. Investigate a business
case using the step-by-step
business case scenarios
4. Validate YOUR business
case using your data in
the sandbox
Connect With the Hadoop Community
We employ a large number of Apache project committers & innovators so
that you are represented in the open source community
Avoid Vendor Lock-In
Hortonworks Data Platform remains as close to the open source trunk as
possible and is developed 100% in the open so you are never locked in
The Partners you Rely On, Rely On Hortonworks
We work with partners to deeply integrate Hadoop with data center
technologies so you can leverage existing skills and investments
Certified for the Enterprise
We engineer, test and certify the Hortonworks Data Platform at scale to
ensure the reliability and stability you require for enterprise use
Support from the Experts
We provide the highest quality of support for deploying at scale. You are
supported by hundreds of years of Hadoop experience
Engage
1. Execute a Business Case
Discovery Workshop with
our architects
2. Build a business case for
Hadoop today


Editor's Notes

  • #2: Hello. Today I'm going to talk to you about Hortonworks and how we deliver an enterprise-ready Hadoop to enable your modern data architecture.
  • #3: Founded just 2.5 years ago by members of the original Hadoop team at Yahoo, Hortonworks has emerged as the leader in open source Hadoop. We are committed to ensuring Hadoop is an enterprise-viable data platform ready for your modern data architecture. Our team is probably the largest assembled team of Hadoop experts and active leaders in the community. We make sure Hadoop meets all your enterprise requirements, such as operations, reliability and security. It also needs to be packaged and tested, and we do this. And it has to work with what you have. Make Hadoop an enterprise data platform. Make the market function. Innovate core platform, data, & operational services; integrate deeply with enterprise ecosystem; provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects: address enterprise needs in community projects; establish Apache foundation projects as the standard; promote open community vs. vendor control / lock-in. Enable the Hadoop market to function: make it easy for enterprises to deploy at scale; be the best at enabling deep ecosystem integration; create a pull market with key strategic partners.
  • #8: The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak. It was a framework for massive distributed data processing, but it also had all of the job management capabilities built into it. The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop's cluster resource management in a way where MapReduce is NOW just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for Yet Another Resource Negotiator. [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years. This slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that's been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
  • #10: Platform Services: Workload Management, Multitenancy, HA, DR, Snapshots, Security. Data Services: Store, Process, Access, Lifecycle Management. Operational Services: Provision, Manage, Monitor. Interoperable: Tools (Business Analyst, Developer, Data Integration); Infrastructure (Data Systems, Systems Management); Deployment Platforms (OS, VM, Cloud, Appliance).
  • #13: Platform Services: Workload Management, Multitenancy, HA, DR, Snapshots, Security. Data Services: Store, Process, Access, Lifecycle Management. Operational Services: Provision, Manage, Monitor. Interoperable: Tools (Business Analyst, Developer, Data Integration); Infrastructure (Data Systems, Systems Management); Deployment Platforms (OS, VM, Cloud, Appliance).
  • #22: With Hive and Stinger we are focused on enabling the SQL ecosystem, and to do that we've put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types as well as implementing common SQL semantics seen in most databases.
  • #30: Query 52: star join followed by group/order (different keys), selective filter. Query 55: same.
  • #31: Query 28: 4-subquery join. Query 12: star join over a range of dates.
  • #32: Query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
  • #33: SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
  • #34: SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) as avgPageRank, SUM(adRevenue) as totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1
  • #37: Make Hadoop an enterprise data platform: innovate core platform, data, & operational services; integrate deeply with enterprise ecosystem; provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects: address enterprise needs in community projects; establish Apache foundation projects as the standard; promote open community vs. vendor control / lock-in. Enable the Hadoop market to function: make it easy for enterprises to deploy at scale; be the best at enabling deep ecosystem integration; create a pull market with key strategic partners.