This document presents Hortonworks and its mission to enable modern data architectures through Apache Hadoop. It covers Hortonworks' commitment to open source development through the Apache community, its engineering of Hadoop for enterprise use, and its integration of Hadoop with existing data center technologies. The document outlines the Hortonworks Data Platform (HDP) for storing, processing, and managing data in Hadoop, Hortonworks' contributions to Apache Hadoop and related projects, and the Stinger Initiative to enhance SQL capabilities and performance in Apache Hive.
1. Hortonworks: We Do Hadoop.
Our mission is to enable your Modern Data Architecture
by Delivering Enterprise Apache Hadoop
March 2014
2. Our Mission: Enable your Modern Data Architecture by Delivering Enterprise Apache Hadoop
Our Commitment
Open Leadership: Drive innovation in the open, exclusively via the Apache community-driven open source process
Enterprise Rigor: Engineer, test and certify Apache Hadoop with the enterprise in mind
Ecosystem Endorsement: Focus on deep integration with existing data center technologies and skills
Headquarters: Palo Alto, CA
Employees: 300+ and growing
Trusted Partners
4. Requirements for Enterprise Hadoop
1. Key Services: Platform, operational and data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments
[Architecture diagram: Core Services (HDFS for storage, YARN for resource management, MapReduce and Tez for processing); Data Services for data access, data movement and data security (Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox*); Operational Services for cluster management, dataset management and scheduling (Ambari, Oozie, Falcon*); Enterprise Readiness across all layers: High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots.]
5. HDP: A Complete Hadoop Distribution
1. Key Services: Platform, operational and data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments
[Architecture diagram: the Hortonworks Data Platform (HDP) packages the core, data and operational services -- HDFS, YARN, MapReduce, Tez, Hive & HCatalog, Pig, HBase, Sqoop, Flume, NFS, WebHDFS, Knox, Ambari, Oozie, Falcon -- plus load & extract tooling, with Enterprise Readiness features (High Availability, Disaster Recovery, Rolling Upgrades, Security and Snapshots), deployable across OS/VM, cloud and appliance form factors.]
6. Hadoop 2: The Introduction of YARN
Store all data in a single place, interact in multiple ways.
1st Gen of Hadoop -- Single Use System (Batch Apps): HDFS (redundant, reliable storage) plus MapReduce (cluster resource management & data processing).
Hadoop 2 -- Multi Use Data Platform (Batch, Interactive, Online, Streaming, ...): Redundant, Reliable Storage (HDFS) plus Efficient Cluster Resource Management & Shared Services (YARN), supporting Standard Query Processing (Hive, Pig), Batch (MapReduce), Interactive (Tez), Online Data Processing (HBase, Accumulo), Real-Time Stream Processing (Storm) and others.
7. Apache Hadoop YARN: The data operating system for Hadoop 2.0
Flexible: Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming
Efficient: Double processing IN Hadoop on the same hardware while providing predictable performance & quality of service
Shared: Provides a stable, reliable, secure foundation and shared operational services across multiple workloads
Data processing engines run natively IN Hadoop, on top of HDFS (redundant, reliable storage) and YARN (cluster resource management): BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm), IN-MEMORY (Spark), GRAPH (Giraph), SAS (LASR, HPA), ONLINE (HBase, Accumulo) and others.
8. Driving Our Innovation Through Apache
[Chart: Total Net Lines Contributed to Apache Hadoop -- 614,041, 449,768 and 147,933 net lines attributed to the largest contributors, including end users.]
[Chart: Total Number of Committers to Apache Hadoop -- 63 total, including Hortonworks 21, Yahoo 10, Cloudera 7, Facebook 5, IBM 3, LinkedIn 3, plus 10 others.]
Hortonworks' mission is to power your modern data architecture by enabling Hadoop to be an enterprise data platform that deeply integrates with your data center technologies.
Hortonworks committers and PMC members by Apache project:
Hadoop: 21 committers, 13 PMC members
Tez: 10 committers, 4 PMC members
Hive: 11 committers, 3 PMC members
HBase: 8 committers, 3 PMC members
Pig: 6 committers, 5 PMC members
Sqoop: 1 committer, 0 PMC members
Ambari: 20 committers, 12 PMC members
Knox: 6 committers, 2 PMC members
Falcon: 2 committers, 2 PMC members
Oozie: 2 committers, 2 PMC members
ZooKeeper: 2 committers, 1 PMC member
Flume: 1 committer, 0 PMC members
Accumulo: 2 committers, 2 PMC members
Storm: 1 committer, 0 PMC members
Drill: 1 committer, 0 PMC members
TOTAL: 95 committers, 48 PMC members
9. Patterns for Hadoop Applications
1. Key Services: Platform, operational and data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments
Application patterns by role:
DEVELOP: COLLECT, PROCESS, BUILD
ANALYZE: EXPLORE, QUERY, DELIVER
OPERATE: PROVISION, MANAGE, MONITOR
10. Familiar and Existing Tools
1. Key Services: Platform, operational and data services essential for the enterprise
2. Skills: Leverage your existing skills: development, analytics, operations
3. Integration: Interoperable with existing data center investments
[Diagram: the same DEVELOP / ANALYZE / OPERATE pattern (collect, process, build; explore, query, deliver; provision, manage, monitor) mapped to familiar partner tools such as BusinessObjects BI.]
11. SQL Interactive Query & Apache Hive
Apache Hive: the de facto standard for Hadoop SQL access; used by your current data center partners; built for batch AND interactive query.
Stinger Initiative: a broad, community-based effort to deliver the next generation of Apache Hive.
Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: Support the broadest range of SQL semantics for analytic applications against Hadoop
12. Requirements for Enterprise Hadoop
3. Integration: Interoperable with existing data center investments
Integrate with:
Applications: Business Intelligence, Developer IDEs, Data Integration
Systems: Data Systems & Storage, Systems Management
Platforms: Operating Systems, Virtualization, Cloud, Appliances
[Diagram: applications (business analytics, custom applications, packaged applications), data system repositories (RDBMS, EDW, MPP), sources (existing sources such as CRM, ERP, clickstream and logs; emerging sources such as sensor, sentiment, geo and unstructured data), plus operational tools (manage & monitor) and dev & data tools (build & test) surrounding the Hadoop platform.]
13. Broad Ecosystem Integration
[Diagram: partner logos arranged across the same layers -- applications (e.g. HANA, BusinessObjects BI), data systems (RDBMS, EDW, MPP), sources (existing: CRM, ERP, clickstream, logs; emerging: sensor, sentiment, geo, unstructured), operational tools, dev & data tools, and infrastructure.]
14. Apache Hive and Stinger: SQL in Hadoop
Arun Murthy (@acmurthy)
Alan Gates (@alanfgates)
Owen O'Malley (@owen_omalley)
@hortonworks
15. Stinger Project (announced February 2013)
Batch AND Interactive SQL-IN-Hadoop
Stinger Initiative: a broad, community-based effort to drive the next generation of Hive -- all IN Hadoop.
Goals:
Speed: Improve Hive query performance by 100x to allow for interactive query times (seconds)
Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop
Hive 0.11, May 2013: Base Optimizations; SQL Analytic Functions; ORCFile, Modern File Format
Hive 0.12, October 2013: VARCHAR, DATE Types; ORCFile predicate pushdown; Advanced Optimizations; Performance Boosts via YARN
Coming Soon: Hive on Apache Tez; Query Service; Buffer Cache; Cost Based Optimizer (Optiq); Vectorized Processing
16. Hive 0.12
Release Theme: Speed, Scale and SQL
Specific Features:
10x faster query launch when using a large number (500+) of partitions
ORCFile predicate pushdown speeds queries
Evaluate LIMIT on the map side
Parallel ORDER BY
New query optimizer
Introduces VARCHAR and DATE datatypes
GROUP BY on structs or unions
Included Components: Apache Hive 0.12
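To make two of the Hive 0.12 features above concrete, here is a minimal HiveQL sketch; the table and column names are hypothetical, and it assumes an ORC-backed, partitioned table so predicate pushdown applies:

    -- Hypothetical table using the VARCHAR and DATE types introduced in Hive 0.12,
    -- stored as ORCFile so queries benefit from predicate pushdown.
    CREATE TABLE web_orders (
      order_id   BIGINT,
      customer   VARCHAR(64),   -- new VARCHAR type
      order_date DATE,          -- new DATE type
      amount     DOUBLE
    )
    PARTITIONED BY (order_month STRING)
    STORED AS ORC;

    -- ORC predicate pushdown can skip stripes whose min/max statistics
    -- cannot satisfy this filter, on top of ordinary partition pruning.
    SELECT customer, SUM(amount)
    FROM web_orders
    WHERE order_month = '2013-10'
      AND order_date >= CAST('2013-10-15' AS DATE)
    GROUP BY customer;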
17. SPEED: Increasing Hive Performance
Performance improvements included in Hive 12:
Base & advanced query optimization
Startup time improvement
Join optimizations
Interactive query times across ALL use cases: simple and advanced queries in seconds; integrates seamlessly with existing tools.
Currently a >100x improvement in just nine months.
18. Stinger Phase 3: Unlocking Interactive Query
Stinger Phase 3: Features and Benefits
Container Pre-Launch: Overcomes Java VM startup latency by pre-launching hot containers ready to serve queries
Container Re-Use: Finished Maps and Reduces pick up more work rather than exiting; reduces latency and eliminates difficult split-size tuning
Tez Integration: Tez Broadcast Edge and Intermediate Reduce pattern improve query scale and throughput
In-Memory Cache: Hot data kept in RAM for fast access
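A hedged configuration sketch of how these features surface to a Hive user; the property names below (hive.execution.engine, hive.prewarm.*, tez.am.container.reuse.enabled) come from the Hive/Tez releases of this era and should be verified against your distribution:

    -- Session-level settings, illustrative only:
    SET hive.execution.engine=tez;            -- run Hive queries on Tez rather than MapReduce
    SET hive.prewarm.enabled=true;            -- pre-launch "hot" Tez containers for queries
    SET hive.prewarm.numcontainers=10;        -- how many containers to keep warm
    SET tez.am.container.reuse.enabled=true;  -- finished tasks pick up more work instead of exiting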
19. Stinger Phase 3: Speed, Scale, and SQL
Release Theme: Prove Hive for both large-scale and interactive SQL / analytics
Specific Features:
< 10s SQL queries over 200GB datasets through Hive
Tez container pre-launch
Tez container re-use
Use of Tez Intermediate Reduce pattern
In-memory HDFS caching
Made available as part of the Tech Preview for Stinger Phase 3
20. Stinger Phase 3: Beyond Tech Preview
Release Theme: Speed, SQL, and Security
Specific Features:
Hive-on-Tez: Interactive query on Hive
SQL Improvements:
Sub-queries in WHERE
Standard JOIN semantics
Support for Common Table Expressions (CTE)
Phase 1 of ACID semantics support
Automatic JOIN order optimization
CHAR datatype
PAM authentication support
SSL encryption
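A minimal HiveQL sketch of two of the SQL improvements listed above, a common table expression plus a sub-query in the WHERE clause; the tables and columns are hypothetical:

    -- Hypothetical tables: orders(customer_id, amount), flagged_customers(customer_id)
    WITH big_spenders AS (                  -- Common Table Expression (CTE)
      SELECT customer_id, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id
      HAVING SUM(amount) > 10000
    )
    SELECT customer_id, total
    FROM big_spenders
    WHERE customer_id NOT IN (              -- sub-query in the WHERE clause
      SELECT customer_id FROM flagged_customers
    );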
21. SQL: Enhancing SQL Semantics
SQL Compliance: Hive 12 provides a wide array of SQL datatypes and semantics so your existing tools integrate more seamlessly with Hadoop.
Hive SQL datatypes: INT, TINYINT/SMALLINT/BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, TIMESTAMP, BINARY, DECIMAL, ARRAY, MAP, STRUCT, UNION, DATE and VARCHAR are available in Hive 0.12; CHAR is on the roadmap.
Hive SQL semantics: SELECT, INSERT; GROUP BY, ORDER BY, SORT BY; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; windowing functions (OVER, RANK, etc.); custom Java UDFs; standard aggregation (SUM, AVG, etc.); and advanced UDFs (ngram, XPath, URL) are available. Roadmap items include sub-queries in WHERE and HAVING, expanded JOIN syntax, SQL-compliant security (GRANT, etc.), and INSERT/UPDATE/DELETE (ACID).
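For example, the windowing functions listed as available let a single query rank rows within groups; the table and columns here are hypothetical:

    -- Rank each customer's orders by amount using an OVER window
    SELECT customer_id,
           order_id,
           amount,
           RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC) AS amount_rank
    FROM orders;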
23. Vectorized Query Execution
Designed for Modern Processor Architectures: avoid branching in the inner loop; make the most use of L1 and L2 cache.
How It Works: process records in batches of 1,000 rows; generate code from templates to minimize branching.
What It Gives: 30x improvement in rows processed per second; initial prototype: 100M rows/sec on a laptop.
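Vectorization is a session-level switch in Hive; hive.vectorized.execution.enabled is the property introduced for this feature and it applies to ORC-backed tables, so treat this as a hedged sketch against the hypothetical table from the earlier example:

    -- Process ORC data in batches of roughly a thousand rows instead of row-at-a-time
    SET hive.vectorized.execution.enabled = true;

    -- The scan, filter and aggregation below can then run over row batches,
    -- cutting per-row branching and function-call overhead.
    SELECT order_month, COUNT(*), AVG(amount)
    FROM web_orders          -- hypothetical ORC table from the earlier sketch
    WHERE amount > 100
    GROUP BY order_month;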
26. Hive-on-MR vs. Hive-on-Tez
Example query (as shown on the slide):
SELECT a.x, AVERAGE(b.y) AS avg
FROM a JOIN b ON (a.id = b.id) GROUP BY a
UNION SELECT x, AVERAGE(y) AS AVG
FROM c GROUP BY x
ORDER BY AVG;
[Diagram: the Hive-on-MapReduce plan runs the stages (SELECT a.state, c.itemId; SELECT b.id; SELECT c.price; JOIN (a, b); JOIN (a, c); GROUP BY a.state; COUNT(*); AVERAGE(c.price)) as a chain of separate map/reduce jobs, writing intermediate results to HDFS between each job. The Hive-on-Tez plan expresses the same stages as a single DAG, so Tez avoids unneeded writes to HDFS.]
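To compare the two engines yourself, the execution engine is a per-session Hive setting (hive.execution.engine is the relevant property once Hive-on-Tez is available); the query on the slide is illustrative pseudo-SQL, so substitute your own tables:

    -- Run a query on classic MapReduce: intermediate results are written to HDFS
    -- between the chained map/reduce jobs.
    SET hive.execution.engine=mr;

    -- Run the same query on Tez: the plan executes as a single DAG, avoiding the
    -- unneeded intermediate HDFS writes shown on the slide.
    SET hive.execution.engine=tez;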
27. Tez Delivers Interactive Query - Out of the Box!
Feature / Description / Benefit:
Tez Session: Overcomes Map-Reduce job-launch latency by pre-launching the Tez AppMaster (Latency)
Tez Container Pre-Launch: Overcomes Map-Reduce latency by pre-launching hot containers ready to serve queries (Latency)
Tez Container Re-Use: Finished maps and reduces pick up more work rather than exiting; reduces latency and eliminates difficult split-size tuning; out-of-box performance (Latency)
Runtime re-configuration of DAG: Runtime query tuning by picking aggregation parallelism using online query statistics (Throughput)
Tez In-Memory Cache: Hot data kept in RAM for fast access (Latency)
Complex DAGs: Tez Broadcast Edge and Map-Reduce-Reduce pattern improve query scale and throughput (Throughput)
34. How Stinger Phase 3 Delivers Interactive Query
Feature / Description / Benefit:
Tez Integration: Tez is a significantly better engine than MapReduce (Latency)
Vectorized Query: Take advantage of modern hardware by processing thousand-row blocks rather than row-at-a-time (Throughput)
Query Planner: Uses the extensive statistics now available in the Metastore to better plan and optimize queries, including predicate pushdown during compilation to eliminate portions of input, beyond partition pruning (Latency)
Cost Based Optimizer (Optiq): Join re-ordering and other optimizations based on column statistics, including histograms, etc. (Latency)
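A hedged sketch of feeding the planner the Metastore statistics it relies on; ANALYZE TABLE ... COMPUTE STATISTICS is standard HiveQL, while the hive.cbo.enable switch for the Optiq-based optimizer appears in Hive releases of this period and may be absent or named differently in earlier builds (the table and columns are the hypothetical ones used above):

    -- Gather table/partition-level statistics into the Metastore
    ANALYZE TABLE web_orders PARTITION (order_month) COMPUTE STATISTICS;

    -- Gather column-level statistics for the optimizer
    -- (on partitioned tables some Hive versions require an explicit PARTITION spec)
    ANALYZE TABLE web_orders COMPUTE STATISTICS FOR COLUMNS customer, amount;

    -- Let the cost-based optimizer use those statistics for join re-ordering
    SET hive.cbo.enable=true;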
36. Hortonworks: The Value of Open for You
Validate & Try:
1. Download the Hortonworks Sandbox
2. Learn Hadoop using the technical tutorials
3. Investigate a business case using the step-by-step business case scenarios
4. Validate YOUR business case using your data in the sandbox
Connect With the Hadoop Community: We employ a large number of Apache project committers & innovators so that you are represented in the open source community.
Avoid Vendor Lock-In: The Hortonworks Data Platform remains as close to the open source trunk as possible and is developed 100% in the open, so you are never locked in.
The Partners You Rely On, Rely On Hortonworks: We work with partners to deeply integrate Hadoop with data center technologies so you can leverage existing skills and investments.
Certified for the Enterprise: We engineer, test and certify the Hortonworks Data Platform at scale to ensure the reliability and stability you require for enterprise use.
Support from the Experts: We provide the highest quality of support for deploying at scale; you are supported by hundreds of years of combined Hadoop experience.
Engage:
1. Execute a Business Case Discovery Workshop with our architects
2. Build a business case for Hadoop today
Editor's Notes
#2: Hello. Today I'm going to talk to you about Hortonworks and how we deliver an enterprise-ready Hadoop to enable your modern data architecture.
#3: Founded just 2.5 years ago by members of the original Hadoop team at Yahoo, Hortonworks emerged as the leader in open source Hadoop. We are committed to ensuring Hadoop is an enterprise-viable data platform ready for your modern data architecture. Our team is probably the largest assembled team of Hadoop experts and active leaders in the community. We not only make sure Hadoop meets all your enterprise requirements like operations, reliability & security; it also needs to be packaged & tested, and we do this. It has to work with what you have. Make Hadoop an enterprise data platform. Make the market function. Innovate core platform, data & operational services. Integrate deeply with the enterprise ecosystem. Provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects. Address enterprise needs in community projects. Establish Apache Foundation projects as the standard. Promote open community vs. vendor control / lock-in. Enable the Hadoop market to function. Make it easy for enterprises to deploy at scale. Be the best at enabling deep ecosystem integration. Create a pull market with key strategic partners.
#8: The first wave of Hadoop was about HDFS and MapReduce, where MapReduce had a split brain, so to speak: it was a framework for massive distributed data processing, but it also had all of the job management capabilities built into it. The second wave of Hadoop is upon us, and a component called YARN has emerged that generalizes Hadoop's cluster resource management in a way where MapReduce is now just one of many frameworks or applications that can run atop YARN. Simply put, YARN is the distributed operating system for data processing applications. For those curious, YARN stands for Yet Another Resource Negotiator. [CLICK] As I like to say, YARN enables applications to run natively IN Hadoop versus ON HDFS or next to Hadoop. [CLICK] Why is that important? Businesses do NOT want to stovepipe clusters based on batch processing versus interactive SQL versus online data serving versus real-time streaming use cases. They're adopting a big data strategy so they can get ALL of their data in one place and access that data in a wide variety of ways, with predictable performance and quality of service. [CLICK] This second wave of Hadoop represents a major rearchitecture that has been underway for 3 or 4 years, and this slide shows just a sampling of open source projects that are or will be leveraging YARN in the not so distant future. For example, engineers at Yahoo have shared open source code that enables Twitter Storm to run on YARN. Apache Giraph is a graph processing system that is YARN enabled. Spark is an in-memory data processing system built at Berkeley that's been recently contributed to the Apache Software Foundation. OpenMPI is an open source Message Passing Interface system for HPC that works on YARN. These are just a few examples.
#22: With Hive and Stinger we are focused on enabling the SQL ecosystem, and to do that we've put Hive on a clear roadmap to SQL compliance. That includes adding critical datatypes like character and date types, as well as implementing common SQL semantics seen in most databases.
#30: Query 52: star join followed by group/order (different keys), selective filter. Query 55: same.
#31: Query 28: 4-subquery join. Query 12: star join over a range of dates.
#32: query 1: SELECT pageURL, pageRank FROM rankings WHERE pageRank > X
#33: SELECT SUBSTR(sourceIP, 1, X), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, X)
#34: SELECT sourceIP, totalRevenue, avgPageRank FROM (SELECT sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X') GROUP BY UV.sourceIP) ORDER BY totalRevenue DESC LIMIT 1
#37: Make Hadoop an enterprise data platform. Innovate core platform, data & operational services. Integrate deeply with the enterprise ecosystem. Provide world-class enterprise support. Drive 100% open source software development and releases through the core Apache projects. Address enterprise needs in community projects. Establish Apache Foundation projects as the standard. Promote open community vs. vendor control / lock-in. Enable the Hadoop market to function. Make it easy for enterprises to deploy at scale. Be the best at enabling deep ecosystem integration. Create a pull market with key strategic partners.