This document provides a high-level overview of the Apache Hadoop ecosystem and some of its core components like HDFS, MapReduce, and Apache Hive. It also includes a brief and simplified example of how to perform basic operations with SQL, such as selecting rows and fields, filtering results, and joining tables.

2. Thirty Seconds About Alex
• Solutions Architect
• aka consultant
• government
• infrastructure
• former coder of Perl
• former administrator
• likes shiny objects

3. What Does Cloudera Do?
• product
  • distribution of Hadoop components, Apache licensed
  • enterprise tooling
• support
• training
• services (aka consulting)
• community

4. Disclaimer
• Cloudera builds software
  • most donated to Apache
  • some closed-source
• Cloudera products I reference are open source
  • Apache Licensed
  • source code is on GitHub
  • https://github.com/cloudera

5. What This Talk Isn't About
• deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• sizing & tuning
  • depends heavily on data and workload
• coding
  • unless you count XML or CSV or SQL
• algorithms

7. Why Ecosystem?
• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

8. HDFS
• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html

10. MapReduce (MR)
• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google's paper
  • http://research.google.com/archive/mapreduce.html

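To make the "programming paradigm" bullet concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word count as the example. It is an illustration only and does not use Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, not taken from the slides.
    print(word_count(["Hadoop stores data", "MapReduce processes data"]))
```

In a real cluster the mappers and reducers would run on many machines and the shuffle would move data over the network; the sketch only shows the shape of the computation.
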
17. Super Shady SQL Supplement
I am not a SQL wizard by any means

18. A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

19. Interacting with Relational Data

SELECT * FROM people;

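The query above can be tried locally with a small sketch like the following. It assumes SQLite (via Python's standard `sqlite3` module) as a stand-in database, which the slides do not mention; the table name `people` comes from the query, while the column types and the helper names `make_people_db` and `select_all` are assumptions for illustration.

```python
import sqlite3

# Rows copied from the slide's example table.
PEOPLE = [
    ("Alex", "Maryland", "Cloudera", 2013),
    ("Joey", "Maryland", "Cloudera", 2011),
    ("Sean", "Texas", "Cloudera", 2013),
    ("Paris", "Maryland", "AOL", 2011),
]


def make_people_db():
    """Create an in-memory database holding the example table (types assumed)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", PEOPLE)
    return conn


def select_all(conn):
    """Return every row and every column of the table."""
    return conn.execute("SELECT * FROM people;").fetchall()


if __name__ == "__main__":
    for row in select_all(make_people_db()):
        print(row)
```

Without an ORDER BY clause, SQL does not promise any particular row order, so the printed order should not be relied upon.
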
21. Requesting Specific Fields

SELECT name, state FROM people;

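As a rough sketch of what requesting specific fields returns, the snippet below runs the same query against an in-memory SQLite table and also reports the column names of the result. SQLite, the column types, and the function name `select_name_and_state` are assumptions; only the query and the data come from the slides.

```python
import sqlite3


def select_name_and_state():
    """Build the example table in memory and project two of its four columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Alex", "Maryland", "Cloudera", 2013),
            ("Joey", "Maryland", "Cloudera", 2011),
            ("Sean", "Texas", "Cloudera", 2013),
            ("Paris", "Maryland", "AOL", 2011),
        ],
    )
    cursor = conn.execute("SELECT name, state FROM people;")
    # cursor.description holds one tuple per result column; item 0 is its name.
    columns = [col[0] for col in cursor.description]
    rows = cursor.fetchall()
    conn.close()
    return columns, rows


if __name__ == "__main__":
    columns, rows = select_name_and_state()
    print(columns)
    for row in rows:
        print(row)
```

Every row is still returned; only the `employer` and `year` columns are left out of the result.
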
23. Requesting Specific Rows

SELECT name, state FROM people WHERE year > 2012;

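Row filtering can be sketched the same way. The comparison operator in the WHERE clause did not survive in the slide text, so this example assumes `>` (people whose year is after 2012); swap in `<` if the other reading was intended. SQLite, the column types, and the function name `people_after` are likewise assumptions.

```python
import sqlite3


def people_after(year):
    """Return (name, state) for rows whose year is greater than `year`.

    Assumption: the slide's lost comparison operator was '>'.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Alex", "Maryland", "Cloudera", 2013),
            ("Joey", "Maryland", "Cloudera", 2011),
            ("Sean", "Texas", "Cloudera", 2013),
            ("Paris", "Maryland", "AOL", 2011),
        ],
    )
    # The value is passed as a bound parameter rather than pasted into the SQL.
    rows = conn.execute(
        "SELECT name, state FROM people WHERE year > ?;", (year,)
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(people_after(2012))
```

Under that assumption only the two rows with year 2013 (Alex and Sean) pass the filter.
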
25. Two Simple Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

26. Joining Two Tables

SELECT people.name AS owner,
       people.state AS state,
       pets.name AS pet
FROM people
LEFT JOIN pets ON people.name = pets.owner

29. Joining Two Tables

SELECT people.name AS owner,
       people.state AS state,
       pets.name AS pet
FROM people
LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

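The join can be reproduced with a short sketch, again assuming SQLite as a stand-in engine. The table names `people` and `pets` come from the query; the column types, the function name `join_people_pets`, and the choice to store the two blank pet names as NULL are assumptions made for illustration.

```python
import sqlite3


def join_people_pets():
    """Run the slide's LEFT JOIN and return (owner, state, pet) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
    )
    conn.execute("CREATE TABLE pets (owner TEXT, species TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?, ?)",
        [
            ("Alex", "Maryland", "Cloudera", 2013),
            ("Joey", "Maryland", "Cloudera", 2011),
            ("Sean", "Texas", "Cloudera", 2013),
            ("Paris", "Maryland", "AOL", 2011),
        ],
    )
    conn.executemany(
        "INSERT INTO pets VALUES (?, ?, ?)",
        [
            ("Alex", "Cactus", "Marvin"),
            ("Joey", "Cat", "Brain"),
            ("Sean", "None", None),      # assumption: blank pet name stored as NULL
            ("Paris", "Unknown", None),  # assumption: blank pet name stored as NULL
        ],
    )
    rows = conn.execute(
        """
        SELECT people.name AS owner,
               people.state AS state,
               pets.name AS pet
        FROM people
        LEFT JOIN pets ON people.name = pets.owner
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for owner, state, pet in join_people_pets():
        print(owner, state, pet)
```

In Python the missing pet names come back as `None`. A LEFT JOIN would also yield NULL in the `pet` column for any person with no row at all in `pets`, which is why it keeps every row from `people`.
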
30. Varying Implementation of JOIN

SELECT people.name AS owner,
       people.state AS state,
       pets.name AS pet
FROM people
LEFT JOIN pets ON people.name = pets.owner

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?

32. Cloudera Impala
• Interactive query on Hadoop
  • think seconds, not minutes
• Nearly ANSI-92 standard SQL
  • compatible with HiveQL
• Native MPP query engine
  • built for low-latency queries

33. Cloudera Impala Design Choices
• Native daemons, written in C/C++
• No JVM, no MapReduce
• Saturate disks on reads
• Uses in-memory HDFS caching
• Re-uses Hive metastore
• Not as fault-tolerant as MapReduce

34. Cloudera Impala Architecture
• Impala Daemon
  • runs on every node
  • handles client requests
  • handles query planning and execution
• State Store Daemon
  • provides name service
  • metadata distribution
  • used for finding data

36. Impala Query Execution
[Diagram: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data

37. Impala Query Execution
[Diagram: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase; query results flow back to the SQL App]
4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client

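The planner, coordinator, and executor roles described in the steps above can be pictured with a toy, single-process model. This is not Impala code and does not reflect its actual interfaces; the node names, the data layout in `NODE_DATA`, the "count people by state" request, and every function name are hypothetical.

```python
from collections import Counter

# Hypothetical layout: which node stores which (name, state) rows.
NODE_DATA = {
    "node1": [("Alex", "Maryland"), ("Joey", "Maryland")],
    "node2": [("Sean", "Texas")],
    "node3": [("Paris", "Maryland")],
}


def plan(nodes):
    """Planner: turn the request into one plan fragment per node holding data."""
    return [{"node": node, "op": "count_by_state"} for node in nodes]


def execute_fragment(fragment, node_data):
    """Executor: run a fragment only against the rows stored on its own node."""
    rows = node_data[fragment["node"]]
    return Counter(state for _name, state in rows)


def coordinate(node_data):
    """Coordinator: start each fragment where its data lives, then merge the
    partial (intermediate) results into the final answer for the client."""
    total = Counter()
    for fragment in plan(node_data):
        total += execute_fragment(fragment, node_data)
    return dict(total)


if __name__ == "__main__":
    print(coordinate(NODE_DATA))
```

The point of the sketch is the division of labor: work is sent to where the data already sits, and only small partial results travel back to be combined.
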
38. Cloudera Impala Results
• Allows for fast iteration/discovery
• How much faster?
  • 3-4x faster on I/O bound workloads
  • up to 45x faster on multi-MR queries
  • up to 90x faster on in-memory cache