
Cloudera Impala
Charm City Linux, March 2014

Alex Moundalexis
@technmsg

Thirty Seconds About Alex
•  Solutions Architect
•  aka consultant
•  government
•  infrastructure
•  former coder of Perl
•  former administrator
•  likes shiny objects

What Does Cloudera Do?
•  product
•  distribution of Hadoop components, Apache licensed
•  enterprise tooling
•  support
•  training
•  services (aka consulting)
•  community

Disclaimer
•  Cloudera builds software
•  most donated to Apache
•  some closed-source
•  Cloudera “products” I reference are open source
•  Apache Licensed
•  source code is on GitHub
•  https://github.com/cloudera

What This Talk Isn’t About
•  deploying
•  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  sizing & tuning
•  depends heavily on data and workload
•  coding
•  unless you count XML or CSV or SQL
•  algorithms

The Apache Hadoop Ecosystem
Quick and dirty, for context.

Why “Ecosystem?”
•  In the beginning, just Hadoop
•  HDFS
•  MapReduce
•  Today, dozens of interrelated components
•  I/O
•  Processing
•  Specialty Applications
•  Configuration
•  Workflow

HDFS
•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
•  http://research.google.com/archive/gfs.html

Lots of Commodity Machines
Image: Yahoo! Hadoop cluster [OSCON ’07]

MapReduce (MR)
•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
•  http://research.google.com/archive/mapreduce.html

Under the Covers
You specify map() and reduce() functions.
The framework does the rest.
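
To make the division of labor concrete, here is a minimal single-process sketch in Python of a word count. It is not Hadoop's API: `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and `run_job` only imitates what the framework would do (run the mapper, group values by key, run the reducer) without any distribution or fault tolerance.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """map(): emit (word, 1) for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """reduce(): collapse all values seen for one key into a single total."""
    return word, sum(counts)


def run_job(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Toy stand-in for the framework: map, shuffle by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": values grouped by key
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


lines = ["hadoop stores data", "impala queries data", "hive queries hadoop"]
word_counts = run_job(lines, map_words, reduce_counts)
print(word_counts)
```

Only the two small functions at the top are job-specific; everything in `run_job` is the part a real framework would take over and spread across machines.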

Apache Hive
•  Abstraction of Hadoop’s Java API
•  HiveQL “compiles” down to MR
•  a “SQL-like” language
•  Eases analysis using MapReduce

Apache Hive Metastore
•  Maps HDFS files to DB-like resources
•  Databases
•  Tables
•  Column/field names, data types
•  Roles/users
•  InputFormat/OutputFormat

But wait…
WHY DO WE NEED THIS?

Super Shady SQL Supplement
I am not a SQL wizard by any means…

A Simple Relational Database

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Interacting with Relational Data
SELECT * FROM people;


Requesting Specific Fields
SELECT name, state FROM people;


Requesting Specific Rows
SELECT name, state FROM people WHERE year … 2012;
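
The queries so far can be tried against a throwaway in-memory database. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in (it is not Impala or Hive), loading the four-row `people` table from the slides. The comparison operator in the row-filtering query did not survive in these slides, so the `>` used here is an assumption.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)

# Every column of every row.
all_rows = conn.execute("SELECT * FROM people").fetchall()

# Specific fields only: two columns, still every row.
name_state = conn.execute("SELECT name, state FROM people").fetchall()

# Specific rows only. Assumption: '>' stands in for the lost operator.
recent = conn.execute(
    "SELECT name, state FROM people WHERE year > 2012"
).fetchall()

print(len(all_rows), "rows in total")
print(name_state)
print(recent)
```

With `>` assumed, the filter keeps only the two rows whose year is 2013.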


Two Simple Tables

owner  species  name
Alex   Cactus   Marvin
Joey   Cat      Brain
Sean   None
Paris  Unknown

name   state     employer  year
Alex   Maryland  Cloudera  2013
Joey   Maryland  Cloudera  2011
Sean   Texas     Cloudera  2013
Paris  Maryland  AOL       2011

Joining Two Tables

SELECT people.name AS owner, people.state AS state, pets.name AS pet
FROM people LEFT JOIN pets ON people.name = pets.owner


owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas
Paris  Maryland

Varying Implementation of JOIN

owner  state     pet
Alex   Maryland  Marvin
Joey   Maryland  Brain
Sean   Texas     ?
Paris  Maryland  ?
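
The join can be reproduced with the same kind of throwaway setup. This sketch again uses Python's `sqlite3` as a stand-in engine, and it assumes the blank pet names for Sean and Paris are stored as NULL. That is one plausible reading of the "?" cells above, not something the slides confirm. The second query uses `COALESCE` to show one way a missing value could be rendered as a visible placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE people (name TEXT, state TEXT, employer TEXT, year INTEGER);
    CREATE TABLE pets (owner TEXT, species TEXT, name TEXT);
    """
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [
        ("Alex", "Maryland", "Cloudera", 2013),
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
        ("Paris", "Maryland", "AOL", 2011),
    ],
)
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        ("Alex", "Cactus", "Marvin"),
        ("Joey", "Cat", "Brain"),
        ("Sean", "None", None),      # assumption: blank pet name stored as NULL
        ("Paris", "Unknown", None),  # assumption: blank pet name stored as NULL
    ],
)

# The query from the slides, unchanged.
JOIN_SQL = """
    SELECT people.name AS owner, people.state AS state, pets.name AS pet
    FROM people LEFT JOIN pets ON people.name = pets.owner
"""
joined = conn.execute(JOIN_SQL).fetchall()
pet_by_owner = {owner: pet for owner, _state, pet in joined}

# Same join, but with NULL pet names replaced by a visible placeholder.
shown = conn.execute(
    "SELECT people.name, COALESCE(pets.name, '?') "
    "FROM people LEFT JOIN pets ON people.name = pets.owner"
).fetchall()

print(pet_by_owner)
print(shown)
```

In SQLite the missing names come back as NULL (`None` in Python); how a given engine or client displays them is exactly the kind of detail that varies.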

Cloudera Impala
Familiar interface, but more powerful.

Cloudera Impala
•  Interactive query on Hadoop
•  think seconds, not minutes
•  Nearly ANSI-92 standard SQL
•  compatible with HiveQL
•  Native MPP query engine
•  built for low-latency queries

Cloudera Impala – Design Choices
•  Native daemons, written in C/C++
•  No JVM, no MapReduce
•  Saturate disks on reads
•  Uses in-memory HDFS caching
•  Re-uses Hive metastore
•  Not as fault-tolerant as MapReduce

Cloudera Impala – Architecture
•  Impala Daemon
•  runs on every node
•  handles client requests
•  handles query planning and execution
•  State Store Daemon
•  provides name service
•  metadata distribution
•  used for finding data

Impala Query Execution

[Diagram: SQL App (ODBC); Hive Metastore, HDFS NN, Statestore; three nodes, each with Query Planner, Query Coordinator, Query Executor, HDFS DN, HBase]

1) Request arrives via ODBC/JDBC/HUE/Shell

2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data

4) Intermediate results are streamed between impalad(s)
5) Query results are streamed back to client
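
The five steps can be pictured with a toy, single-process model in Python. This is not Impala's code or API: `BLOCKS`, `plan`, `execute_fragment`, and `coordinate` are hypothetical names, the row placement is invented for illustration, and the fragments run one after another rather than in parallel. The streaming between impalads in step 4 is collapsed into the coordinator simply chaining each executor's output.

```python
from typing import Dict, Iterator, List, Tuple

Row = Tuple[str, str, str, int]

# Hypothetical placement of table rows on three nodes (stand-in for HDFS blocks).
BLOCKS: Dict[str, List[Row]] = {
    "node1": [("Alex", "Maryland", "Cloudera", 2013)],
    "node2": [
        ("Joey", "Maryland", "Cloudera", 2011),
        ("Sean", "Texas", "Cloudera", 2013),
    ],
    "node3": [("Paris", "Maryland", "AOL", 2011)],
}


def plan(min_year: int) -> List[Tuple[str, int]]:
    """Planner: produce one scan fragment per node that holds data locally."""
    return [(node, min_year) for node, rows in BLOCKS.items() if rows]


def execute_fragment(node: str, min_year: int) -> Iterator[Tuple[str, str]]:
    """Executor: scan only this node's rows, filter, and stream (name, state)."""
    for name, state, _employer, year in BLOCKS[node]:
        if year > min_year:
            yield name, state


def coordinate(min_year: int) -> Iterator[Tuple[str, str]]:
    """Coordinator: start each fragment and stream its rows back to the client."""
    for node, threshold in plan(min_year):
        yield from execute_fragment(node, threshold)


# The "client": consumes rows as the coordinator streams them back.
results = list(coordinate(2012))
print(results)
```

The point of the sketch is the shape of the flow: work is sent to where the data already lives, and rows are passed along as they are produced instead of being written out between stages.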

Cloudera Impala – Results
•  Allows for fast iteration/discovery
•  How much faster?
•  3-4x faster on I/O bound workloads
•  up to 45x faster on multi-MR queries
•  up to 90x faster on in-memory cache

Demo
Hold onto something, folks.

What’s Next?
•  Download Hadoop!
•  CDH available at www.cloudera.com
•  Already done that? Contribute…
•  Cloudera provides pre-loaded VMs
•  http://tiny.cloudera.com/quickstartvm
•  Clone our repos!
•  https://github.com/cloudera

Special thanks: PARIS

Questions?
Preferably related to the talk… or not.

Thank You!

Alex Moundalexis
@technmsg

We’re hiring, kids! Well, not kids.


Introduction to Cloudera Impala
