Big Data and Containers
Who am I?
Charles Smith
@charles_s_smith
Netflix / Lead the big data platform architecture team
Spend my time / Thinking about how to make it easy and efficient to work with big data
University of Florida / PhD in Computer Science
"It is important that we know where we come from, because if you do not know where you come from, then you don't know where you are, and if you don't know where you are, you don't know where you're going. And if you don't know where you're going, you're probably going wrong."
Terry Pratchett
Database → Distributed Database → Distributed Storage → Distributed Processing → ???
Why do we care about containers?
Containers ~= Virtual Machines
Virtual Machines ~= Servers
Lightweight
  Fast to start
  Low memory overhead
Secure
  Process isolation
  Data isolation
Portable
Composable
Reproducible
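Because containers are cheap to start, it becomes practical to give every short-lived task its own isolated, disposable container. A minimal sketch using the docker-py SDK; the image and command are placeholders, not from the talk:

    # Per-task container launch via the docker-py SDK.
    # Image and command are placeholders.
    import docker

    client = docker.from_env()

    # Each short-lived task gets its own isolated, disposable container;
    # start-up is fast enough to do this at scale.
    output = client.containers.run(
        "python:3.11-slim",                      # hypothetical task image
        ["python", "-c", "print('task done')"],
        remove=True,                             # discard the container on exit
    )
    print(output.decode())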
Everything old is new
Microservices and large architectures
Data storage
(Cassandra, MySQL, MongoDB, etc.)
Operational
(Mesos, Kubernetes, etc.)
Discovery/Routing
What's different about big data?
Data at rest
Data in motion
Customer Facing
  Minimize latency
  Maximize reliability
Data Analytics
  Minimize I/O
  Maximize processing
  Ship computation to data (see the sketch below)
The questions you can answer aren't predefined
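Why ship computation to data? A back-of-envelope sketch; the sizes and link speed below are illustrative assumptions, not figures from the talk:

    # Back-of-envelope: move the data, or move the code?
    # All numbers are illustrative assumptions.
    data_bytes = 1 * 10**12            # a 1 TB table partition
    code_bytes = 50 * 10**6            # a 50 MB job artifact (jar, container layer)
    link_bytes_per_sec = 1.25 * 10**9  # ~10 Gb/s network link

    print(f"ship the data: {data_bytes / link_bytes_per_sec:.0f} s")   # ~800 s
    print(f"ship the code: {code_bytes / link_bytes_per_sec:.3f} s")   # ~0.04 s

Moving the job to the node that already has the data turns a minutes-long transfer into a rounding error.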
[Diagram: Hive/Pig/MR and Presto, sitting on top of Metacat and the Hive Metastore]
That doesn't look very container-y
(or microservice-y, for that matter)
Data storage - HDFS (or, in our case, S3)
Operational - YARN
Containers - JVM
So what happens when you want to do something else?
Big data and containers
But is that really the way we want to approach containers?
What's different about big data?
Running many different short-lived processes
Efficient container construction, allocation, and movement
Groups of processes having meaning
How we observe processes needs to be holistic
Processes need to be scheduled by data locality
(And not just data locality for data at rest)
A special case of affinity (although possibly over time)
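A minimal sketch of what locality-aware placement could look like: prefer a host that already holds the task's data, then fall back to rack locality. The host layout and scoring are illustrative assumptions, not a real scheduler:

    # Illustrative data-locality scheduling; layout and scoring are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        rack: str
        cached_blocks: set   # data blocks already resident on this host

    def placement_score(host, task_blocks, preferred_rack):
        """Higher is better: node-local beats rack-local beats anywhere."""
        if task_blocks & host.cached_blocks:
            return 2         # node-local: the data is already here
        if host.rack == preferred_rack:
            return 1         # rack-local: cheap to pull the data over
        return 0             # remote: last resort

    def schedule(task_blocks, preferred_rack, hosts):
        return max(hosts, key=lambda h: placement_score(h, task_blocks, preferred_rack))

    hosts = [Host("h1", "rack-a", {"block-7"}),
             Host("h2", "rack-a", set()),
             Host("h3", "rack-b", set())]
    print(schedule({"block-7"}, "rack-a", hosts).name)   # -> h1

Affinity over time would add a term for where the data is about to be, not just where it is now.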
but...
We do need a data discovery service.
(kind of maybe a namenode?)
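If that discovery service is kind-of a namenode, here is a minimal sketch of the surface it might expose; the names and API are hypothetical:

    # Namenode-ish data discovery sketch; names and API are hypothetical.
    class DataDiscovery:
        def __init__(self):
            self._locations = {}   # dataset/partition -> hosts holding it

        def register(self, dataset, host):
            self._locations.setdefault(dataset, []).append(host)

        def locate(self, dataset):
            """Where does this data live right now, at rest or in motion?"""
            return self._locations.get(dataset, [])

    dd = DataDiscovery()
    dd.register("view_history/20150101", "h1")
    print(dd.locate("view_history/20150101"))   # -> ['h1']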
SELECT
  t.title_id,
  t.title_desc,
  SUM(v.view_secs)
FROM view_history AS v
  JOIN title_d AS t
    ON v.title_id = t.title_id
WHERE v.view_dateint > 20150101
GROUP BY 1, 2;
[Query plan DAG: LOAD (view_history) and LOAD (title_d) → JOIN → GROUP]
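That plan is just a small DAG. A sketch of one way to represent it, mapping each stage to its inputs; the structure is illustrative, not the planner's actual format:

    # The compiled plan as a tiny DAG: stage -> input stages. Illustrative only.
    plan = {
        "LOAD view_history": [],
        "LOAD title_d":      [],
        "JOIN on title_id":  ["LOAD view_history", "LOAD title_d"],
        "GROUP BY 1,2":      ["JOIN on title_id"],
    }

Each stage can run as one or more containers; the edges say which outputs must exist (and be reachable) before a stage starts.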
Data Discovery
Query Compiler
Query Planner
Metadata
DAG Watcher
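A sketch of the DAG watcher's job: launch each stage's containers once its inputs have finished, then poll until the whole DAG is done. The callbacks into the container runtime are hypothetical:

    # Illustrative DAG-watcher loop; `launch` and `is_finished` are hypothetical
    # callbacks into whatever runs the containers (Mesos, Kubernetes, YARN, ...).
    import time

    def watch(dag, launch, is_finished, poll_secs=1.0):
        """dag: stage -> list of input stages (see the plan sketch above)."""
        done, running = set(), set()
        while len(done) < len(dag):
            for stage, inputs in dag.items():
                if stage not in done and stage not in running \
                        and all(i in done for i in inputs):
                    launch(stage)        # start the stage's containers
                    running.add(stage)
            for stage in list(running):
                if is_finished(stage):   # poll the container runtime
                    running.remove(stage)
                    done.add(stage)
            time.sleep(poll_secs)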
Bottom line
Containers provide process-level security
The goal should be to minimize monoliths
This isn't different from what we are doing already
Our languages are abstractions over composable, distributed processing
Different big data projects should share services
No matter what we do, joining is going to be a big problem (see the sketch below)
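Why is joining such a problem? Both sides of the join have to agree on where every key lives, which forces a shuffle across the cluster. A minimal hash-partitioned shuffle-join sketch; the data and partition count are illustrative:

    # Minimal hash-partitioned shuffle join. Data and partition count are
    # illustrative; real engines use a stable hash so partitions agree
    # across machines and runs.
    N_PARTITIONS = 4

    def shuffle(rows, key_index):
        """Route every row to the partition that owns its key."""
        parts = [[] for _ in range(N_PARTITIONS)]
        for row in rows:
            parts[hash(row[key_index]) % N_PARTITIONS].append(row)
        return parts

    views  = [("t1", 120), ("t2", 30), ("t1", 45)]        # (title_id, view_secs)
    titles = [("t1", "Stranger Things"), ("t2", "Narcos")]

    joined = []
    for v_part, t_part in zip(shuffle(views, 0), shuffle(titles, 0)):
        lookup = dict(t_part)            # build the small side per partition
        for title_id, secs in v_part:
            if title_id in lookup:
                joined.append((title_id, lookup[title_id], secs))
    print(joined)

Every row crosses the network to reach its partition; at big-data scale that shuffle dominates the job, which is why the join stays hard no matter where the containers run.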
Questions?
Editor's Notes
  • #8: This is a good thing!
  • #9: Something that is ingrained at Netflix
  • #10: Decentralized
  • #11: Basically do I deploy and get resources?
  • #14: Think of it this way: our content is data at rest, a bunch of encodings sitting on an Open Connect server somewhere. When someone wants to view something, that data is streamed to them, data in motion (a huge chunk of the downstream bandwidth). And the actual viewing of the content is the visualization of the data. You can extend this pattern to other services. Don't go overboard, but it is a useful way to think about data, especially when the data starts to get big.
  • #19: But that isn't really what we do.
  • #28: As a result the allocations need to be fast and scalable.
  • #29: As a result the allocations need to be fast and scalable.