Big Data and Containers
Who am I?
Charles Smith
@charles_s_smith
Netflix / Lead the big data platform architecture team
Spend my time / Thinking about how to make it easy and efficient to work with big data
University of Florida / PhD in Computer Science
"It is important that we know where we come from, because if you do not know where you come from, then you don't know where you are, and if you don't know where you are, you don't know where you're going. And if you don't know where you're going, you're probably going wrong."
Terry Pratchett
Database → Distributed Database → Distributed Storage → Distributed Processing → ???
Why do we care about containers?
Containers ~= Virtual Machines
Virtual Machines ~= Servers
Lightweight
  Fast to start
  Low memory overhead
Secure
  Process isolation
  Data isolation
Portable
Composable
Reproducible
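Because containers are cheap to start, it becomes practical to give every short-lived task its own isolated, disposable container. A minimal sketch using the docker-py SDK; the image and command are placeholders, not from the talk:

    # Per-task container launch via the docker-py SDK.
    # Image and command are placeholders.
    import docker

    client = docker.from_env()

    # Each short-lived task gets its own isolated, disposable container;
    # start-up is fast enough to do this at scale.
    output = client.containers.run(
        "python:3.11-slim",                      # hypothetical task image
        ["python", "-c", "print('task done')"],
        remove=True,                             # discard the container on exit
    )
    print(output.decode())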
Everything old is new
Microservices and large architectures
Data storage
(Cassandra, MySQL, MongoDB, etc.)
Operational
(Mesos, Kubernetes, etc.)
Discovery/Routing
What's different about big data?
Data at rest
Data in motion
Customer Facing
  Minimize latency
  Maximize reliability
Data Analytics
  Minimize I/O
  Maximize processing
  Ship computation to data (see the sketch below)
The questions you can answer aren't predefined
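Why ship computation to data? A back-of-envelope sketch; the sizes and link speed below are illustrative assumptions, not figures from the talk:

    # Back-of-envelope: move the data, or move the code?
    # All numbers are illustrative assumptions.
    data_bytes = 1 * 10**12            # a 1 TB table partition
    code_bytes = 50 * 10**6            # a 50 MB job artifact (jar, container layer)
    link_bytes_per_sec = 1.25 * 10**9  # ~10 Gb/s network link

    print(f"ship the data: {data_bytes / link_bytes_per_sec:.0f} s")   # ~800 s
    print(f"ship the code: {code_bytes / link_bytes_per_sec:.3f} s")   # ~0.04 s

Moving the job to the node that already has the data turns a minutes-long transfer into a rounding error.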
[Diagram: Hive/Pig/MR and Presto, sitting on top of Metacat and the Hive Metastore]
That doesn't look very container-y
(or microservice-y, for that matter)
Data storage - HDFS (or, in our case, S3)
Operational - YARN
Containers - JVM
So what happens when you want to do something else?
Big data and containers
But is that really the way we want to approach containers?
What's different about big data?
Running many different short-lived processes
Efficient container construction, allocation, and movement
Groups of processes having meaning
How we observe processes needs to be holistic
Processes need to be scheduled by data locality
(And not just data locality for data at rest)
A special case of affinity (although possibly over time)
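A minimal sketch of what locality-aware placement could look like: prefer a host that already holds the task's data, then fall back to rack locality. The host layout and scoring are illustrative assumptions, not a real scheduler:

    # Illustrative data-locality scheduling; layout and scoring are assumptions.
    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        rack: str
        cached_blocks: set   # data blocks already resident on this host

    def placement_score(host, task_blocks, preferred_rack):
        """Higher is better: node-local beats rack-local beats anywhere."""
        if task_blocks & host.cached_blocks:
            return 2         # node-local: the data is already here
        if host.rack == preferred_rack:
            return 1         # rack-local: cheap to pull the data over
        return 0             # remote: last resort

    def schedule(task_blocks, preferred_rack, hosts):
        return max(hosts, key=lambda h: placement_score(h, task_blocks, preferred_rack))

    hosts = [Host("h1", "rack-a", {"block-7"}),
             Host("h2", "rack-a", set()),
             Host("h3", "rack-b", set())]
    print(schedule({"block-7"}, "rack-a", hosts).name)   # -> h1

Affinity over time would add a term for where the data is about to be, not just where it is now.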
but...
We do need a data discovery service.
(kind of maybe a namenode?)
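If that discovery service is kind-of a namenode, here is a minimal sketch of the surface it might expose; the names and API are hypothetical:

    # Namenode-ish data discovery sketch; names and API are hypothetical.
    class DataDiscovery:
        def __init__(self):
            self._locations = {}   # dataset/partition -> hosts holding it

        def register(self, dataset, host):
            self._locations.setdefault(dataset, []).append(host)

        def locate(self, dataset):
            """Where does this data live right now, at rest or in motion?"""
            return self._locations.get(dataset, [])

    dd = DataDiscovery()
    dd.register("view_history/20150101", "h1")
    print(dd.locate("view_history/20150101"))   # -> ['h1']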
SELECT
  t.title_id,
  t.title_desc,
  SUM(v.view_secs)
FROM view_history AS v
  JOIN title_d AS t
    ON v.title_id = t.title_id
WHERE v.view_dateint > 20150101
GROUP BY 1, 2;
[Query plan DAG: LOAD (view_history) and LOAD (title_d) → JOIN → GROUP]
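That plan is just a small DAG. A sketch of one way to represent it, mapping each stage to its inputs; the structure is illustrative, not the planner's actual format:

    # The compiled plan as a tiny DAG: stage -> input stages. Illustrative only.
    plan = {
        "LOAD view_history": [],
        "LOAD title_d":      [],
        "JOIN on title_id":  ["LOAD view_history", "LOAD title_d"],
        "GROUP BY 1,2":      ["JOIN on title_id"],
    }

Each stage can run as one or more containers; the edges say which outputs must exist (and be reachable) before a stage starts.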
Data Discovery
Query Compiler
Query Planner
Metadata
DAG Watcher
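A sketch of the DAG watcher's job: launch each stage's containers once its inputs have finished, then poll until the whole DAG is done. The callbacks into the container runtime are hypothetical:

    # Illustrative DAG-watcher loop; `launch` and `is_finished` are hypothetical
    # callbacks into whatever runs the containers (Mesos, Kubernetes, YARN, ...).
    import time

    def watch(dag, launch, is_finished, poll_secs=1.0):
        """dag: stage -> list of input stages (see the plan sketch above)."""
        done, running = set(), set()
        while len(done) < len(dag):
            for stage, inputs in dag.items():
                if stage not in done and stage not in running \
                        and all(i in done for i in inputs):
                    launch(stage)        # start the stage's containers
                    running.add(stage)
            for stage in list(running):
                if is_finished(stage):   # poll the container runtime
                    running.remove(stage)
                    done.add(stage)
            time.sleep(poll_secs)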
Bottom line
Containers provide process-level security
The goal should be to minimize monoliths
This isn't different from what we are doing already
Our languages are abstractions over composable, distributed processing
Different big data projects should share services
No matter what we do, joining is going to be a big problem (see the sketch below)
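Why is joining such a problem? Both sides of the join have to agree on where every key lives, which forces a shuffle across the cluster. A minimal hash-partitioned shuffle-join sketch; the data and partition count are illustrative:

    # Minimal hash-partitioned shuffle join. Data and partition count are
    # illustrative; real engines use a stable hash so partitions agree
    # across machines and runs.
    N_PARTITIONS = 4

    def shuffle(rows, key_index):
        """Route every row to the partition that owns its key."""
        parts = [[] for _ in range(N_PARTITIONS)]
        for row in rows:
            parts[hash(row[key_index]) % N_PARTITIONS].append(row)
        return parts

    views  = [("t1", 120), ("t2", 30), ("t1", 45)]        # (title_id, view_secs)
    titles = [("t1", "Stranger Things"), ("t2", "Narcos")]

    joined = []
    for v_part, t_part in zip(shuffle(views, 0), shuffle(titles, 0)):
        lookup = dict(t_part)            # build the small side per partition
        for title_id, secs in v_part:
            if title_id in lookup:
                joined.append((title_id, lookup[title_id], secs))
    print(joined)

Every row crosses the network to reach its partition; at big-data scale that shuffle dominates the job, which is why the join stays hard no matter where the containers run.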
Questions?
Editor's Notes
  • #8: This is a good thing!
  • #9: Something that is ingrained at Netflix
  • #10: Decentralized
  • #11: Basically do I deploy and get resources?
  • #14: Think of it this way: our content is data at rest, a bunch of encodings sitting on an Open Connect server somewhere. When someone wants to view something, that data is streamed to them, data in motion (a huge chunk of the downstream bandwidth). And the actual viewing of the content is the visualization of the data. You can extend this pattern to other services. Don't go overboard, but it is a useful way to think about data, especially when the data starts to get big.
  • #19: But that isn't really what we do.
  • #28: As a result the allocations need to be fast and scalable.
  • #29: As a result the allocations need to be fast and scalable.