This document discusses how containers can be used for big data workloads. It notes that containers provide lightweight, OS-level virtualization with far less overhead than full virtual machines. The document outlines how containers can help with distributed processing and storage of big data, including shipping computation to the data and scheduling processes based on data locality. Overall, it argues that containers are well suited to big data applications because they let distributed, short-lived processes run efficiently near the data they operate on.
1. Big Data and Containers
Charles Smith
@charles_s_smith
2. Netflix / Lead the big data platform architecture team
Spend my time / Thinking how to make it easy/efficient to work with big data
University of Florida / PhD in Computer Science
Who am I?
3. It is important that we know where we come from, because
if you do not know where you come from, then you don't
know where you are, and if you don't know where you are,
you don't know where you're going. And if you don't know
where you're going, you're probably going wrong.
Terry Pratchett
30. Groups of processes have meaning
How we observe processes needs to be holistic
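A rough sketch of what holistic, group-level observation might look like. This is not from the talk; the job_id label and the metric shape are invented for illustration.

from collections import defaultdict

# Roll per-container metrics up to the job they belong to, so the unit of
# observation is the group of processes, not the individual container.
def rollup_by_job(container_metrics):
    jobs = defaultdict(lambda: {"cpu_seconds": 0.0, "bytes_read": 0, "containers": 0})
    for m in container_metrics:
        job = jobs[m["labels"]["job_id"]]  # shared label that gives the group meaning
        job["cpu_seconds"] += m["cpu_seconds"]
        job["bytes_read"] += m["bytes_read"]
        job["containers"] += 1
    return dict(jobs)

sample = [
    {"labels": {"job_id": "etl-42"}, "cpu_seconds": 12.5, "bytes_read": 1 << 30},
    {"labels": {"job_id": "etl-42"}, "cpu_seconds": 11.0, "bytes_read": 1 << 29},
]
print(rollup_by_job(sample))  # one row for the job, not two rows for the containers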
31. Processes need to be scheduled by data locality
(And not just data locality for data at rest)
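For concreteness, a hedged sketch of what locality-aware scheduling could look like; the scoring function, task shape, and weights below are invented, not the talk's scheduler. It treats locality for data at rest (block replicas on disk) and data in motion (upstream tasks already placed) as a single affinity score.

# Score each candidate node by how much of the task's input is local to it.
def locality_score(node, task, block_locations, placements):
    at_rest = sum(1 for b in task["input_blocks"]
                  if node in block_locations.get(b, ()))  # replicas on this node
    in_motion = sum(1 for up in task["upstream_tasks"]
                    if placements.get(up) == node)        # producers on this node
    return at_rest + 2 * in_motion  # arbitrary choice: weight streaming inputs higher

def pick_node(nodes, task, block_locations, placements):
    return max(nodes, key=lambda n: locality_score(n, task, block_locations, placements))

task = {"input_blocks": ["blk-1", "blk-2"], "upstream_tasks": ["map-7"]}
block_locations = {"blk-1": {"node-a"}, "blk-2": {"node-a", "node-b"}}
placements = {"map-7": "node-b"}
print(pick_node(["node-a", "node-b"], task, block_locations, placements))  # node-b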
32. Processes need to be scheduled by data locality
(And not just data locality for data at rest)
A special case of affinity (although possibly over time)
but...
33. We do need a data discovery service.
(something like a namenode, maybe?)
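A minimal sketch of such a discovery service, assuming nothing beyond a dataset-to-locations registry; the names and API are invented. A real service would also need replication state, leases, and staleness handling, which is roughly where the namenode comparison comes from.

class DataDiscovery:
    def __init__(self):
        self._locations = {}  # dataset id -> set of node ids holding a replica

    def register(self, dataset, node):
        # A node announces that it holds (a replica of) a dataset.
        self._locations.setdefault(dataset, set()).add(node)

    def locate(self, dataset):
        # Tell the scheduler where this dataset can be found right now.
        return self._locations.get(dataset, set())

registry = DataDiscovery()
registry.register("ratings/2016-05", "node-a")
registry.register("ratings/2016-05", "node-b")
print(registry.locate("ratings/2016-05"))  # {'node-a', 'node-b'}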
36. Bottom line
Containers provide process level security
The goal should be to minimize monoliths
This isn't different from what we are doing already
Our languages are abstractions of composable, distributed processing
Different big data projects should share services
No matter what we do, joining is going to be a big problem
#14: Think of it this way:
Our content is data at rest: a bunch of encodings sitting on an Open Connect server somewhere.
When someone wants to view something, the data is streamed to them: data in motion (accounting for a huge chunk of downstream bandwidth).
And the actual viewing of the content is the visualization of the data.
You can extend this pattern to other services. Don't go overboard, but it is a useful way to think about data, especially when the data starts to get big.