The document discusses Hadoop Virtualization Extensions (HVE), a project that refines Hadoop to better support running on virtualized infrastructure. HVE adds additional network topology layers, extends replica placement and balancing policies to be aware of virtualization, and updates task scheduling policies. These changes aim to enable features like multiple data nodes per host, separation of compute and data nodes without losing data locality, and improved reliability and performance when Hadoop is run in a virtualized environment. Patches from the HVE project are contributed back to the Apache Hadoop community.
Hadoop virtualization extensions hadoop world meetup
2. Project HVE (Hadoop Virtualization Extensions)
• Refine Hadoop for running on virtualized infrastructure
• Enable multiple-layer network topology
• Enable resource sharing
• Enable compute/data node separation without losing locality
• Patches are contributed back to the Apache Hadoop community
• http://www.vmware.com/hadoop
• Umbrella JIRA: HADOOP-8468
• Sub-JIRAs: HADOOP-8469, HADOOP-8470, HADOOP-8817, HDFS-3495, HDFS-3498, HDFS-3461, MAPREDUCE-4660, YARN-18, etc.
3. Current Network Topology
• D = data center
• R = rack
• H = host

Topology levels, from root to leaf:
/
D1 D2
R1 R2 R3 R4
H1 H2 H3 H4 H5 H6 H7 H8 H9 H10 H11 H12

However, you have more choices on virtualized infrastructure:
• C = compute node (TaskTracker)
• D = data node
5. Additional network topology layer for virtualization awareness
• D = data center
• R = rack
• NG = node group
• N = node

Topology levels, from root to leaf:
/
D1 D2
R1 R2 R3 R4
NG1 NG2 NG3 NG4 NG5 NG6 NG7 NG8
N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 N11 N12 N13
6. "Virtualization Aware" Replica Placement Policy
Updated policies:
• No two replicas are placed on the same node or on nodes under the same node group
• 1st replica is on the local node, or on one of the nodes under the same node group as the writer
• 2nd replica is on a rack remote from the 1st replica
• 3rd replica is on the same rack as the 2nd replica
• Remaining replicas are placed randomly across racks while still meeting the minimum restrictions
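
The first three placement rules can be sketched in a few lines. This is only an illustration, not Hadoop's actual placement code: it assumes topology paths of the hypothetical form /datacenter/rack/nodegroup/node, and the function names are made up for the example.

```python
import random


def node_group(path):
    """Everything above the node, e.g. '/d1/r1/ng1' (assumed path layout)."""
    return path.rsplit("/", 1)[0]


def rack(path):
    """Everything above the node group, e.g. '/d1/r1' (assumed path layout)."""
    return path.rsplit("/", 2)[0]


def choose_three_replicas(writer, nodes, rng=None):
    """Pick three targets following the rules for the 1st, 2nd and 3rd replica.

    Hypothetical helper: raises IndexError if the cluster is too small to
    satisfy a rule, where a real policy would fall back to relaxed choices.
    """
    rng = rng or random.Random()
    # 1st replica: the writer's node, else a node in the writer's node group.
    local = [n for n in nodes if node_group(n) == node_group(writer)]
    first = writer if writer in nodes else rng.choice(local or nodes)
    # 2nd replica: any node on a rack other than the first replica's rack.
    second = rng.choice([n for n in nodes if rack(n) != rack(first)])
    # 3rd replica: same rack as the 2nd, but in a different node group.
    third = rng.choice([
        n for n in nodes
        if rack(n) == rack(second) and node_group(n) != node_group(second)
    ])
    return [first, second, third]


nodes = [
    "/d1/r1/ng1/n1", "/d1/r1/ng1/n2", "/d1/r1/ng2/n3",
    "/d1/r2/ng3/n4", "/d1/r2/ng3/n5", "/d1/r2/ng4/n6",
]
targets = choose_three_replicas("/d1/r1/ng1/n1", nodes)
```

Because the 2nd and 3rd replicas sit on a different rack from the 1st, and the 3rd is forced into a different node group from the 2nd, the three targets always land in three distinct node groups.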
7. "Virtualization Aware" Replica Choosing Policy
Distances for data locality:
• Node local (0)
• Node group local (2)
• Rack local (4)
• Off rack (6)
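
These values are what you get by counting the hops from each node up to their closest common ancestor in the four-layer tree. A minimal sketch, assuming topology paths of the hypothetical form /datacenter/rack/nodegroup/node; the function name is illustrative and is not Hadoop's API.

```python
def topology_distance(path_a, path_b):
    """Sum of hops from both nodes up to their closest common ancestor."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    if len(a) != len(b):
        raise ValueError("paths must have the same depth")
    # Count the leading path components the two nodes share.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Each node climbs (depth - common) levels to reach the shared ancestor.
    return 2 * (len(a) - common)


# A reader picks the replica with the smallest distance.
reader = "/d1/r1/ng1/n1"
replicas = ["/d1/r2/ng3/n5", "/d1/r1/ng2/n3", "/d1/r1/ng1/n2"]
closest = min(replicas, key=lambda r: topology_distance(reader, r))
```

Here the three replicas are off rack (6), rack local (4) and node group local (2) relative to the reader, so the node-group-local copy is chosen.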
8. "Virtualization Aware" Balancer Policy
• The balancer applies a two-level choosing policy:
- Choosing node pairs of source and target, in this order: local node group, local rack, off rack
- Choosing blocks to move within a node pair: a replica block is not a good candidate if another replica is on the target node or in the same node group as the target node
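
Both levels of the balancer policy can be sketched as two small functions. This is an illustration under assumptions, not the balancer's real code: paths use the hypothetical form /datacenter/rack/nodegroup/node, and all function names are invented for the example.

```python
def node_group(path):
    """Everything above the node, e.g. '/d1/r1/ng1' (assumed path layout)."""
    return path.rsplit("/", 1)[0]


def rack(path):
    """Everything above the node group, e.g. '/d1/r1' (assumed path layout)."""
    return path.rsplit("/", 2)[0]


def pair_priority(source, target):
    """0 = same node group, 1 = same rack, 2 = off rack."""
    if node_group(source) == node_group(target):
        return 0
    if rack(source) == rack(target):
        return 1
    return 2


def order_targets(source, targets):
    """Level 1: try the closest targets first."""
    return sorted(targets, key=lambda t: pair_priority(source, t))


def is_good_candidate(replica_locations, source, target):
    """Level 2: reject the move if another replica of the block is already
    on the target node or in the target's node group."""
    others = [loc for loc in replica_locations if loc != source]
    # Same node implies same node group, so one check covers both cases.
    return all(node_group(loc) != node_group(target) for loc in others)


source = "/d1/r1/ng1/n1"
ordered = order_targets(
    source, ["/d1/r2/ng3/n4", "/d1/r1/ng2/n3", "/d1/r1/ng1/n2"]
)
block = ["/d1/r1/ng1/n1", "/d1/r2/ng3/n4", "/d1/r2/ng4/n6"]
blocked = is_good_candidate(block, source, "/d1/r2/ng3/n5")
allowed = is_good_candidate(block, source, "/d1/r1/ng1/n2")
```

Moving the block to n5 is rejected because another replica already lives in node group ng3, while moving it within the source's own node group keeps the replicas spread out.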
9. "Virtualization Aware" Task Scheduling Policy
Get a task split for the TaskTracker or NodeManager in the following order:
• Node local
• Node group local
• Rack local
• Off rack
It works well with:
• FifoScheduler
• FairScheduler
• CapacityScheduler
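
The locality order above can be sketched as a simple ranking over pending splits. This is a rough illustration, not the schedulers' real logic: it assumes topology paths of the hypothetical form /datacenter/rack/nodegroup/node, and the names `locality_level` and `pick_split` are invented for the example.

```python
LOCALITY_LEVELS = ["node local", "node group local", "rack local", "off rack"]


def node_group(path):
    """Everything above the node, e.g. '/d1/r1/ng1' (assumed path layout)."""
    return path.rsplit("/", 1)[0]


def rack(path):
    """Everything above the node group, e.g. '/d1/r1' (assumed path layout)."""
    return path.rsplit("/", 2)[0]


def locality_level(worker, replica):
    """Index into LOCALITY_LEVELS; lower means closer to the worker."""
    if worker == replica:
        return 0
    if node_group(worker) == node_group(replica):
        return 1
    if rack(worker) == rack(replica):
        return 2
    return 3


def pick_split(worker, pending_splits):
    """Return (split_id, level) for the split with the best locality,
    or None when nothing is pending."""
    best = None
    for split_id, replicas in pending_splits.items():
        # A split is as local as its closest replica.
        level = min((locality_level(worker, r) for r in replicas), default=3)
        if best is None or level < best[1]:
            best = (split_id, level)
    return best


# A compute node (c1) sharing a node group with data node n1.
worker = "/d1/r1/ng1/c1"
pending = {
    "split-a": ["/d1/r2/ng3/n4"],
    "split-b": ["/d1/r1/ng2/n3", "/d1/r2/ng4/n6"],
    "split-c": ["/d1/r1/ng1/n1"],
}
choice = pick_split(worker, pending)
```

In this example the compute node never holds data itself, so no split is node local; the node group level is what lets it still pick split-c, whose data sits on the same node group.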