Hadoop Virtualization Extensions

Junping Du


Sr. MTS, VMware, Inc.




© 2009 VMware Inc. All rights reserved
Project HVE (Hadoop Virtualization Extensions)


• Refine Hadoop for running on virtualized infrastructure
    • Enable a multiple-layer network topology
    • Enable resource sharing
    • Enable compute/data node separation without losing locality


• Patches are contributed back to the Apache Hadoop community
    • http://www.vmware.com/hadoop
    • Umbrella JIRA: HADOOP-8468
    • Sub-JIRAs: HADOOP-8469, HADOOP-8470, HADOOP-8817, HDFS-3495,
                 HDFS-3498, HDFS-3461, MAPREDUCE-4660, YARN-18, etc.




Current Network Topology

                                /

                D1                             D2

      R1              R2                 R3          R4

 H1   H2   H3    H4   H5   H6       H7   H8   H9 H10 H11 H12

• D = data center
• R = rack
• H = host

However, you have more choices on virtualized infrastructure:

• C = compute node (TaskTracker)
• D = data node
High-Level View of HVE Changes




Additional network topology layer for virtualization awareness

                                              /

                       D1                                          D2

           R1                     R2                   R3                     R4

     NG1         NG2        NG3         NG4       NG5        NG6        NG7        NG8

N1   N2     N3     N4       N5     N6     N7      N8        N9   N10    N11    N12    N13

• D = data center
• R = rack
• NG = node group
• N = node
“Virtualization Aware” Replica Placement Policy


Updated policies:
• No two replicas are placed on the same node or on nodes under
  the same node group
• The 1st replica is on the local node or on one of the nodes under
  the same node group as the writer
• The 2nd replica is on a rack remote from the 1st replica
• The 3rd replica is on the same rack as the 2nd replica
• Remaining replicas are placed randomly across racks while still
  meeting the minimum restrictions
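
The placement rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the HVE patches: the `Node` record, the `group` helper, and `choose_targets` are hypothetical names, and a node group is identified here by its (rack, node group) pair.

```python
import random
from typing import List, NamedTuple, Optional


class Node(NamedTuple):
    name: str
    rack: str
    node_group: str  # hypothetical label for the group of VMs sharing a host


def group(n: Node):
    """Identify a node group by its rack plus its node-group label."""
    return (n.rack, n.node_group)


def choose_targets(writer: Node, nodes: List[Node], replication: int = 3,
                   rng: Optional[random.Random] = None) -> List[Node]:
    """Pick replica targets in the order the slide lists (illustrative only)."""
    if rng is None:
        rng = random.Random()
    chosen: List[Node] = []

    def candidates(extra=lambda n: True) -> List[Node]:
        # A node group that already holds a replica is never reused;
        # this also rules out reusing the same node.
        used = {group(n) for n in chosen}
        return [n for n in nodes if group(n) not in used and extra(n)]

    preferences = [
        # 1st replica: the writer's node, else a node in the writer's node group.
        lambda: candidates(lambda n: n == writer)
        or candidates(lambda n: group(n) == group(writer)),
        # 2nd replica: a rack other than the 1st replica's rack.
        lambda: candidates(lambda n: n.rack != chosen[0].rack),
        # 3rd replica: the same rack as the 2nd replica.
        lambda: candidates(lambda n: n.rack == chosen[1].rack),
    ]
    for i in range(replication):
        preferred = preferences[i]() if i < len(preferences) else []
        pool = preferred or candidates()  # fall back to any legal node
        if not pool:
            break  # no node group left without a replica
        chosen.append(rng.choice(pool))
    return chosen


# Two racks, two node groups per rack, two nodes per node group.
cluster = [Node(f"n{i + 1}", f"r{i // 4 + 1}", f"ng{i // 2 + 1}") for i in range(8)]
targets = choose_targets(cluster[0], cluster, 3, random.Random(0))
```

In this sketch the fallback to "any legal node" is a simplification: when a preferred pool is empty, the only constraint kept is that no node group holds two replicas.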



“Virtualization Aware” Replica Choosing Policy


Distances for data locality:
• Node local (0)
• Node-group local (2)
• Rack local (4)
• Off rack (6)
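
The distances above are consistent with counting the hops from each node up to the closest common ancestor in the topology tree. The sketch below assumes a hypothetical three-level path of the form `/rack/nodegroup/node`; the path layout and the `distance` helper are assumptions for illustration, not the Hadoop API.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Sum of hops from a and from b up to their closest common ancestor."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    if len(pa) != len(pb):
        raise ValueError("paths must have the same depth")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_by_locality(reader: str, replicas: List[str]) -> List[str]:
    """Order replica locations so the closest one is tried first."""
    return sorted(replicas, key=lambda r: distance(reader, r))


reader = "/r1/ng1/n1"
print(distance(reader, "/r1/ng1/n1"))  # 0: node local
print(distance(reader, "/r1/ng1/n2"))  # 2: node-group local
print(distance(reader, "/r1/ng2/n3"))  # 4: rack local
print(distance(reader, "/r2/ng3/n5"))  # 6: off rack
```

With these paths, `sort_by_locality(reader, ["/r2/ng3/n5", "/r1/ng2/n3", "/r1/ng1/n2"])` returns the node-group-local replica first, then the rack-local one, then the off-rack one.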




“Virtualization Aware” Balancer Policy


• The balancer policy makes its choice at two levels:
   - Choosing node pairs of source and target, in this order:
     local node group, local rack, off rack
   - Choosing blocks to move within a node pair: a replica block is
     not a good candidate if another replica is already on the
     target node or in the same node group as the target node
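
Both levels of the balancer choice can be sketched as two small functions. All names here (`Node`, `locality_tier`, `order_targets`, `is_good_candidate`) are hypothetical, and the sketch assumes that "another replica" means a replica other than the one being moved off the source.

```python
from typing import Iterable, List, NamedTuple


class Node(NamedTuple):
    name: str
    rack: str
    node_group: str


def locality_tier(source: Node, target: Node) -> int:
    """0 = same node group, 1 = same rack, 2 = off rack."""
    if (source.rack, source.node_group) == (target.rack, target.node_group):
        return 0
    if source.rack == target.rack:
        return 1
    return 2


def order_targets(source: Node, targets: List[Node]) -> List[Node]:
    """Level 1: prefer targets in the local node group, then rack, then off rack."""
    return sorted(targets, key=lambda t: locality_tier(source, t))


def is_good_candidate(other_replica_holders: Iterable[Node], target: Node) -> bool:
    """Level 2: reject a block if another replica already sits on the target
    node or in its node group (same node implies same node group)."""
    return all((h.rack, h.node_group) != (target.rack, target.node_group)
               for h in other_replica_holders)


n1, n2 = Node("n1", "r1", "ng1"), Node("n2", "r1", "ng1")
n3 = Node("n3", "r1", "ng2")
n5, n6 = Node("n5", "r2", "ng3"), Node("n6", "r2", "ng3")

ordered = order_targets(n1, [n5, n3, n2])  # n2, then n3, then n5
```

For a block whose other replica lives on `n6`, `is_good_candidate([n6], n5)` is false because `n5` shares a node group with `n6`, while `is_good_candidate([n6], n2)` is true.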




“Virtualization Aware” Task Scheduling Policy


Get a task split for the TaskTracker or NodeManager in the
following order:
• Node local
• Node-group local
• Rack local
• Off rack

It works well with:
• FifoScheduler
• FairScheduler
• CapacityScheduler
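
The locality order above can be sketched as a split picker. This is a simplified illustration, not how any of the listed schedulers is implemented: the `(rack, node_group, node)` tuple encoding, `locality_level`, and `pick_split` are assumed names.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical encoding of a location: (rack, node_group, node).
Location = Tuple[str, str, str]

LEVELS = ["node local", "node-group local", "rack local", "off rack"]


def locality_level(worker: Location, replica: Location) -> int:
    """Index into LEVELS; a lower index means better locality."""
    if worker == replica:
        return 0
    if worker[:2] == replica[:2]:
        return 1
    if worker[0] == replica[0]:
        return 2
    return 3


def pick_split(worker: Location,
               pending: Dict[str, List[Location]]) -> Optional[Tuple[str, str]]:
    """Return (split id, locality label) for the best pending split, if any."""
    best = None
    for split_id, replicas in pending.items():
        level = min((locality_level(worker, r) for r in replicas), default=3)
        if best is None or level < best[1]:
            best = (split_id, level)
    if best is None:
        return None
    return best[0], LEVELS[best[1]]


worker = ("r1", "ng1", "n1")
pending = {
    "s1": [("r2", "ng3", "n5")],
    "s2": [("r1", "ng1", "n2"), ("r2", "ng4", "n7")],
    "s3": [("r1", "ng2", "n3")],
}
print(pick_split(worker, pending))  # ('s2', 'node-group local')
```

The node-group level is what matters when compute and data nodes run as separate VMs: a compute node never finds a node-local split, but it can still find one held by a data node in its own node group.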




HVE Effects on Reliability and Performance




Summary

• Hadoop Virtualization Extensions
  • Network topology with an additional layer
  • Replica placement/removal/choosing policy extensions
  • Balancer policy extension
  • Task scheduling policy extension
• HVE effects
  • Reliability: multiple DN VMs per host
  • Performance: DN/CN separation case




References

• Hadoop at VMware
  • www.vmware.com/hadoop
• Project Serengeti
  • projectserengeti.org
• Umbrella JIRA for HVE
  • https://issues.apache.org/jira/browse/HADOOP-8468
• Hadoop on vSphere
  • Talks @ Hadoop World, Hadoop Summit
  • White papers
• Spring for Apache Hadoop
  • http://blog.springsource.org/2012/02/29/introducing-spring-hadoop




Q&A

     Thank you!



