Bandwidth, Throughput, IOPS, and FLOPS: The Delicate Balance
Bill Menger, Dan Wieder, Dave Glover
ConocoPhillips
March 5, 2009
Outline
  The Problem: Provide enough computer to keep your customers happy (at low cost)
  The Solution: Estimating needs and looking at options
  The Kickoff: Never exactly works like you thought
  The Tuning: Bring it into the groove one way or the other
  The Task: Get 'er done!
  The Finale: Evaluating the good, the bad, the ugly, and the beautiful
The Problem
  We all have tasks to do in not enough time
  Solve your customers' problems with an eye to the future
  Don't box yourself into a corner (if you can help it)
Dave's Definition of a Supercomputer
  "A supercomputer is a computer that is only one order of magnitude slower than what scientists and engineers need to solve their current problems. This definition is timeless and assures me job security." -- Dave Glover
Solution: Task-Driven Computer Design
  Get a bunch of Geophysicists in a room and ask:
    What is the overall project you need to accomplish?
    How many tasks (jobs) must be run in order to complete it?
    How long does each job take on architecture a, b, c?
  Then buy nnn of architecture xxx, back your timeline up so you can complete on time, and begin! (Oh, if it were that easy.)
  The FLOPS: We always start here. How much horsepower is needed to do the job?
  The Bandwidth: Seems like this is the next step. MPI needs a good backbone!
  The Throughput: Always want a disk system to jump in with all cylinders firing.
  The IOPS/sec: Never really thought about this before now. Maybe I should have!
Task-Driven Design: Case Study
  We get what we need: no more, no less
  The Need:
    2 iterations
    30,000 jobs/iteration
    ~200 cpu-hours/job
    12MM cpu-hours in 50 days
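A back-of-the-envelope check of those figures (a sketch only; every number here is taken from the slide above):

```python
# Back-of-the-envelope check of the case-study numbers (all figures from the slide).
iterations = 2
jobs_per_iteration = 30_000
cpu_hours_per_job = 200            # approximate
deadline_days = 50

total_cpu_hours = iterations * jobs_per_iteration * cpu_hours_per_job
print(f"{total_cpu_hours:,} cpu-hours")      # 12,000,000 -- the "12MM" on the slide

sustained_cores = total_cpu_hours / (deadline_days * 24)
print(f"~{sustained_cores:,.0f} cores busy around the clock")   # ~10,000 cores
```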
Task-Driven Design: Case Study
  A solution:
    Run each job on 8 cores to get jobs down to 1 day
    Make it easy to load/unload data for jobs with a global parallel file system
    Put enough memory into each computer to hold the job
    Keep enough local disk to store intermediate results
    Buy enough computers to make it happen
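A minimal sketch of how that solution turns into a node count; the one-job-per-node and 25-days-per-iteration assumptions are mine, not stated on the slide:

```python
# Rough node count implied by the solution (assumptions: one 8-core job per
# dual-socket quad-core node, and roughly half the 50-day window per iteration).
jobs_per_iteration = 30_000
days_per_job = 1                   # ~200 cpu-hours / 8 cores ~= 25 hours ~= 1 day
days_per_iteration = 25            # two iterations in the 50-day window

concurrent_jobs = jobs_per_iteration * days_per_job / days_per_iteration
print(f"~{concurrent_jobs:,.0f} jobs in flight")   # ~1200 jobs, i.e. ~1200 nodes
                                                   # running one 8-core job each,
                                                   # close to the 1250 nodes bought
```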
Options
  Intel / AMD / Cell
  InfiniBand / GigE / 10GigE / Myrinet / Special
  NFS disk / directly connected disk (local client software)
  Specialty hardware
Considerations
  Fit within current network infrastructure
  Think about my Reservoir Engineer and CFD customers
  What about future algorithms/approaches?
The Chosen Design
  1250 dual-socket, quad-core AMD Opterons
  10GigE card in each node with iWARP
  Global parallel direct-connected disk system (client modules on each node)
  10GigE interconnect everywhere
Ideal Network Configuration (diagram: compute nodes on edge switches, a 1Gbit switch for miscellaneous traffic, storage shelves, two core switches, and four 10Gbit central switches)
The Chosen Components
  1250 Rackable nodes, dual-socket quad-core AMD Opteron 2.3 GHz, 32 GBytes DRAM
  1250 NetEffect (Intel) 10Gbit cards with iWARP low-latency software
  42 Force-10 5210 10Gbit 48-port edge switches
  2 Foundry MLX-8 core switches (16 10Gbit ports each)
  4 Foundry MLX-32 core switches (128 10Gbit ports each)
  16 shelves of Panasas storage
The Specs
  160 Terabytes of usable storage
  160 Gbit/s connectivity to storage
  2.5 Gbit full bisection bandwidth
  <10 usec latency
  46 Teraflops estimated capability
  400 kW total power draw when running Linpack
  24 racks including management nodes, switches, and storage
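As a quick sanity check on the 46 Teraflops figure (a sketch only; the flops-per-cycle value is my assumption about this Opteron generation, not something from the slide):

```python
# Sanity check of the quoted 46 Teraflops; flops_per_cycle is an assumption
# about these quad-core Opterons, not a figure from the slide.
nodes = 1250
cores_per_node = 8
clock_hz = 2.3e9
flops_per_cycle = 4                # typical double-precision rate for that generation

peak_tflops = nodes * cores_per_node * clock_hz * flops_per_cycle / 1e12
print(f"~{peak_tflops:.0f} TFLOPS theoretical peak")   # ~92 TFLOPS; the quoted 46
                                                       # is roughly half of peak, a
                                                       # plausible sustained estimate
```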
The Kickoff
  Jobs ran (as modified) in 12 hours. Good!
  10Gig network cards were not available for all nodes, so we used 1Gig to the nodes, 10Gig from the edge switches to the core, and 10Gig to the disks.
  One data space on disk (one huge file system)
A small gotcha
  Commercial processing software:
    logs back to a common area
    lots of small-file initialization for each job
    always initiates MPI
    so many jobs hitting users' login files all at once
  Output file format (JavaSeis):
    nice structured sparse format, good I/O for the actual data
    very small I/O to the common area associated with file updates
14:00-22:00 Disk MBytes/sec (chart)
14:00-22:00 Disk I/O Ops/sec (chart)
16:30-17:00 Disk MBytes/sec (chart)
16:30-17:00 Disk I/O Ops/sec (chart)
Consequences
  Start up 1000 jobs in the commercial system and watch it take 5 seconds per job to enter the queue. That's 1.4 hours!
  Jobs all reading at once: lots of I/O contention on /home and another directory
  On completion, jobs all writing small-block I/O in the common area at the same time (which requires a read-modify-write for each operation)
Getting into the Groove
  First, we decided to stagger jobs. The auto-job builder adds a delay within each job, so the jobs start and sleep for linearly increasing times, producing a nice, steady I/O pattern. Subsequent jobs are simply queued, and the staggered effect persists.
  Second, we added a smaller data directory to house the common area. Now we have two file systems: one for large-block I/O and one for small.
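A minimal sketch of the staggered start, assuming a per-job index is available as an environment variable (the names and the 5-second step here are hypothetical; the production delay was injected by the auto-job builder inside each job):

```python
# Sketch of the staggered-start idea described above (hypothetical names; the
# real delay was generated by the auto-job builder).
import os
import time

STAGGER_SECONDS = 5                                   # delay step between jobs
job_index = int(os.environ.get("JOB_INDEX", "0"))     # hypothetical job index

# Job 0 starts immediately, job 1 after 5 s, job 2 after 10 s, ... so the burst
# of small-file initialization I/O is spread out instead of hitting the file
# system all at once.
time.sleep(job_index * STAGGER_SECONDS)
# ... normal job initialization and MPI startup would follow here ...
```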
~1200 jobs, ~50,000 I/O Ops/sec (chart)
Execution
  Jobs ran well; I/O held to a maintainable level
  The 1Gbit network sufficed
  Finished ahead of schedule
Smooth Sailing!
The good, the bad, the ugly, and the beautiful
  Good: the job got done
  Bad: frantic startup problems with the commercial system and the disk solution
  Ugly: realizing we undersized the IOPS
  Beautiful: the system is ready for more work, with a network core that can be upgraded going forward
Lessons Learned
  Don't scrimp on ANY of the following:
    Network infrastructure (10+ Gbit/s, <10 usec latency)
    I/O bandwidth (>2500 MBytes/sec)
    Memory (>= 4 GBytes/core)
    I/O ops/second (>100,000 IOPS)
  Vendor software is out of your control (deal with it)