RNP
Brazilian National Education and Research Network


ICT Support for Large-scale Science
5th J-PAS collaboration meeting / 11-Sep-2012


Leandro N. Ciuffo             Alex S. Moura
leandro.ciuffo@rnp.br         alex.moura@rnp.br


About RNP
- Qualified as a non-profit Social Organization (OS)
- Maintained by federal public resources: the Government budget includes items to cover network costs and also RNP operating costs
- Supported by MCTI, MEC and now MinC (Culture)
- Monitored by MCTI, CGU and TCU
- Additional projects supported by sectorial funds, directly or through the management contract, plus MS (Health)

(Diagram: research units of the Ministry of S&T served by RNP, including INPE - Instituto Nacional de Pesquisas Espaciais, LNCC - Laboratório Nacional de Computação Científica and INPA - Instituto Nacional de Pesquisas da Amazônia)

http://www.mct.gov.br/index.php/content/view/741.html
Build and operate R&E networks
- Maintenance and continued renewal of infrastructure
- The RNP backbone of 2000 has been renewed 3 times (2004, 2005 and 2011), with large increases in maximum link capacity, from 25 Mbps to 10 Gbps (a factor of 400)
- Metro networks have been built in capital cities to provide access to the Point of Presence (PoP) at 1 Gbps or more (www.redecomep.rnp.br)
- International capacity has increased from 300 Mbps in 2002 to over 20 Gbps since 2009 (a factor of 70). RNP has also played a major role in building the RedCLARA (Latin American regional) network, linking R&E networks from more than 12 countries (www.redclara.net)
- Testbed networks for network experimentation, especially project GIGA (with CPqD) since 2003 and the EU-BR FIBRE project (2011-2014)
Ipê Network (RNP Backbone)

(Map: backbone PoPs including Boa Vista, Macapá, Manaus, Fortaleza, Salvador, Brasília, São Paulo, Rio de Janeiro, Florianópolis and Porto Alegre; external links to the Commodity Internet, RedCLARA (to Europe) and Americas Light (to USA))

Bandwidth
- Minimum: 20 Mbps
- Maximum: 10 Gbps
- Aggregated: 250 Gbps

http://www.rnp.br/backbone/
MCTI Roadmap
2012-2015




            http://www.mcti.gov.br/index.php/content/view/335668.html
Metro Networks
- 23 cities operational
- 6 under deployment
- 13 planned
- 1,980 km

(Diagram: Institutions A, B and C connect through a metro network to the RNP PoP, and from there to the IP network)

http://redecomep.rnp.br
(Metro network maps: São Paulo and Florianópolis)
Why is your data-driven research relevant to RNP?
Science paradigms evolution

- Data-intensive research: unifying theory, experiment and simulation at scale ("big data")
- Computational simulations: simulating complex phenomena ("in silico")
- Theoretical modeling: e.g. Kepler's and Newton's laws
- Empirical science: describing natural phenomena
Key components of a new research infrastructure

(Diagram: instruments and local services/workflows feed publishers, which publish to data repositories; registering, harvesting & indexing and discovering link the repositories to a scientific portal, through which users reach Big Data processing and information visualization)
Network Requirements Workshop

1. What science is being done?
2. What instruments and facilities are used?
3. What is the process/workflow of science?
4. How are the use of instruments and facilities, the process of
   science, and other aspects of the science going to change
   over the next 5 years?
5. What is coming beyond 5 years out?
6. Are new instruments or facilities being built, or are there
   other significant changes coming?
J-PAS data transfer requirements (to be validated)

(Diagram: T80S telescope, 35 TB/y of raw data; ~270 TB/y of images to CSIC)
International collaboration
Academic Networks Worldwide
RedCLARA Logical Topology - August 2012

(Map: link capacities from 622 Mbps up to 10 Gbps, including 1 Gbps and 2.5 Gbps segments; Brazilian connection at SP, with POA foreseen)

http://www.redclara.net
Hybrid Networks

- Since the beginning of the Internet, NRENs have provided the routed IP service
- Around 2002, NRENs began to provide two network services:
  - routed IP (traditional Internet)
  - end-to-end virtual circuits (a.k.a. lightpaths)
    The lightpath service is intended for users with high QoS needs, usually guaranteed bandwidth, implemented by segregating their traffic from the general routed IP traffic
- The GLIF organisation (www.glif.is) coordinates international lightpaths
High bandwidth research connectivity
       (lightpaths for supporting international collaboration)




GLIF world map, 2011                                http://www.glif.is
GLIF links in South America

RNP networks:
- Ipê backbone (29,000 km)
- metro networks in state capitals
- GIGA optical testbed, from RNP and CPqD, linking 20 research institutions in 7 cities (750 km)
- KyaTera research network in S. Paulo, linking research institutions in 11 cities (1,500 km)
Examples of use of international
lightpaths
Why is all this relevant to you?
Why?
- R&E networks in Brazil, and especially RNP, are funded by government agencies to provide quality network services to the national R&E community
- In most cases, this is handled by providing R&E institutions with a connection to our networks, which operate standard Internet services of good quality
- However, there are times when this is not enough
Network Requirements and Expectations
- Expected data transfer rates: as a first step in improving your network performance, it is critical to have a baseline understanding of what speed you should expect from your network connection under ideal conditions.
- The following shows roughly how long it takes to transfer 1 Terabyte of data across networks of various speeds, assuming realistic throughput somewhat below the nominal line rate (a worked line-rate calculation follows below):
  - 10 Mbps network: 300h (12.5 days)
  - 100 Mbps network: 30h
  - 1 Gbps network: 3h
  - 10 Gbps network: 20min
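For comparison, the ideal line-rate time is easy to compute; a minimal sketch in shell, assuming 1 TB = 8 x 10^12 bits and no protocol or disk overhead (real transfers run slower, which is why the table above shows 3h rather than ~2.2h at 1 Gbps):

  # ideal seconds to move 1 TB over a 1 Gbps link: 8e12 bits / 1e9 bits per second
  echo $(( 8 * 10**12 / 10**9 ))   # prints 8000, i.e. about 2.2 hours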
Network Requirements and
Expectations




    http://fasterdata.es.net/fasterdata-home/requirements-and-expectations/
Inadequate performance for critical applications
- In some cases, the standard Internet services are not good enough for high-performance or data-intensive projects.
- Such projects are sensitive to perturbations caused by security devices:
  - numerous cases of firewalls causing problems
  - often difficult to diagnose
  - router filters can often provide equivalent security without the performance impact
- Science and enterprise network requirements are in conflict
Recommended approaches

Remedies which can be applied
- Tuning of networking software is generally necessary on high-bandwidth, long-latency data connections, because of the peculiarities of TCP implementations
- In the case of high QoS requirements it is often necessary to use lightpaths, to avoid interference with cross traffic
- In many cases, both of these approaches are required
The Cipó Experimental Service
- We are now beginning to deploy dynamic circuits as an experimental service on our network
  - This will also interoperate with similar services in other networks.
Getting support
- If you need advice or assistance with these network problems, it is important to get in touch with network support:
  1. At your own institution
  2. At your state network provider (www.rnp.br/pops/index.php)
  3. In the case of specific circuit (lightpath) services, you may contact RNP directly at pd@rnp.br
RNP Website: Backbone Operations
http://www.rnp.br/ceo/

Tools to help verify network performance statistics
RNP Backbone Statistics
Network Panorama




                     http://www.rnp.br/ceo/
RNP Backbone Statistics




                     http://www.rnp.br/ceo/
Network Diagnostic Tool (NDT)
- Test your bandwidth from your computer to the RNP's PoP:
  - São Paulo: http://ndt.pop-sp.rnp.br
  - Rio de Janeiro: http://ndt.pop-rj.rnp.br
  - Florianópolis: http://ndt.pop-sc.rnp.br
Recommended Approach
- On a high-speed network it takes less time to transfer 1 Terabyte of data than one might expect.
- It is usually sub-optimal to try to get 900 megabits per second of throughput on a 1 gigabit per second network path in order to move one or two terabytes of data per day. The disk subsystem can also be a bottleneck: simple storage systems often have trouble filling a 1 gigabit per second pipe.
- In general it is not a good idea to try to completely saturate the network, as you will likely end up causing problems for both yourself and others trying to use the same link. A good rule of thumb is that for periodic transfers it should be straightforward to get throughput equivalent to 1/4 to 1/3 of a shared path that has nominal background load.
- For example, if you know your receiving host is connected to 1 Gbps Ethernet, then a target speed of 150-200 Mbps is reasonable. You can adjust the number of parallel streams (as described on the tools page) to achieve this; see the sketch after this list.
- Many labs and large universities are connected at speeds of at least 1 Gbps, and most LANs are at least 100 Mbps, so if you don't get at least…
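One way to find a suitable stream count is to measure with iperf; a minimal sketch, assuming iperf is installed at both ends and that you control a remote test host (the host name below is a placeholder):

  # on the remote host: start an iperf server
  iperf -s
  # on the local host: run a 30-second test with 4 parallel TCP streams
  iperf -c remote.example.org -t 30 -P 4

Raise or lower -P until the aggregate throughput sits near your target (e.g. 150-200 Mbps on a 1 Gbps path).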
Performance using TCP
- There are 3 important variables (among others) that affect TCP performance: packet loss, latency (or RTT - Round Trip Time), and buffer size/window. All are interrelated.
- The optimal buffer size is twice the bandwidth x delay product of the link/connection; since the RTT is twice the one-way delay, this gives:
  buffer size = bandwidth x RTT
- e.g.: if ping reports 50 ms and the end-to-end network is all 1G or 10G Ethernet, the TCP receive buffer (an operating system parameter) should be:
  0.05 s x (1 Gbit/s / 8 bits per byte) = 6.25 MBytes
  The sketch below shows how to raise this limit on Linux.
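A minimal sketch of raising the buffer caps for the 6.25 MByte case above (values in bytes, run as root; these are the same sysctl keys used in the DTN tuning section later):

  # allow socket buffers up to 8 MB, comfortably above the 6.25 MB computed above
  sysctl -w net.core.rmem_max=8388608
  sysctl -w net.core.wmem_max=8388608
  # min / default / max for the TCP receive and send buffers
  sysctl -w net.ipv4.tcp_rmem="4096 87380 8388608"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 8388608"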
TCP Congestion Avoidance Algorithms
- The TCP reno congestion avoidance algorithm was the default in all TCP implementations for many years. However, as networks got faster and faster it became clear that reno would not work well for high bandwidth-delay product networks. To address this, a number of new congestion avoidance algorithms were developed, including:
  - reno: traditional TCP, used by almost all other operating systems (default)
  - cubic: CUBIC-TCP
  - bic: BIC-TCP
  - htcp: Hamilton TCP
  - vegas: TCP Vegas
  - westwood: optimized for lossy networks
- Most Linux distributions now use cubic by default, and Windows now uses Compound TCP. If you are using an older version of Linux, be sure to change the default from reno to cubic or htcp (see the sketch below).
- More details can be found at:
  http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm
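On Linux the algorithm in use can be checked and switched at run time; a minimal sketch (these module and sysctl names are standard in mainline kernels, but which algorithms are compiled in varies by distribution):

  # show the current and the available congestion control algorithms
  sysctl net.ipv4.tcp_congestion_control
  sysctl net.ipv4.tcp_available_congestion_control
  # load the H-TCP module and make it the default
  modprobe tcp_htcp
  sysctl -w net.ipv4.tcp_congestion_control=htcp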
MTU Issues
- Jumbo Ethernet frames can increase performance by a factor of 2-4.
- The ping tool can be used to verify the MTU size. For example, on Linux you can do:
  ping -s 8972 -M do -c 4 10.200.200.12
  (-s 8972 sends a payload that, together with 28 bytes of IP and ICMP headers, exactly fills a 9000-byte jumbo frame; -M do forbids fragmentation)
- Other tools that can help verify the MTU size are scamper and tracepath (see the sketch below)
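tracepath is particularly handy because it reports the discovered path MTU hop by hop, showing where jumbo frames stop being honored; a minimal sketch against the same example address as above:

  # prints per-hop latency and the discovered path MTU (pmtu)
  tracepath 10.200.200.12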
Say No to scp: Why you should avoid scp over a WAN
- In a Unix environment scp, sftp, and rsync are commonly used to copy data between hosts.
- While these tools work fine in a local environment, they perform poorly on a WAN.
- The openssh versions of scp and sftp have a built-in 1 MB buffer (previously only 64 KB in openssh older than version 4.7) that severely limits performance on a WAN.
- rsync is not part of the openssh distribution, but typically uses ssh as transport (and is subject to the limitations imposed by the underlying ssh implementation).
- DO NOT USE THESE TOOLS if you need to transfer large data sets across a network path with an RTT of more than around 25 ms.
- More information is available at http://fasterdata.es.net/.
Why you should avoid scp over a WAN (cont.)
- The following results are typical: scp is 10x slower than single-stream GridFTP, and 50x slower than parallel GridFTP.
- Sample results: Berkeley, CA to Argonne, IL (near Chicago); RTT = 53 ms, network capacity = 10 Gbps. (A sketch of a parallel GridFTP transfer follows.)
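A minimal sketch of a multi-stream transfer with globus-url-copy, assuming the Globus Toolkit is installed, credentials are in place, and the endpoint name is a placeholder (-vb prints throughput, -p sets the number of parallel streams, -tcp-bs the TCP buffer size in bytes):

  # pull a large file over 4 parallel TCP streams with 8 MB buffers
  globus-url-copy -vb -p 4 -tcp-bs 8388608 \
      gsiftp://dtn.example.org/data/bigfile.tar file:///tmp/bigfile.tar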
Demilitarizing your network
        for science

A Simple Science DMZ
- A simple Science DMZ has several essential components. These include dedicated access to high-performance wide area networks and advanced services infrastructures, high-performance network equipment, and dedicated science resources such as Data Transfer Nodes.

(Diagram: a simple Science DMZ, showing these components and the data paths between them)
Science DMZ: Supercomputer Center
- The diagram illustrates a simplified supercomputer center network. While it may not look much like the previous simple Science DMZ diagram, the same principles are used in its design.
Science DMZ: Big Data Site
- For sites that handle very large data volumes (e.g. for big experiments such as the LHC), individual data transfer nodes are not enough.
- Data transfer clusters are needed: groups of machines serve data from multi-petabyte data stores.
- The same principles of the Science DMZ apply: dedicated systems are used for data transfer, and the path to the wide area is clean, simple, and easy to troubleshoot. Test and measurement are integrated in multiple locations to enable fault isolation. This network is similar to the supercomputer center example in that the wide area data path covers the entire network front-end.
Data Transfer Node (DTN)
- Computer systems used for wide area data transfers perform far better if they are purpose-built and dedicated to the function of wide area data transfer. These systems, which we call Data Transfer Nodes (DTNs), are typically PC-based Linux servers built with high-quality components and configured specifically for wide area data transfer.
- ESnet has assembled a reference implementation of a host that can be deployed as a DTN or as a high-speed GridFTP test machine.
- The host can fill a 10Gbps network connection with disk-to-disk data transfers using GridFTP.
- The total cost of this server was around $10K, or $12.5K with the more expensive RAID controller. If your DTN is used only as a data cache using RAID0, instead of a reliable storage server using RAID5, you can get by with the less expensive RAID controller.
- Key aspects of the configuration include: recent version of the…
DTN Hardware Description
- Chassis: AC SuperMicro SM-936A-R1200B 3U 19" Rack Case with Dual 1200W PS
- Motherboard: SuperMicro X8DAH+F version 1.0c
- CPU: 2 x Intel Xeon Nehalem E5530 2.4GHz
- Memory: 6 x 4GB DDR3-1066MHz ECC/REG
- I/O Controller: 2 x 3ware SAS 9750SA-8i (about $600) or 3ware SAS 9750-24i4e (about $1500)
- Disks: 16 x Seagate 500GB SAS HDD 7,200 RPM ST3500620SS
- Network Controller: Myricom 10G-PCIE2-8B2-2S+E

Linux Distribution
- Most recent distribution of CentOS Linux
- Install 3ware driver: http://www.3ware.com/support/download.asp
- Install ext4 utilities: yum install e4fsprogs.x86_64
DTN Tuning
Add to /etc/sysctl.conf, then run sysctl -p:
  # standard TCP tuning for 10GE
  net.core.rmem_max = 33554432
  net.core.wmem_max = 33554432
  net.ipv4.tcp_rmem = 4096 87380 33554432
  net.ipv4.tcp_wmem = 4096 65536 33554432
  net.ipv4.tcp_no_metrics_save = 1
  net.core.netdev_max_backlog = 250000

Add to /etc/rc.local:
  # increase the size of data the kernel will read ahead (this favors sequential reads)
  /sbin/blockdev --setra 262144 /dev/sdb
  /sbin/blockdev --setra 262144 /dev/sdc
  /sbin/blockdev --setra 262144 /dev/sdd

  # increase txqueuelen
  /sbin/ifconfig eth2 txqueuelen 10000
  /sbin/ifconfig eth3 txqueuelen 10000

  # make sure cubic and htcp are loaded
  /sbin/modprobe tcp_htcp
  /sbin/modprobe tcp_cubic
  # set default to htcp
  /sbin/sysctl net.ipv4.tcp_congestion_control=htcp

  # with the Myricom 10G NIC, increasing interrupt coalescing helps a lot:
DTN Tuning (cont.)
Tools
- Install a data transfer tool such as GridFTP (see the GridFTP quick start page). Information on other tools can be found on the tools page.

Performance results for this configuration (back-to-back testing using GridFTP):
- memory to memory, 1 x 10GE NIC: 9.9 Gbps
- memory to memory, 4 x 10GE NICs: 38 Gbps
- disk to disk: 9.6 Gbps (1.2 GBytes/sec), using large files on all 3 disk partitions in parallel
References (1/3)
- TCP Performance Tuning for WAN Transfers - NASA HECC Knowledge Base
  http://www.nas.nasa.gov/hecc/support/kb/TCP-Performance-Tuning-for-WAN-Transfers_137.html
- Google's software-defined/OpenFlow backbone drives WAN links to 100 per cent utilization - Computerworld
  http://www.computerworld.com.au/article/427022/google_software-defined_openflow_backbone_drives_wan_l
- Achieving 98Gbps of Crosscountry TCP traffic using 2.5 hosts, 10 x 10G NICs, and 10 TCP streams
  http://www.internet2.edu/presentations/jt2012winter/20120125-Pouyoul-JT-lighting.pdf

Tutorials / Talks
- Achieving the Science DMZ: Eli Dart, Eric Pouyoul, Brian Tierney, and Joe Breen, Joint Techs, January 2012 (watch the webcast). Tutorial in 4 sections: Overview and Architecture, Building a Data Transfer Node, Bulk Data Transfer Tools and PerfSONAR, Case Study: University of Utah's Science DMZ
- How to Build a Low Cost Data Transfer Node: Eric Pouyoul, Brian Tierney and Eli Dart, Joint Techs, July 2011
- High Performance Bulk Data Transfer (includes TCP tuning tutorial): Brian Tierney and Joe Metzger, Joint Techs, July 2010
- Science Data Movement: Deployment of a Capability: Eli Dart, Joint Techs, January 2010
- Bulk Data Transfer Tutorial: Brian Tierney, September 2009
- Internet2 Performance Workshop, current slides
- SC06 Tutorial on high performance networking: Phil Dykstra, Nov 2006
References (2/3)
Papers
- O'Reilly ONLamp Article on TCP Tuning

Tuning
- PSC TCP performance tuning guide
- SARA Server Performance Tuning Guide

Troubleshooting
- Fermilab Network Troubleshooting Methodology
- Geant2 Network Tuning Knowledge Base

Network and OS Tuning
- Linux IP Tuning Info
- Linux TCP Tuning Info
- A Comparison of Alternative Transport Protocols
References (3/3)
Network performance measurement tools
- Convert Bytes/sec to bits/sec, etc.
- Measurement Lab Tools
- Speed Guide's performance tester and TCP analyzer (mostly useful for home users)
- ICSI's Netalyzr
- CAIDA Taxonomy
- SLAC Tool List
- iperf vs ttcp vs nuttcp comparison
- Sally Floyd's list of Bandwidth Estimation Tools
- Linux Foundation's TCP Testing Page

Others
- bufferbloat.net: site devoted to pointing out the problems with large network buffers on slower networks, such as homes or wireless.
Thank you / Obrigado!
- Leandro Ciuffo - leandro.ciuffo@rnp.br
- Alex Moura - alex@rnp.br
- Twitter: @RNP_pd

Editor's Notes
- #4: http://www.mct.gov.br/index.php/content/view/741.html
- #16: Examples: high-volume data transfers between LHC grid sites; streaming very high resolution movies
- #19: http://www.geant.net/Media_Centre/Media_Library/Pages/Maps.aspx
- #21: Examples: high-volume data transfers between LHC grid sites; streaming very high resolution movies