This document discusses the Brazilian National Education and Research Network (RNP) and its role in supporting large-scale science in Brazil. It provides an overview of RNP, a non-profit organization maintained with federal public funding that builds and operates research and education networks, and covers network infrastructure, international collaborations, requirements for data-intensive research, and recommendations for ensuring adequate network performance.
RNP 5th J-PAS 11-Nov-2012
1. RNP
Brazilian National Education and Research Network
ICT Support for Large-scale Science
5th J-PAS collaboration meeting / 11-Sep-2012
Leandro N. Ciuffo Alex S. Moura
leandro.ciuffo@rnp.br alex.moura@rnp.br
3. Qualified as a non-profit Social Organization (OS)
Maintained by federal public resources: the Government budget includes items to cover network costs and also RNP operating costs
Supported by MCTI, MEC and now MinC (Culture)
Additional projects supported by sectorial funds, directly or through the management contract, plus MS (Health)
RNP is monitored by MCTI, CGU and TCU
Research Units of the Ministry of S&T shown on the slide: INPE - Instituto Nacional de Pesquisas Espaciais; LNCC - Laboratório Nacional de Computação Científica; INPA - Instituto Nacional de Pesquisas da Amazônia
http://www.mct.gov.br/index.php/content/view/741.html
4. Build and operate R&E networks
Maintenance and continued renewal of infrastructure
RNP backbone of 2000 has been renewed 3 times (2004, 2005 and 2011), with large increases in maximum link capacity from 25 Mbps to 10 Gbps (factor of 400)
Metro networks have been built in capital cities to provide access to the Point of Presence (PoP) at 1 Gbps or more
www.redecomep.rnp.br
International capacity has increased from 300 Mbps in 2002 to over 20 Gbps since 2009 (factor of 70). RNP has also played a major role in building the RedCLARA (Latin American regional) network, linking R&E networks from more than 12 countries
www.redclara.net
Testbed networks for network experimentation, especially project GIGA (with CPqD) since 2003 and the EU-BR FIBRE project (2011-2014)
5. Ipê Network - RNP Backbone
Map of the Ipê backbone, with Points of Presence in state capitals including Boa Vista, Macapá, Manaus, Fortaleza, Salvador, Brasília, São Paulo, Rio de Janeiro, Florianópolis and Porto Alegre
Bandwidth - Minimum: 20 Mbps / Maximum: 10 Gbps / Aggregated: 250 Gbps
External connections: Commodity Internet, RedCLARA (to Europe), Americas Light (to USA)
http://www.rnp.br/backbone/
7. Metro Networks
23 cities operational
6 under deployment
13 planned
1,980 km
Diagram: Institutions A, B and C connect through the metro network to the RNP PoP, and from there to the IP network
http://redecomep.rnp.br
12. Science paradigms evolution
Data-intensive research: unifying theory, experiment and simulation at scale ("big data")
Computational simulations: simulating complex phenomena (in silico)
Theoretical modeling: e.g. Kepler's and Newton's laws
Empirical science: describing natural phenomena
13. Key components of a new research infrastructure
Diagram components: scientific portal; local services and workflows; big data processing; information visualization; publishers and users; registering, publishing, harvesting & indexing and discovering services; data repositories; instruments
15. Network Requirements Workshop
1. What science is being done?
2. What instruments and facilities are used?
3. What is the process/workflow of science?
4. How are the use of instruments and facilities, the process of science, and other aspects of the science going to change over the next 5 years?
5. What is coming beyond 5 years out?
6. Are new instruments or facilities being built, or are there other significant changes coming?
20. Hybrid Networks
Since the beginning of the Internet, NRENs have provided the routed IP service
Around 2002, NRENs began to provide two network services:
- routed IP (traditional Internet)
- end-to-end virtual circuits (a.k.a. lightpaths)
The lightpath service is intended for users with high QoS needs, usually guaranteed bandwidth, which is implemented by segregating their traffic from the general routed IP traffic.
The GLIF organisation (www.glif.is) coordinates international lightpath provisioning
21. High bandwidth research connectivity
(lightpaths for supporting international collaboration)
GLIF world map, 2011 http://www.glif.is
22. GLIF links in South America
RNP networks:
- Ipê backbone (29,000 km)
- metro networks in state capitals
- GIGA optical testbed, from RNP and CPqD, links 20 research institutions in 7 cities (750 km)
- KyaTera research network in S. Paulo links research institutions in 11 cities (1,500 km)
25. Why?
R&E networks in Brazil, and especially RNP, are funded by government agencies to provide quality network services to the national R&E community
In most cases, this is normally handled by providing R&E institutions with a connection to our networks, which operate standard Internet services of good quality
However, there are times when this is not enough
26. Network Requirements and Expectations
Expected data transfer rates
As a first step in improving your network performance, it is critical to have a baseline understanding of what speed you should expect from your network connection under ideal conditions.
The following shows roughly how long it takes to transfer 1 Terabyte of data across networks of various speeds:
10 Mbps network: 300 h (12.5 days)
100 Mbps network: 30 h
1 Gbps network: 3 h
10 Gbps network: 20 min
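These times correspond to realistic rather than theoretical throughput; as a rough check, assuming about 75% effective link utilisation (an assumption, not a figure from the slide) reproduces them closely:

# back-of-the-envelope transfer time for 1 TB (= 8e12 bits) at ~75% effective utilisation
awk -v rate_mbps=100 'BEGIN {
  seconds = 8e12 / (rate_mbps * 1e6 * 0.75)
  printf "%.0f hours (%.1f days)\n", seconds/3600, seconds/86400
}'
# rate_mbps=10 gives ~296 h (~12 days); 100 gives ~30 h; 1000 gives ~3 h; 10000 gives ~18 min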
28. Inadequate performance for critical applications
In some cases, the standard Internet services are not good enough for high-performance or data-intensive projects.
Sensitive to perturbations caused by security devices:
- Numerous cases of firewalls causing problems
- Often difficult to diagnose
- Router filters can often provide equivalent security without the performance impact
Science and Enterprise network requirements are in conflict
30. Remedies which can be applied
Tuning of networking software is generally necessary on high-bandwidth and long-latency data connections, because of the peculiarities of TCP implementations
In the case of high QoS requirements it is often necessary to use lightpaths, to avoid interference with cross traffic
In many cases, both these approaches are required
31. The Cipó Experimental Service
We are now beginning to deploy dynamic circuits as an experimental service on our network
This will also interoperate with similar services in other networks
32. Getting support
If you need advice or assistance with these network problems, it is important to get in touch with network support:
1. At your own institution
2. At your state network provider
www.rnp.br/pops/index.php
3. In the case of specific circuit (lightpath) services, you may contact RNP directly at pd@rnp.br
37. Network Diagnostic Tool (NDT)
Test your bandwidth from your computer to an RNP PoP:
São Paulo: http://ndt.pop-sp.rnp.br
Rio de Janeiro: http://ndt.pop-rj.rnp.br
Florianópolis: http://ndt.pop-sc.rnp.br
38. Recommended Approach
On a high-speed network it takes less time to transfer 1 Terabyte of data than one might expect.
It is usually sub-optimal to try to get 900 megabits per second of throughput on a 1 gigabit per second network path in order to move one or two terabytes of data per day. The disk subsystem can also be a bottleneck: simple storage systems often have trouble filling a 1 gigabit per second pipe.
In general it is not a good idea to try to completely saturate the network, as you will likely end up causing problems for both yourself and others trying to use the same link. A good rule of thumb is that for periodic transfers it should be straightforward to get throughput equivalent to 1/4 to 1/3 of a shared path that has nominal background load.
For example, if you know your receiving host is connected to 1 Gbps Ethernet, then a target speed of 150-200 Mbps is reasonable. You can adjust the number of parallel streams (as described on the tools page) that you are using to achieve this, as in the sketch below.
Many labs and large universities are connected at speeds of at least 1 Gbps, and most LANs are at least 100 Mbps, so if you don't get at least a reasonable fraction of these rates, something is probably wrong.
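One way to see what your path actually delivers with parallel streams is a memory-to-memory test with iperf; a minimal sketch, assuming an iperf server is already running at the far end (the host name is a placeholder):

# on the remote host: start an iperf server
iperf -s
# on the local host: 4 parallel TCP streams for 30 seconds, reporting every 5 seconds
iperf -c <remote-host> -P 4 -t 30 -i 5

Increase or decrease -P until the aggregate throughput sits in the 150-200 Mbps target range mentioned above.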
39. Performance using TCP
There are 3 important variables (among others) that affect TCP performance: packet loss, latency (or RTT - Round Trip Time), and buffer/window size. All are interrelated.
The optimal buffer size is twice the bandwidth*delay product of the link/connection, which equals bandwidth times RTT:
buffer size = bandwidth x RTT
e.g.: if the result of ping is 50 ms and the end-to-end network is all 1G or 10G Ethernet, the TCP receive buffer (an operating system parameter) should be:
0.05 s x (1 Gbit/s / 8 bits per byte) = 6.25 MBytes
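The same calculation can be repeated for any link; a minimal sketch using the example values above (50 ms RTT, 1 Gbps):

# bandwidth-delay product: buffer (bytes) = bandwidth (bits/s) x RTT (s) / 8
awk -v rtt_ms=50 -v gbps=1 'BEGIN {
  printf "%.2f MBytes\n", (rtt_ms/1000) * gbps * 1e9 / 8 / 1e6
}'
# -> 6.25 MBytes for 50 ms RTT on a 1 Gbps path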
40. TCP Congestion Avoidance Algorithms
The TCP reno congestion avoidance algorithm was the default in all TCP implementations for many years. However, as networks got faster and faster it became clear that reno would not work well for high bandwidth-delay product networks. To address this, a number of new congestion avoidance algorithms were developed, including:
reno: traditional TCP used by almost all other operating systems (default)
cubic: CUBIC-TCP
bic: BIC-TCP
htcp: Hamilton TCP
vegas: TCP Vegas
westwood: optimized for lossy networks
Most Linux distributions now use cubic by default, and Windows now uses Compound TCP. If you are using an older version of Linux, be sure to change the default from reno to cubic or htcp.
More details can be found at:
http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm
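On Linux the algorithm in use can be inspected and changed through sysctl (the same knob the DTN tuning slide sets later); a minimal sketch:

# list the algorithms currently available to this kernel
sysctl net.ipv4.tcp_available_congestion_control
# show the algorithm currently in use
sysctl net.ipv4.tcp_congestion_control
# switch the running system to htcp (add the line to /etc/sysctl.conf to make it persistent)
sysctl -w net.ipv4.tcp_congestion_control=htcp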
42. MTU Issues
Jumbo Ethernet frames can increase performance by a factor of 2-4.
The ping tool can be used to verify the MTU size. For example, on Linux you can do:
ping -s 8972 -M do -c 4 10.200.200.12
(8972 bytes of payload plus 28 bytes of IP and ICMP headers fill a 9000-byte jumbo frame; -M do forbids fragmentation)
Other tools that can help verify the MTU size are scamper and tracepath
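tracepath is the quickest of these for spotting where the path MTU drops; a minimal sketch (the destination and the eth2 interface name are placeholders):

# tracepath reports the discovered path MTU (pmtu) hop by hop
tracepath <destination-host>
# the local interface MTU itself can be read from sysfs
cat /sys/class/net/eth2/mtu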
43. Say No to scp: Why you should avoid scp over a WAN
In a Unix environment scp, sftp and rsync are commonly used to copy data between hosts.
While these tools work fine in a local environment, they perform poorly on a WAN.
The OpenSSH versions of scp and sftp have a built-in 1 MB buffer (previously only 64 KB in OpenSSH older than version 4.7) that severely limits performance on a WAN.
rsync is not part of the OpenSSH distribution, but typically uses ssh as transport (and is subject to the limitations imposed by the underlying ssh implementation).
DO NOT USE THESE TOOLS if you need to transfer large data sets across a network path with an RTT of more than around 25 ms.
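The reason is that a fixed window caps throughput at roughly window/RTT, so the penalty grows with distance; a minimal sketch using the 53 ms RTT from the next slide:

# throughput ceiling imposed by a fixed window: rate <= window / RTT
awk -v window_bytes=1048576 -v rtt_ms=53 'BEGIN {
  printf "~%.0f Mbps\n", window_bytes * 8 / (rtt_ms/1000) / 1e6
}'
# 1 MB window / 53 ms  -> ~158 Mbps ceiling
# 64 KB window / 53 ms -> ~10 Mbps ceiling (OpenSSH older than 4.7)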
44. Why you should avoid scp over a WAN
(cont.)
The following results are typical: scp is 10x slower
than single stream GridFTP, and 50x slower than
parallel GridFTP.
Sample Results
Berkeley, CA to Argonne, IL (near Chicago).
RTT = 53 ms, network capacity = 10Gbps.
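For comparison, a parallel GridFTP transfer like the one quoted above can be driven with the globus-url-copy client; a minimal sketch (the endpoints, path and stream/buffer values are placeholders, and it assumes GridFTP servers such as the DTNs described later are already in place):

# 4 parallel streams (-p), 16 MB TCP buffer (-tcp-bs), per-transfer throughput report (-vb)
globus-url-copy -vb -p 4 -tcp-bs 16777216 \
    file:///data/file.tar \
    gsiftp://dtn.example.org/data/file.tar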
46. A Simple Science DMZ
A simple Science DMZ has several essential components. These include dedicated access to high-performance wide area networks and advanced services infrastructures, high-performance network equipment, and dedicated science resources such as Data Transfer Nodes. The slide shows a diagram of a simple Science DMZ with these components and data paths.
47. Science DMZ: Supercomputer Center Network
The diagram below illustrates a simplified supercomputer center network. While this may not look much like the previous simple Science DMZ diagram, the same principles are used in its design.
48. Science DMZ: Big Data Site
For sites that handle very large data volumes (e.g. for big experiments such as the LHC), individual data
transfer nodes are not enough.
Data transfer clusters are needed: groups of machines serve data from multi-petabyte data stores.
The same principles of the Science DMZ apply - dedicated systems are used for data transfer, and the
path to the wide area is clean, simple, and easy to troubleshoot. Test and measurement are integrated in
multiple locations to enable fault isolation. This network is similar to the supercomputer center example in
that the wide area data path covers the entire network front-end.
49. Data Transfer Node (DTN)
Computer systems used for wide area data transfers perform far better if they are purpose-built and dedicated to the function of wide area data transfer. These systems, which we call Data Transfer Nodes (DTNs), are typically PC-based Linux servers built with high-quality components and configured specifically for wide area data transfer.
ESnet has assembled a reference implementation of a host that can be deployed as a DTN or as a high-speed GridFTP test machine.
The host can fill a 10 Gbps network connection with disk-to-disk data transfers using GridFTP.
The total cost of this server was around $10K, or $12.5K with the more expensive RAID controller. If your DTN node is used only as a data cache using RAID0, instead of as a reliable storage server using RAID5, you can get by with the less expensive RAID controller.
Key aspects of the configuration include a recent Linux distribution and the tuning described on the following slides.
50. DTN Hardware Description
Chassis: AC SuperMicro SM-936A-R1200B 3U 19" rack case with dual 1200W PS
Motherboard: SuperMicro X8DAH+F version 1.0c
CPU: 2 x Intel Xeon Nehalem E5530 2.4GHz
Memory: 6 x 4GB DDR3-1066MHz ECC/REG
I/O Controller: 2 x 3ware SAS 9750SA-8i (about $600) or 3ware SAS 9750-24i4e (about $1500)
Disks: 16 x Seagate 500GB SAS HDD 7,200 RPM ST3500620SS
Network Controller: Myricom 10G-PCIE2-8B2-2S+E
Linux Distribution
Most recent distribution of CentOS Linux
Install 3ware driver: http://www.3ware.com/support/download.asp
Install ext4 utilities: yum install e4fsprogs.x86_64
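One step the slide leaves implicit is putting a filesystem on the RAID volumes and mounting them; a minimal sketch (the /dev/sdb device name matches the read-ahead commands on the next slide, and the mount options and paths are assumptions):

# create an ext4 filesystem on a RAID volume and mount it for data transfers
mkfs.ext4 /dev/sdb
mkdir -p /storage/data1
mount -o noatime /dev/sdb /storage/data1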
51. DTN Tuning
Add to /etc/sysctl.conf, then run sysctl -p
# standard TCP tuning for 10GE
net.core.rmem_max = 33554432
net.core.wmem_max = 33554432
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 250000
Add to /etc/rc.local
#Increase the size of data the kernel will read ahead (this favors sequential reads)
/sbin/blockdev --setra 262144 /dev/sdb
/sbin/blockdev --setra 262144 /dev/sdc
/sbin/blockdev --setra 262144 /dev/sdd
# increase txqueuelen
/sbin/ifconfig eth2 txqueuelen 10000
/sbin/ifconfig eth3 txqueuelen 10000
# make sure cubic and htcp are loaded
/sbin/modprobe tcp_htcp
/sbin/modprobe tcp_cubic
# set default to htcp
/sbin/sysctl net.ipv4.tcp_congestion_control=htcp
# with the Myricom 10G NIC, increasing interrupt coalescing helps a lot:
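After applying the tuning it is worth confirming that the settings actually took effect; a minimal check using the interface and disk names from the script above:

# confirm TCP buffer limits and the congestion control algorithm in use
sysctl net.core.rmem_max net.ipv4.tcp_rmem net.ipv4.tcp_congestion_control
# confirm the transmit queue length and the block-device read-ahead
cat /sys/class/net/eth2/tx_queue_len
blockdev --getra /dev/sdb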
52. DTN Tuning (cont.)
Tools
Install a data transfer tool such as GridFTP - see the
GridFTP quick start page. Information on other tools can be
found on the tools page.
Performance Results for this configuration
Back-to-Back Testing using GridFTP
- memory to memory, 1 10GE NIC: 9.9 Gbps
- memory to memory, 4 10GE NICs: 38 Gbps
- disk to disk: 9.6 Gbps (1.2 GBytes/sec) using large files on
all 3 disk partitions in parallel
53. References (1/3)
TCP Performance Tuning for WAN Transfers - NASA HECC Knowledge Base
http://www.nas.nasa.gov/hecc/support/kb/TCP-Performance-Tuning-for-WAN-Transfers_137.html
Google's software-defined/OpenFlow backbone drives WAN links to 100 per cent utilization - Computerworld
http://www.computerworld.com.au/article/427022/google_software-defined_openflow_backbone_drives_wan_l
Achieving 98Gbps of Crosscountry TCP traffic using 2.5 hosts, 10 x 10G NICs, and 10 TCP streams
http://www.internet2.edu/presentations/jt2012winter/20120125-Pouyoul-JT-lighting.pdf
Tutorials / Talks
Achieving the Science DMZ: Eli Dart, Eric Pouyoul, Brian Tierney, and Joe Breen, Joint Techs, January 2012 (watch the webcast). Tutorial in 4 sections: Overview and Architecture, Building a Data Transfer Node, Bulk Data Transfer Tools and perfSONAR, Case Study: University of Utah's Science DMZ
How to Build a Low Cost Data Transfer Node: Eric Pouyoul, Brian Tierney and Eli Dart, Joint Techs, July 2011
High Performance Bulk Data Transfer (includes TCP tuning tutorial): Brian Tierney and Joe Metzger, Joint Techs, July 2010
Science Data Movement: Deployment of a Capability: Eli Dart, Joint Techs, January 2010
Bulk Data Transfer Tutorial: Brian Tierney, September 2009
Internet2 Performance Workshop, current slides
SC06 Tutorial on High Performance Networking: Phil Dykstra, Nov 2006
54. References (2/3)
Papers
O'Reilly ONLamp Article on TCP Tuning
Tuning
PSC TCP performance tuning guide
SARA Server Performance Tuning Guide
Troubleshooting
Fermilab Network Troubleshooting Methodology
Geant2 Network Tuning Knowledge Base
Network and OS Tuning
Linux IP Tuning Info
Linux TCP Tuning Info
A Comparison of Alternative Transport protocols
55. References (3/3)
Network Performance measurement tools
Convert Bytes/Sec to bits/sec, etc.
Measurement Lab Tools
Speed Guide's performance tester and TCP analyzer (mostly useful for home users)
ICSI's Netalyzr
CAIDA Taxonomy
SLAC Tool List
iperf vs ttcp vs nuttcp comparison
Sally Floyd's list of Bandwidth Estimation Tools
Linux Foundation's TCP Testing Page
Others
bufferbloat.net: Site devoted to pointing out the problems with large network buffers on slower networks, such as
homes or wireless.
56. Thank you / Obrigado!
Leandro Ciuffo - leandro.ciuffo@rnp.br
Alex Moura - alex@rnp.br
Twitter: @RNP_pd