What are the risks that may affect the availability of a data center
The availability of a data center is the share of time it operates without
failure. Availability is determined by a system's
reliability and its recovery time. Because system downtime can have a
major impact on a business, it is necessary to know which factors
can affect data center availability.
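One common way to make the link between reliability and recovery time concrete is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF (mean time between failures) stands in for reliability and MTTR (mean time to repair) for recovery time. The sketch below is illustrative only: the function names and the sample figures are assumptions, not values taken from any survey cited here.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail: float) -> float:
    """Expected downtime per year implied by an availability fraction."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: one failure every 4000 hours, 4 hours to recover.
    a = availability(mtbf_hours=4000, mttr_hours=4)
    print(f"availability: {a:.5f}")
    print(f"expected downtime: {annual_downtime_hours(a):.2f} hours/year")
```

Under this model, shortening recovery time improves availability just as much as making failures rarer, which is why both terms matter in the sections that follow.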
Generally, these factors fall into four categories:
Nature
Human
Utility
Equipment
Nature
Nature has one of the largest impacts on data center availability. We
cannot predict natural events, which can strike at any time and cause a
complete disaster: tornadoes, hurricanes, flooding, earthquakes, and so on.
Humans have very little control over natural calamities, which is why they can have
a major impact on data center availability. Maintaining data access in the event of
a disaster can mean the difference between a company's success or failure. Let us
look at some incidents that have occurred at various companies and
their data centers.
Lightning: They say lightning doesn't strike the same place twice, but in 2015 one of
Google's European data centers was struck by lightning not once, but four times,
causing errors in 5% of the disks responsible for Google Compute Engine (GCE)
instances. Although the company restored many of the drives, an estimated
0.000001% of data stored in the data center was irrecoverably lost. While that might
not sound like much, try telling that to the customers who were affected by it.
Hurricanes: According to National Geographic, 2017 was the most expensive
hurricane season in U.S. history, costing roughly $200 billion. With their combination
of high winds, storm surge, and heavy rains, hurricanes are one of the most
dangerous natural disasters data centers must contend with. The sudden flooding
resulting from Hurricane Sandy in 2012 caused extensive data center outages in New
York and New Jersey. These failures were made even worse by the fact that backup
systems were located in the same geographic region and were knocked out by the
same weather event.
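The Sandy example suggests a simple planning check: make sure the backup site is far enough from the primary that one weather event is unlikely to hit both. The sketch below uses the haversine great-circle distance. The 300 km minimum separation, the function names, and the sample coordinates are assumptions chosen for illustration, not a recommendation from any source cited here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def sites_too_close(primary, backup, min_separation_km: float = 300.0) -> bool:
    """True if two (lat, lon) sites are closer than the assumed minimum separation."""
    return distance_km(*primary, *backup) < min_separation_km


if __name__ == "__main__":
    new_york = (40.71, -74.01)  # approximate coordinates
    newark = (40.74, -74.17)
    chicago = (41.88, -87.63)
    print("New York / Newark too close:", sites_too_close(new_york, newark))
    print("New York / Chicago too close:", sites_too_close(new_york, chicago))
```

Distance is only a rough proxy. A real assessment would also look at shared flood plains, power grids, and network paths.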
Tornadoes: A devastating 2011 tornado ripped through several hospital buildings in
Joplin, Missouri, one of which was a data center. While none of the data lost was
mission-critical, that was only because most of the information stored there had
been migrated to a new offsite data center just a few weeks earlier. Hospital officials
noted that if the tornado had hit a month earlier, the data loss would have been
catastrophic and rendered the hospital completely inoperable.
Flooding: Severe flooding in Leeds, UK caused a Vodafone data center to temporarily
lose power during Christmas of 2015. While data loss was negligible, the power
outage disrupted mobile phone service temporarily. Vodafone, of course, has a bit of
history with flooding, having suffered one of the most infamous data center disasters
when its Istanbul data center was devastated by flooding in 2009.
Earthquakes: So far, data centers have been lucky. Modern architectural standards
and additional precautions (such as special enclosures and rollers for server racks)
have gone a long way towards protecting data centers from earthquakes, even in
high-risk areas.
The Unexpected: Disaster planning is all about expecting the unexpected. Take, for
instance, the squirrel that knocked Yahoo's Santa Clara data center offline for several
hours in 2010, or the truck that drove into a transformer feeding power into a
Rackspace data center in 2007.
Human
According to a survey conducted by Aperture Research Institute, human errors are
behind 57.3% of all data center outages. The second most common reason was
improper failover with 43.7%.
Above: Diagram from the Aperture survey.
Another survey is worth mentioning as well.
According to the Uptime Institute, 70% of data center outages are due to human error and not to a fault in
the infrastructure design. Furthermore, mistakes that lead to an outage can often be
traced to a poor decision by senior management.
The two organizations' results may differ because the surveys were
conducted on different entities and in different environments. Taken together, both
surveys suggest that human mistakes cause more data center outages
than any other factor. Some examples of human-caused data center issues:
Activation of the emergency power-off (EPO) switch
Adjusting the temperature from Fahrenheit to Celsius
Pulling power cords out of equipment
Overloading a circuit
Not following standard policies or procedures
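The Fahrenheit/Celsius mix-up in the list above is the kind of mistake a simple input check can catch before a setpoint reaches the cooling system. The sketch below is a hypothetical guard, not part of any real building-management API: the 18 to 27 °C acceptable range and all names are assumptions that a site would replace with its own policy.

```python
# Assumed acceptable supply-air range in Celsius; adjust to site policy.
MIN_SETPOINT_C = 18.0
MAX_SETPOINT_C = 27.0


def to_celsius(value: float, unit: str) -> float:
    """Convert a temperature to Celsius; unit must be 'C' or 'F'."""
    unit = unit.upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"unknown temperature unit: {unit!r}")


def validate_setpoint(value: float, unit: str) -> float:
    """Return the setpoint in Celsius, or raise if it is outside the assumed range."""
    celsius = to_celsius(value, unit)
    if not MIN_SETPOINT_C <= celsius <= MAX_SETPOINT_C:
        raise ValueError(
            f"setpoint {value} {unit} = {celsius:.1f} C is outside "
            f"{MIN_SETPOINT_C}-{MAX_SETPOINT_C} C; check the unit"
        )
    return celsius


if __name__ == "__main__":
    print(f"{validate_setpoint(72, 'F'):.1f}")  # 72 F is within the range
    try:
        validate_setpoint(72, "C")  # same number, wrong unit
    except ValueError as err:
        print("rejected:", err)
```

Requiring the unit to be stated explicitly, rather than inferred, is the design choice doing most of the work here.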
To minimize the risk of the human factor affecting operations, it is important to
have up-to-date documentation on everything connected to your data center and
manuals on how different critical operations should be performed. Manuals and
documentation together with scheduled tests should help you avoid many of the
problems and outages described in this survey.
Utility
For a data center, the primary utility is the electric power
drawn from local providers (which can be government or private
entities). The secondary power sources are the diesel generators and
UPS systems. All other mechanical systems in the data center depend, directly or indirectly,
on the availability of this power.
An Uptime Institute survey finds that the power usage effectiveness (PUE) of data centers is
better than ever. However, the same survey indicates that power outages
have increased significantly. The Global Data Center Survey report from Uptime
Institute gathered responses from nearly 900 data center operators and IT
practitioners, both from major data center providers and from private,
company-owned data centers (you can download the report from the link above).
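For readers unfamiliar with the metric, power usage effectiveness is the ratio of total facility energy to the energy consumed by IT equipment alone, so a value closer to 1.0 means less overhead for cooling, power conversion, and lighting. The sketch below shows the arithmetic. The function name and the sample meter readings are assumptions for illustration and are not taken from the survey.

```python
def power_usage_effectiveness(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """PUE = total facility energy / IT equipment energy (dimensionless, >= 1)."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    if total_facility_kwh < it_equipment_kwh:
        raise ValueError("total facility energy cannot be less than IT energy")
    return total_facility_kwh / it_equipment_kwh


if __name__ == "__main__":
    # Hypothetical monthly meter readings in kWh.
    pue = power_usage_effectiveness(total_facility_kwh=1_500_000,
                                    it_equipment_kwh=1_000_000)
    print(f"PUE: {pue:.2f}")
```

Note that PUE measures efficiency, not resilience, which is consistent with the survey finding that a better PUE can coexist with more outages.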
Even when all equipment is provisioned with redundancy, there is a chance that
these machines will not work as expected during an incident. One
example: diesel rotary uninterruptible power supply
(DRUPS) systems were implicated in power disruptions that in 2014 affected Amazon
Web Services in Sydney, a former Telecity facility called Sovereign House in London,
now owned by Digital Realty Trust, and the Singapore Stock Exchange. The disruption at
Amazon was caused by what the company called an "unusually long voltage sag". In
these incidents, the root cause of the outage was a
utility failure followed by backup machines failing to start. Some of the failure modes
reported in data center incidents are listed below:
Generator fails to start
Generator fails after X hours of running
Utility power partially fails (usually one of three phases: phase loss)
UPS fails to switch to battery
UPS fails to switch from battery to input power
These incidents show that periodic checks and preventive
maintenance are essential and go a long way toward reducing the
impact of failures.
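Periodic checks only help if someone notices when one has been missed. As a minimal sketch, the snippet below flags power-chain tests that are overdue. The check names and intervals are hypothetical placeholders; real intervals should come from the equipment vendor and site policy.

```python
from datetime import date, timedelta

# Hypothetical test intervals in days; not vendor recommendations.
TEST_INTERVAL_DAYS = {
    "generator_start": 30,
    "generator_load_run": 90,
    "ups_battery_transfer": 180,
}


def overdue_checks(last_done: dict[str, date], today: date) -> list[str]:
    """Return the checks that were never run or whose interval has elapsed."""
    overdue = []
    for check, interval in TEST_INTERVAL_DAYS.items():
        last = last_done.get(check)
        if last is None or today - last > timedelta(days=interval):
            overdue.append(check)
    return overdue


if __name__ == "__main__":
    history = {
        "generator_start": date(2024, 5, 20),
        "generator_load_run": date(2024, 1, 1),
        # "ups_battery_transfer" has never been recorded
    }
    for check in overdue_checks(history, today=date(2024, 6, 1)):
        print("overdue:", check)
```

Each check maps to one of the failure modes listed above: a generator that fails to start, a generator that fails under sustained load, and a UPS that fails to transfer.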
Equipment
Data center infrastructure is a large collection of
equipment, and success depends on all of it working efficiently together. Any
electrical, mechanical, cooling, networking, or server equipment can
fail unexpectedly. Whether it's a server reaching the end
of its five-year expected lifespan or a UPS backup battery dying before it should,
equipment failure is one of the most common causes of data center outages.
With today's powerful data center infrastructure management (DCIM) tools, facilities
can monitor the overall health of their own equipment as well as colocated assets.
While it may not be possible to predict every failure, sophisticated algorithms can
monitor equipment performance continually to anticipate when hardware is
reaching the end of its lifecycle or is prone to break down. When these problems are
identified, data center personnel can plan to switch out faulty or outdated
equipment without having to take critical systems offline. With the
right redundancies, backups, and emergency spares in place, even an unexpected
failure can be managed without compromising network performance.
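To give a feel for what anticipating end of life can mean in practice, the sketch below fits a straight line to periodic health readings (for example, UPS battery capacity in percent) and estimates how many days remain before the trend crosses a replacement threshold. This is a deliberately simple least-squares illustration, not how any particular DCIM product works. The function name, the readings, and the 80% threshold are all assumptions.

```python
from typing import Optional, Sequence


def days_until_threshold(days: Sequence[float],
                         readings: Sequence[float],
                         threshold: float) -> Optional[float]:
    """Estimate days from the last reading until a linear trend hits threshold.

    Returns None when the readings are flat or improving.
    """
    n = len(days)
    if n < 2 or n != len(readings):
        raise ValueError("need at least two readings, one per day value")
    mean_x = sum(days) / n
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    if sxx == 0:
        raise ValueError("day values must not all be identical")
    # Ordinary least-squares slope and intercept.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, readings)) / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        return None  # no downward trend to extrapolate
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


if __name__ == "__main__":
    # Hypothetical battery capacity (%) measured every 30 days.
    sample_days = [0, 30, 60, 90]
    capacity = [100.0, 98.0, 96.0, 94.0]
    remaining = days_until_threshold(sample_days, capacity, threshold=80.0)
    print(f"estimated days until replacement threshold: {remaining:.0f}")
```

Real equipment rarely degrades linearly, so an estimate like this is best treated as a prompt to schedule an inspection rather than as a precise prediction.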
Source : www.vxchnge.com & www.pingdom.com
Have a comment or points to be reviewed? Knowledge is power, and it grows by
sharing. Feel free to comment.