The Datacenter as a Computer
          Chapter 7


                       2009/12/20
                      id:daisukebe
Agenda
?   7 Dealing with Failures and Repairs
    ?   7.1 Implications of Software-Based Fault Tolerance
    ?   7.2 Categorizing Faults
        ?   7.2.1 Fault Severity
        ?   7.2.2 Causes of Service-Level Faults
    ?   7.3 Machine-Level Failures
        ?   7.3.1 What Causes Machine Crashes?
        ?   7.3.2 Predicting Faults
    ?   7.4 Repairs
    ?   7.5 Tolerating Faults, Not Hiding Them
7 Dealing with Failures and Repairs
?               =>

?              H/W

?   WSC                  H/W


?   MTBF of 30 years per server
    -> with 10000 servers, about 1 failure per 1 day

?   WSC
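
The MTBF figures on this slide can be checked with simple arithmetic. The following is a minimal sketch, assuming the "30" is a per-server MTBF in years, that the 10000 is the number of servers in the cluster, and that servers fail independently at a constant rate of 1/MTBF. The function name `expected_failures_per_day` is hypothetical and not taken from the slides.

```python
def expected_failures_per_day(num_servers: int, mtbf_years: float) -> float:
    """Expected number of server failures per day across a cluster.

    Assumes each server fails independently at a constant rate of
    1/MTBF (an illustrative simplification, not a measured model).
    """
    mtbf_days = mtbf_years * 365.0
    return num_servers / mtbf_days


# 10000 servers, each with an assumed 30-year MTBF
rate = expected_failures_per_day(10_000, 30)
print(f"{rate:.2f} failures/day")  # prints "0.91 failures/day"
```

Even with very reliable individual machines, a cluster of this size would see roughly one failure per day under these assumptions.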
7.1 Implications of Software-Based Fault Tolerance
?
        1
        2

?
    ?
    ?
?                     RAID
    =>

?
    =>

?   RAID        GFS

?
?          OS
?
?
?
    =>

?
Google

?                   DRAM

?        2000

?   =>

?                          100%


?                          ECC DRAM
ECC DRAM

?   ECC   Error Correction Code


                                                                 1
                       2
    via http://www.nec.co.jp/products/express/tech/memory/index.shtml
7.2 Categorizing Faults
~




?
?
?   WSC

?
7.2.1 Fault Severity
?
    ?   Corrupted:

    ?   Unreachable:


    ?   Degraded:


    ?   Masked:


?
?
    => 99.0%
    =>         99.0%


?
?
7.2.2 Causes of Service-Level Faults
?   Oppenheimer        500

    ?
    ?   H/W faults: 10-25% of failure events

?   Gray      Tandem

    ?   H/W -> 10%, software -> 60%, operations/maintenance -> 20%

?
Google
?   Oppenheimer

?
7.3 Machine-Level Failures
H/W
?   Google

    ?   95%    1       reboot

    ?   1%    reboot
?   reboot            55%    6

?   25% 6        30         1% 1

?            3

?                                  reboot
7.3.1 What Causes Machine Crashes?
?   DRAM

    ?   ECC

?
    ?
    ?
7.3.2 Predicting Faults
?   10                  100%


?   WSC


?   Pinheiro   Google


?   WSC
7.4 Repairs
WSC

?              WSC


?
?
    =>

?
    =>
Google

?    System Health


?

?
7.5 Tolerating Faults, Not Hiding Them
?
    =>


?
IT


?   24                   5-15%
          /

?   Google

    ?    WSC

    ?    40000        5% 200
Thank you!