Netflix: Embracing the Cloud
Neil Hunt, CPO / Yury Izrailevsky, VP Engineering
2012 re:Invent Netflix: embracing the cloud final
Netflix - Service Unavailable - Database Crashed

Rest assured that the right people
are losing sleep to fix this problem!

We expect to resume service in approximately 72h


12 Aug 2008 03:12am
Availability: 4 x nines
Scale: Unconstrained horizontal scaling
Performance: Unlimited compute
 Experimented with both
 Ended up with NoSQL for almost everything important
Transitional Infrastructure: Roman Riding
Phase                    Components                      Data & Prerequisites
Trial (2009)             Streaming Player                Content keys (RO)
                                                         Membership status (RO)
Development (2010-11)    Member product pages and APIs   Content catalog (RW)
                                                         Personalization data (RW) & recs algorithms
                                                         AB Test data (RW)
Followthrough (2011-12)  Account and membership          Membership data (RW)
Final (2013)             Payments                        PCI and SOX data
Scalability   Performance   Availability
Scaling Netflix Streaming Service: Weekly Streaming Starts
(chart; x-axis: monthly ticks from 1/4/2009 to 8/4/2012)
Netflix Cross-Regional Cloud Architecture
Goal: Regional Failover
Building Global Netflix Streaming Product
Scalability   Performance   Availability
Weekly Cloud Cost Per Streaming Start (last 12 months)
(chart)
Simian Army: Cloud Efficiency Automation
   Janitor Monkey
     Regularly scrape unused capacity
     Clean up instances, ASGs, ELBs, SGs, etc.
   Efficiency Monkey
     AI-based resource under-usage detection (CPU, memory, etc.)
   Automated Deletion of Old Data
     TTL for S3 (using ObjectExpiration)
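The "TTL for S3" item refers to S3 Object Expiration, which is configured as a lifecycle rule on a bucket. The sketch below only builds such a rule as a plain dictionary. The rule ID, the "logs/" prefix and the 30-day age are hypothetical values, not taken from the slides. The dictionary follows the shape that boto3's put_bucket_lifecycle_configuration accepts as its LifecycleConfiguration argument, but no AWS call is made here.

```python
import json


def expiration_rule(rule_id: str, prefix: str, days: int) -> dict:
    """Build one lifecycle rule that expires objects under `prefix` after `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


def lifecycle_configuration(rules) -> dict:
    """Wrap rules in the top-level structure of a bucket lifecycle configuration."""
    return {"Rules": list(rules)}


# Hypothetical example: expire objects under logs/ after 30 days.
config = lifecycle_configuration([expiration_rule("expire-old-logs", "logs/", 30)])
print(json.dumps(config, indent=2))
```

Keeping the rule as data makes it easy to review before it is applied to a real bucket.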
Cyclical Streaming Usage Pattern
(chart)
Load-Based Auto Scaling
   Move to Load-Based Scaling
   Scale up/down by 70%+
   50%+ Cost Saving
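A rough sketch of the idea behind load-based scaling: size the fleet from current load instead of provisioning for peak all day. The function name, the per-instance capacity, the 20% headroom and the hourly load figures are all assumptions made for illustration. The slides do not describe the actual scaling policy.

```python
import math


def desired_instances(load_rps: int, rps_per_instance: int,
                      headroom_pct: int = 20,
                      min_instances: int = 1, max_instances: int = 1000) -> int:
    """Instances needed to serve `load_rps` with some headroom, clamped to bounds."""
    needed = math.ceil(load_rps * (100 + headroom_pct) / (100 * rps_per_instance))
    return max(min_instances, min(max_instances, needed))


# Hypothetical cyclical load (requests/sec) sampled every 4 hours over one day.
hourly_load = [2000, 1000, 1500, 4000, 8000, 10000]

scaled = [desired_instances(load, rps_per_instance=100) for load in hourly_load]
fixed = [max(scaled)] * len(scaled)  # provisioning for peak around the clock

saving = 1 - sum(scaled) / sum(fixed)
print(f"instances per sample: {scaled}")
print(f"instance-hours saved vs. fixed peak provisioning: {saving:.0%}")
```

The saving in this toy run depends entirely on the made-up load curve. It illustrates where the slide's figures come from, not their size.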
Scalability   Performance   Availability
A Truly Great Service Has To Just Work!

Availability Goal: 99.99%
(30 secs/week at peak traffic)
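For reference, an availability target translates into a downtime budget as (1 - availability) x window length. The sketch below computes that budget over a full calendar week. The slide's 30-second figure is scoped to peak traffic, and the deck does not show how it was derived, so the full-week number here is context only, not a correction.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800


def downtime_budget_seconds(availability: float,
                            window_seconds: int = SECONDS_PER_WEEK) -> float:
    """Allowed downtime within the window for a given availability target."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * window_seconds


budget = downtime_budget_seconds(0.9999)
print(f"99.99% over a full week allows about {budget:.1f} seconds of downtime")
```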
Using Redundancy in AWS Infrastructure to Survive Failures
Historical Streaming Availability (13wkMA)
(chart; x-axis: weekly ticks from 7/17/2011 to 11/11/2012; annotations: "Other AWS Outages" and "AWS / Netflix Outage, June 29th, 2012")
Cascading Failures
   API
   Instant Queue
   SimpleDB
Netflix Cloud Architecture
(diagram)
Cascading Failures
   99% Availability x 99% Availability x 99% Availability
   (99%)^300 = 4.90%
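The arithmetic on this slide can be checked directly. The sketch assumes that a request needs every one of the dependencies to succeed and that their failures are independent. The exponent 300 comes from the slide and is read here as the number of such dependencies.

```python
def serial_availability(per_service: float, count: int) -> float:
    """Availability of a chain in which all `count` services must be up."""
    return per_service ** count


overall = serial_availability(0.99, 300)
print(f"{overall:.2%}")  # 4.90%
```

Even highly available components compound into a very unreliable whole when every one of them is a hard dependency.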
Strategies to Improve Availability
   Graceful Degradation
   Redundancy
Graceful Degradation
(diagram)
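One common way to implement graceful degradation is to wrap each call to a dependency with a fallback that returns a simpler, non-personalized result when the dependency fails. The sketch below illustrates that general pattern only. Every function and value in it is hypothetical, and it is not taken from the Netflix codebase.

```python
from typing import Callable, List


def with_fallback(primary: Callable[..., List[str]],
                  fallback: Callable[..., List[str]]) -> Callable[..., List[str]]:
    """Return a callable that uses `fallback` whenever `primary` raises."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade instead of propagating the failure to the caller.
            return fallback(*args, **kwargs)
    return call


def fetch_personalized_rows(member_id: str) -> List[str]:
    # Hypothetical dependency; raising simulates an outage.
    raise TimeoutError("personalization service unavailable")


def fetch_popular_rows(member_id: str) -> List[str]:
    # Hypothetical static fallback that needs no downstream call.
    return ["Popular on Netflix"]


get_rows = with_fallback(fetch_personalized_rows, fetch_popular_rows)
rows = get_rows("member-123")
print(rows)
```

The caller still gets a usable response, so a failure in one dependency does not cascade upward.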
Redundancy
   Redundancy Across Availability Zones
     Zone A, Zone B, Zone C
   Storage Redundancy Across Regions, Vendors
     Cassandra (A, B, C)
     S3 Backup
     Secure Cloud Backup
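Redundancy works in the opposite direction from a serial chain: the service is down only if every replica is down at once. The sketch assumes zone failures are independent and that any single zone can carry the full load. Both are simplifications, and the 99% per-zone figure is borrowed from the cascading-failures slide purely for illustration.

```python
def redundant_availability(per_zone: float, zones: int) -> float:
    """Availability when at least one of `zones` independent replicas must be up."""
    return 1.0 - (1.0 - per_zone) ** zones


three_zones = redundant_availability(0.99, 3)
print(f"{three_zones:.4%}")
```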
Testing Fault Tolerance: Simian Army
   Chaos Monkey
   Latency Monkey
   Chaos Gorilla
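Chaos Monkey randomly terminates running instances so that services are forced to tolerate instance loss. The sketch below simulates one round of that idea against an in-memory fleet. It makes no cloud API calls, the group names and instance IDs are invented, and it is not the open-source Chaos Monkey implementation.

```python
import random
from typing import Dict, List


def run_chaos_round(groups: Dict[str, List[str]], rng: random.Random,
                    probability: float = 1.0) -> Dict[str, str]:
    """Remove one random instance from each non-empty group with the given probability.

    Returns a mapping of group name to the instance that was terminated.
    """
    terminated = {}
    for name, instances in groups.items():
        if instances and rng.random() < probability:
            victim = rng.choice(instances)
            instances.remove(victim)
            terminated[name] = victim
    return terminated


# Hypothetical fleet; a seeded generator keeps the simulation repeatable.
fleet = {"api": ["i-a1", "i-a2", "i-a3"], "queue": ["i-q1", "i-q2"]}
terminated = run_chaos_round(fleet, random.Random(42))
print(terminated)
print(fleet)
```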
Open Source Portal at http://netflix.github.com
Superstorm Sandy

                   AWS Infrastructure Held Up


                   >2x Netflix Streaming Usage
                   in East Coast Markets
                      Boston
                      New York
                      Philadelphia
                      Baltimore
                      D.C.
Focus on Building a Great Streaming Product
Netflix at 2012 re:Invent

Date/Time         Presenter             Topic
Wed 8:30-10:00    Reed Hastings         Keynote with Andy Jassy
Wed 1:00-1:45     Coburn Watson         Optimizing Costs with AWS
Wed 2:05-2:55     Kevin McEntee         Netflix's Transcoding Transformation
Wed 3:25-4:15     Neil Hunt / Yury I.   Netflix: Embracing the Cloud
Wed 4:30-5:20     Adrian Cockcroft      High Availability Architecture at Netflix
Thu 10:30-11:20   Jeremy Edberg         Rainmakers: Operating Clouds
Thu 11:35-12:25   Kurt Brown            Data Science with Elastic Map Reduce (EMR)
Thu 11:35-12:25   Jason Chan            Security Panel: Learn from CISOs working with AWS
Thu 3:00-3:50     Adrian Cockcroft      Compute & Networking Masters Customer Panel
Thu 3:00-3:50     Ruslan M./Gregg U.    Optimizing Your Cassandra Database on AWS
Thu 4:05-4:55     Ariel Tseitlin        Intro to Chaos Monkey and the Simian Army
We are sincerely eager to
 hear your feedback on this
presentation and on re:Invent.

 Please fill out an evaluation
   form when you have a
            chance.


Editor's Notes

  • #26: Make clear it's still tentative, not a committed longer-term project