ºÝºÝߣ

ºÝºÝߣShare a Scribd company logo
Performance and Fault Tolerance
for the Net?ix API
Ben Christensen
Software Engineer ¨C API Platform at Net?ix
@benjchristensen
http://www.linkedin.com/in/benjchristensen




http://techblog.net?ix.com/
Net?ix API




Dependency A              Dependency B           Dependency C



   Dependency D               Dependency E           Dependency F



       Dependency G              Dependency H            Dependency I



               Dependency J          Dependency K            Dependency L



                  Dependency M            Dependency N          Dependency O



                      Dependency P            Dependency Q              Dependency R
Net?ix API




Dependency A              Dependency B           Dependency C



   Dependency D               Dependency E           Dependency F



       Dependency G              Dependency H            Dependency I



               Dependency J          Dependency K            Dependency L



                  Dependency M            Dependency N          Dependency O



                      Dependency P            Dependency Q              Dependency R
Dozens of dependencies.

    One going bad takes everything down.


99.99%30          = 99.7% uptime
     0.3% of 1 billion = 3,000,000 failures

            2+ hours downtime/month
even if all dependencies have excellent uptime.

          Reality is generally worse.
Performance and Fault Tolerance for the Netflix API
Performance and Fault Tolerance for the Netflix API
Performance and Fault Tolerance for the Netflix API
No single dependency should
 take down the entire app.

         Fallback.
         Fail silent.
          Fail fast.

        Shed load.
Options

Aggressive Network Timeouts

   Semaphores (Tryable)

     Separate Threads

      Circuit Breaker
Performance and Fault Tolerance for the Netflix API
Tryable semaphores for ¡°trusted¡± clients and fallbacks

       Separate threads for ¡°untrusted¡± clients

 Aggressive timeouts on threads and network calls
             to ¡°give up and move on¡±

       Circuit breakers as the ¡°release valve¡±
Performance and Fault Tolerance for the Netflix API
Performance and Fault Tolerance for the Netflix API
Performance and Fault Tolerance for the Netflix API
Performance and Fault Tolerance for the Netflix API
30 rps x 0.2 seconds = 6 + breathing room = 10 threads

Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)

                Thread-pool Size + Queue Size

                      Queuing is Not Free
Cost of Thread @ 75rps
  median - 90th - 99th (time in ms)




                 Time for thread to execute   Time user thread waited
Net?ix DependencyCommand Implementation
Net?ix DependencyCommand Implementation

              Fallbacks

               Cache
         Eventual Consistency
            Stubbed Data
           Empty Response
Net?ix DependencyCommand Implementation
So, how does it work in the real world?
Visualizing Circuits in Near-Realtime
    (latency is single-digit seconds, generally 1-2)




        Video available at
  https://vimeo.com/33576628
Rolling 10 second counters


1 minute latency percentiles




  2 minute rate change



circle color and size represent
   health and traf?c volume
API Daily Incoming vs Outgoing

Weekend                                        Weekend               Weekend




              8-10 Billion DependencyCommand Executions (threaded)




                        1.2 - 1.6 Billion Incoming Requests
API Hourly Incoming vs Outgoing

 Peak at 700M+ threaded DependencyCommand executions (200k+/second)




              Peak at 100M+ incoming requests (30k+/second)
Performance and Fault Tolerance for the Netflix API
Fallback.
Fail silent.
 Fail fast.

Shed load.
Net?ix API




Dependency A              Dependency B           Dependency C



   Dependency D               Dependency E           Dependency F



       Dependency G              Dependency H            Dependency I



               Dependency J          Dependency K            Dependency L



                  Dependency M            Dependency N          Dependency O



                      Dependency P            Dependency Q              Dependency R
Single Network Request from Clients
     (use LAN instead of WAN)

  Send Only The Bytes That Matter
 (optimize responses for each client)

       Leverage Concurrency
  (but abstract away its complexity)
Single Network Request from Clients
     (use LAN instead of WAN)




                     Device
                              Server


                                       Net?ix API
      landing page requires
       ~dozen API requests
Single Network Request from Clients
        (use LAN instead of WAN)




some clients are limited in the number of
   concurrent network connections
Single Network Request from Clients
         (use LAN instead of WAN)




network latency makes this even worse
(mobile, home, wi?, geographic distance, etc)
Single Network Request from Clients
     (use LAN instead of WAN)




                Device
                Server

                         Net?ix API




  push call pattern to server ...
Single Network Request from Clients
     (use LAN instead of WAN)




                Device
                Server

                         Net?ix API




 ... and eliminate redundant calls
Send Only The Bytes That Matter
         (optimize responses for each client)




                                              Net?ix API
                       Device
                                Server
Client                                   Client




           part of client now on server
Send Only The Bytes That Matter
              (optimize responses for each client)




                                                   Net?ix API
                            Device
                                     Server
     Client                                   Client




client retrieves and delivers exactly what their
       device needs in its optimal format
Send Only The Bytes That Matter
            (optimize responses for each client)




                    Device
                             Server
                                               Net?ix API

                                               Service Layer

Client                                Client




         interface is now a Java API that client
            interacts with at a granular level
Leverage Concurrency
         (but abstract away its complexity)




                Device
                         Server
                                           Net?ix API

                                           Service Layer

Client                            Client
Leverage Concurrency
         (but abstract away its complexity)




                Device
                         Server
                                           Net?ix API

                                           Service Layer

Client                            Client




   no synchronized, volatile, locks, Futures or
Atomic*/Concurrent* classes in client-server code
Leverage Concurrency
                         (but abstract away its complexity)

Service calls are    def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
all asynchronous     def video2Call = api.getVideos(api.getUser(), 6789543);

                     // higher-order functions used to compose asynchronous calls together
                     wx.merge(video1Call, video2Call).toList().subscribe([
    Functional
                         onNext: {
  programming                listOfVideos ->
 with higher-order           for(video in listOfVideos) {
     functions                   response.getWriter().println("video: " + video.id + " " + video.title);
                             }
                         },
                         onError: {
                             exception ->
                             response.setStatus(500);
                             response.getWriter().println("Error: " + exception.getMessage());
                         }
                     ])




              Fully asynchronous API - Clients can¡¯t block
Device
                             Server

                                      Net?ix API




Optimize for each device. Leverage the server.
Net?ix is Hiring
                              http://jobs.net?ix.com




Fault Tolerance in a High Volume, Distributed System
     http://techblog.net?ix.com/2012/02/fault-tolerance-in-high-volume.html



          Making the Net?ix API More Resilient
    http://techblog.net?ix.com/2011/12/making-net?ix-api-more-resilient.html



              Why REST Keeps Me Up At Night
 http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/



                         Ben Christensen
                           @benjchristensen
                  http://www.linkedin.com/in/benjchristensen

More Related Content

Performance and Fault Tolerance for the Netflix API

  • 1. Performance and Fault Tolerance for the Net?ix API Ben Christensen Software Engineer ¨C API Platform at Net?ix @benjchristensen http://www.linkedin.com/in/benjchristensen http://techblog.net?ix.com/
  • 2. Net?ix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  • 3. Net?ix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  • 4. Dozens of dependencies. One going bad takes everything down. 99.99%30 = 99.7% uptime 0.3% of 1 billion = 3,000,000 failures 2+ hours downtime/month even if all dependencies have excellent uptime. Reality is generally worse.
  • 8. No single dependency should take down the entire app. Fallback. Fail silent. Fail fast. Shed load.
  • 9. Options Aggressive Network Timeouts Semaphores (Tryable) Separate Threads Circuit Breaker
  • 11. Tryable semaphores for ¡°trusted¡± clients and fallbacks Separate threads for ¡°untrusted¡± clients Aggressive timeouts on threads and network calls to ¡°give up and move on¡± Circuit breakers as the ¡°release valve¡±
  • 16. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads Thread-pool Queue size: 5-10 (0 doesn't work but get close to it) Thread-pool Size + Queue Size Queuing is Not Free
  • 17. Cost of Thread @ 75rps median - 90th - 99th (time in ms) Time for thread to execute Time user thread waited
  • 19. Net?ix DependencyCommand Implementation Fallbacks Cache Eventual Consistency Stubbed Data Empty Response
  • 21. So, how does it work in the real world?
  • 22. Visualizing Circuits in Near-Realtime (latency is single-digit seconds, generally 1-2) Video available at https://vimeo.com/33576628
  • 23. Rolling 10 second counters 1 minute latency percentiles 2 minute rate change circle color and size represent health and traf?c volume
  • 24. API Daily Incoming vs Outgoing Weekend Weekend Weekend 8-10 Billion DependencyCommand Executions (threaded) 1.2 - 1.6 Billion Incoming Requests
  • 25. API Hourly Incoming vs Outgoing Peak at 700M+ threaded DependencyCommand executions (200k+/second) Peak at 100M+ incoming requests (30k+/second)
  • 27. Fallback. Fail silent. Fail fast. Shed load.
  • 28. Net?ix API Dependency A Dependency B Dependency C Dependency D Dependency E Dependency F Dependency G Dependency H Dependency I Dependency J Dependency K Dependency L Dependency M Dependency N Dependency O Dependency P Dependency Q Dependency R
  • 29. Single Network Request from Clients (use LAN instead of WAN) Send Only The Bytes That Matter (optimize responses for each client) Leverage Concurrency (but abstract away its complexity)
  • 30. Single Network Request from Clients (use LAN instead of WAN) Device Server Net?ix API landing page requires ~dozen API requests
  • 31. Single Network Request from Clients (use LAN instead of WAN) some clients are limited in the number of concurrent network connections
  • 32. Single Network Request from Clients (use LAN instead of WAN) network latency makes this even worse (mobile, home, wi?, geographic distance, etc)
  • 33. Single Network Request from Clients (use LAN instead of WAN) Device Server Net?ix API push call pattern to server ...
  • 34. Single Network Request from Clients (use LAN instead of WAN) Device Server Net?ix API ... and eliminate redundant calls
  • 35. Send Only The Bytes That Matter (optimize responses for each client) Net?ix API Device Server Client Client part of client now on server
  • 36. Send Only The Bytes That Matter (optimize responses for each client) Net?ix API Device Server Client Client client retrieves and delivers exactly what their device needs in its optimal format
  • 37. Send Only The Bytes That Matter (optimize responses for each client) Device Server Net?ix API Service Layer Client Client interface is now a Java API that client interacts with at a granular level
  • 38. Leverage Concurrency (but abstract away its complexity) Device Server Net?ix API Service Layer Client Client
  • 39. Leverage Concurrency (but abstract away its complexity) Device Server Net?ix API Service Layer Client Client no synchronized, volatile, locks, Futures or Atomic*/Concurrent* classes in client-server code
  • 40. Leverage Concurrency (but abstract away its complexity) Service calls are def video1Call = api.getVideos(api.getUser(), 123456, 7891234); all asynchronous def video2Call = api.getVideos(api.getUser(), 6789543); // higher-order functions used to compose asynchronous calls together wx.merge(video1Call, video2Call).toList().subscribe([ Functional onNext: { programming listOfVideos -> with higher-order for(video in listOfVideos) { functions response.getWriter().println("video: " + video.id + " " + video.title); } }, onError: { exception -> response.setStatus(500); response.getWriter().println("Error: " + exception.getMessage()); } ]) Fully asynchronous API - Clients can¡¯t block
  • 41. Device Server Net?ix API Optimize for each device. Leverage the server.
  • 42. Net?ix is Hiring http://jobs.net?ix.com Fault Tolerance in a High Volume, Distributed System http://techblog.net?ix.com/2012/02/fault-tolerance-in-high-volume.html Making the Net?ix API More Resilient http://techblog.net?ix.com/2011/12/making-net?ix-api-more-resilient.html Why REST Keeps Me Up At Night http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/ Ben Christensen @benjchristensen http://www.linkedin.com/in/benjchristensen

Editor's Notes

  1. \n
  2. The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming. \n\nMore than 1 billion incoming calls per day are received which in turn fans out to several billion outgoing calls (averaging a ratio of 1:7) to dozens of underlying subsystems with peaks of over 200k dependency requests per second. \n
  3. First half of the presentation discusses resilience engineering implemented to handle failure and latency at the integration points with the various dependencies. \n
  4. Even when all dependencies are performing well the aggregate impact of even 0.01% downtime on each of dozens of services equates to potentially hours a month of downtime if not engineered for resilience. \n
  5. \n
  6. \n
  7. \n
  8. It is a requirement of high volume, high availability applications to build fault and latency tolerance into their architecture and not expect infrastructure to solve it for them. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. Sample of 1 dependency circuit for 12 hours from production cluster with a rate of 75rps on a single server. \n\nEach execution occurs in a separate thread with median, 90th and 99th percentile latencies shown in the first 3 legend values. \n\nThe calling thread median, 90th and 99th percentiles are the last 3 legend values. \n\nThus, the median cost of the thread is 1.62ms - 1.57ms = 0.05ms, at the 90th it is 4.57-2.05 = 2.52ms. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. Second half of the presentation discusses architectural changes to enable optimizing the API for each Netflix device as opposed to a generic one-size-fits-all API which treats all devices the same. \n
  29. Netflix has over 800 unique devices that fall into several dozens classes with unique user experiences, different calling patterns, capabilities and needs from the data and thus the API. \n
  30. The one-size-fits-all API results in chatty clients, some requiring ~dozen requests to render a page. \n
  31. \n
  32. \n
  33. The client should make a single request and push the 'chatty' part to the server where low-latency networks and multi-core servers can perform the work far more efficiently. \n
  34. \n
  35. The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server which then can access a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the devices exact requirements and user experience. \n
  36. The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server which then can access a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the devices exact requirements and user experience. \n
  37. The client now extends over the network barrier and runs a portion in the server itself. The client sends requests over HTTP to its other half running in the server which then can access a Java API at a very granular level to access exactly what it needs and return an optimized response suited to the devices exact requirements and user experience. \n
  38. Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using high-order-functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support that lends itself well to the functional programming style. \n
  39. Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using high-order-functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support that lends itself well to the functional programming style. \n
  40. Concurrency is abstracted away behind an asynchronous API and data is retrieved, transformed and composed using high-order-functions (such as map, mapMany, merge, zip, take, toList, etc). Groovy is used for its closure support that lends itself well to the functional programming style. \n
  41. The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints that are optimized to their client applications and devices.\n
  42. \n