The document discusses Netflix's API architecture and how it achieves fault tolerance and high performance. It describes how the API is composed of dozens of dependencies, each of which can fail independently, and outlines Netflix's approaches to preventing any single dependency from taking down the entire application: fallbacks, failing silent, failing fast, shedding load, aggressive timeouts, tryable semaphores, separate threads, and circuit breakers. It notes that the API executes over 10 billion dependency commands per day while serving over 1 billion incoming requests.
Performance and Fault Tolerance for the Netflix API
1. Performance and Fault Tolerance
for the Netflix API
Ben Christensen
Software Engineer - API Platform at Netflix
@benjchristensen
http://www.linkedin.com/in/benjchristensen
http://techblog.netflix.com/
2. Netflix API
[Diagram: the Netflix API fanning out to backend dependencies A through R]
3. Netflix API
[Diagram: the same fan-out of dependencies A through R]
4. Dozens of dependencies.
One going bad takes everything down.
99.99%^30 = 99.7% uptime
0.3% of 1 billion = 3,000,000 failures
2+ hours downtime/month
even if all dependencies have excellent uptime.
Reality is generally worse.
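Spelled out: if each of 30 dependencies is up 99.99% of the time, the chance that all of them are up at once is 0.9999^30 ≈ 0.997, i.e. roughly 0.3% downtime. 0.3% of a 30-day month (about 43,200 minutes) is on the order of 130 minutes, which is where the "2+ hours downtime/month" figure comes from, and 0.3% of 1 billion requests is 3,000,000 failed requests.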
8. No single dependency should
take down the entire app.
Fallback.
Fail silent.
Fail fast.
Shed load.
11. Tryable semaphores for "trusted" clients and fallbacks
Separate threads for "untrusted" clients
Aggressive timeouts on threads and network calls
to "give up and move on"
Circuit breakers as the "release valve"
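A minimal sketch of how these pieces can fit together, assuming hypothetical names throughout (DependencyCommand, the 200ms timeout, the trip threshold, and the remote call are illustrative, not Netflix's actual implementation): each dependency call runs on its own bounded thread pool with an aggressive timeout, every failure path degrades to a fallback, a full pool sheds load, and a crude failure counter stands in for a real circuit breaker. For "trusted" in-process work, a java.util.concurrent.Semaphore with tryAcquire() could gate concurrency instead of a separate thread.

    import java.util.concurrent.*;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative sketch: isolate each dependency call, time it out aggressively,
    // fall back on any failure, and fail fast once the circuit has tripped.
    public class DependencyCommand {
        private static final ExecutorService POOL =
                new ThreadPoolExecutor(10, 10, 1, TimeUnit.MINUTES,
                        new LinkedBlockingQueue<>(5));      // small bounded queue
        private static final AtomicInteger CONSECUTIVE_FAILURES = new AtomicInteger();
        private static final int TRIP_THRESHOLD = 20;       // illustrative value

        public String execute() {
            if (CONSECUTIVE_FAILURES.get() >= TRIP_THRESHOLD) {
                return fallback();                           // circuit open: fail fast
            }
            Future<String> future;
            try {
                future = POOL.submit(this::callDependency);  // separate thread
            } catch (RejectedExecutionException full) {      // pool and queue full: shed load
                CONSECUTIVE_FAILURES.incrementAndGet();
                return fallback();
            }
            try {
                String result = future.get(200, TimeUnit.MILLISECONDS); // give up and move on
                CONSECUTIVE_FAILURES.set(0);                 // success closes the circuit
                return result;
            } catch (TimeoutException | ExecutionException | InterruptedException e) {
                future.cancel(true);
                CONSECUTIVE_FAILURES.incrementAndGet();
                return fallback();                           // fail silent with a default
            }
        }

        private String callDependency() {
            return "live response";                          // placeholder for the network call
        }

        private String fallback() {
            return "default response";                       // static / cached / degraded data
        }
    }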
16. 30 rps x 0.2 seconds = 6 + breathing room = 10 threads
Thread-pool Queue size: 5-10 (0 doesn't work but get close to it)
Thread-pool Size + Queue Size
Queuing is Not Free
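Applying the sizing rule above (30 requests/second x 0.2 seconds per request ≈ 6 concurrent calls, rounded up to 10 for breathing room, plus a small bounded queue), a pool could be configured roughly like this; the numbers come from the slide, the rest is an illustrative sketch:

    import java.util.concurrent.*;

    class DependencyPoolConfig {
        // 30 rps * 0.2 s latency ≈ 6 in flight; 10 threads gives breathing room.
        // Keep the queue small (5-10): a deep queue just hides latency under load.
        static final ExecutorService DEPENDENCY_POOL = new ThreadPoolExecutor(
                10, 10,                              // fixed pool of 10 threads
                1, TimeUnit.MINUTES,                 // keep-alive (irrelevant for a fixed pool)
                new LinkedBlockingQueue<>(10),       // bounded queue
                new ThreadPoolExecutor.AbortPolicy() // reject (shed load) rather than pile up
        );
    }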
17. Cost of Thread @ 75rps
[Chart legend: median - 90th - 99th percentile (time in ms); time for thread to execute vs. time user thread waited]
28. Netflix API
[Diagram: the Netflix API and its backend dependencies A through R]
29. Single Network Request from Clients
(use LAN instead of WAN)
Send Only The Bytes That Matter
(optimize responses for each client)
Leverage Concurrency
(but abstract away its complexity)
30. Single Network Request from Clients
(use LAN instead of WAN)
Device
Server
Netflix API
landing page requires
~dozen API requests
31. Single Network Request from Clients
(use LAN instead of WAN)
some clients are limited in the number of
concurrent network connections
32. Single Network Request from Clients
(use LAN instead of WAN)
network latency makes this even worse
(mobile, home, wifi, geographic distance, etc.)
33. Single Network Request from Clients
(use LAN instead of WAN)
Device
Server
Netflix API
push call pattern to server ...
34. Single Network Request from Clients
(use LAN instead of WAN)
Device
Server
Netflix API
... and eliminate redundant calls
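A rough illustration of pushing the call pattern to the server, with hypothetical type and method names throughout (GranularApi, HomePageEndpoint, etc. are not Netflix's actual code): the device makes one WAN request, the server fans out to the granular calls concurrently over the LAN, and a single composed response comes back. This plain-Futures version is only for clarity; the later slides show how Netflix hides such concurrency behind an asynchronous, functional API instead.

    import java.util.List;
    import java.util.concurrent.*;

    // Hypothetical server-side endpoint: one request from the device,
    // many granular calls fanned out in parallel on the server.
    public class HomePageEndpoint {

        record User(String id, String name) {}
        record Video(int id, String title) {}
        record HomePageResponse(User user, List<Video> recommendations,
                                List<Video> continueWatching) {}

        interface GranularApi {
            User getUser(String userId);
            List<Video> getRecommendations(String userId);
            List<Video> getContinueWatching(String userId);
        }

        private final ExecutorService pool = Executors.newFixedThreadPool(10);
        private final GranularApi api;

        public HomePageEndpoint(GranularApi api) { this.api = api; }

        public HomePageResponse handle(String userId) throws Exception {
            // The ~dozen calls the device used to make over the WAN now run here, in parallel.
            Future<User> user        = pool.submit(() -> api.getUser(userId));
            Future<List<Video>> recs = pool.submit(() -> api.getRecommendations(userId));
            Future<List<Video>> cw   = pool.submit(() -> api.getContinueWatching(userId));
            // One optimized response goes back over the WAN instead of many round trips.
            return new HomePageResponse(user.get(), recs.get(), cw.get());
        }
    }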
35. Send Only The Bytes That Matter
(optimize responses for each client)
Netflix API
Device
Server
Client Client
part of client now on server
36. Send Only The Bytes That Matter
(optimize responses for each client)
Netflix API
Device
Server
Client Client
client retrieves and delivers exactly what their
device needs in its optimal format
37. Send Only The Bytes That Matter
(optimize responses for each client)
Device
Server
Netflix API
Service Layer
Client Client
interface is now a Java API that client
interacts with at a granular level
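A compressed illustration of the same idea (again with hypothetical names, assuming a Java 16+ style): the device-specific client code running on the server calls the granular service-layer API directly and returns only the fields that device class actually renders.

    import java.util.List;

    // Hypothetical device-specific adapter running inside the API server.
    public class TvClientAdapter {

        // Full record exposed by the service layer.
        record Video(int id, String title, String boxartUrlHd, String boxartUrlSd, String synopsis) {}
        // Trimmed shape a 10-foot TV UI needs; everything else never leaves the server.
        record TvRowItem(int id, String title, String boxartUrlHd) {}

        interface ServiceLayer {
            List<Video> topVideosForUser(String userId, int count);
        }

        private final ServiceLayer services;

        public TvClientAdapter(ServiceLayer services) { this.services = services; }

        public List<TvRowItem> renderRow(String userId) {
            return services.topVideosForUser(userId, 10).stream()
                    .map(v -> new TvRowItem(v.id(), v.title(), v.boxartUrlHd()))
                    .toList();
        }
    }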
38. Leverage Concurrency
(but abstract away its complexity)
Device
Server
Netflix API
Service Layer
Client Client
39. Leverage Concurrency
(but abstract away its complexity)
Device
Server
Netflix API
Service Layer
Client Client
no synchronized, volatile, locks, Futures or
Atomic*/Concurrent* classes in client-server code
40. Leverage Concurrency
(but abstract away its complexity)
Service calls are all asynchronous.
Functional programming with higher-order functions.

    def video1Call = api.getVideos(api.getUser(), 123456, 7891234);
    def video2Call = api.getVideos(api.getUser(), 6789543);

    // higher-order functions used to compose asynchronous calls together
    wx.merge(video1Call, video2Call).toList().subscribe([
        onNext: { listOfVideos ->
            for (video in listOfVideos) {
                response.getWriter().println("video: " + video.id + " " + video.title);
            }
        },
        onError: { exception ->
            response.setStatus(500);
            response.getWriter().println("Error: " + exception.getMessage());
        }
    ])

Fully asynchronous API - clients can't block.
41. Device
Server
Netflix API
Optimize for each device. Leverage the server.
42. Netflix is Hiring
http://jobs.netflix.com
Fault Tolerance in a High Volume, Distributed System
http://techblog.netflix.com/2012/02/fault-tolerance-in-high-volume.html
Making the Netflix API More Resilient
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
Why REST Keeps Me Up At Night
http://blog.programmableweb.com/2012/05/15/why-rest-keeps-me-up-at-night/
Ben Christensen
@benjchristensen
http://www.linkedin.com/in/benjchristensen
Editor's Notes
The Netflix API serves all streaming devices and acts as the broker between backend Netflix systems and the user interfaces running on the 800+ devices that support Netflix streaming.

More than 1 billion incoming calls per day are received, which in turn fan out to several billion outgoing calls (a ratio averaging about 1:7) to dozens of underlying subsystems, with peaks of over 200,000 dependency requests per second.
The first half of the presentation discusses the resilience engineering implemented to handle failure and latency at the integration points with the various dependencies.
Even when all dependencies are performing well, the aggregate impact of just 0.01% downtime on each of dozens of services equates to potentially hours of downtime a month if the system is not engineered for resilience.
High-volume, high-availability applications must build fault and latency tolerance into their architecture rather than expect the infrastructure to solve it for them.
Sample of one dependency circuit over 12 hours from a production cluster, at a rate of 75rps on a single server.

Each execution occurs in a separate thread, with median, 90th, and 99th percentile latencies shown in the first 3 legend values.

The calling thread's median, 90th, and 99th percentiles are the last 3 legend values.

Thus, the median cost of the thread is 1.62ms - 1.57ms = 0.05ms; at the 90th percentile it is 4.57ms - 2.05ms = 2.52ms.
The second half of the presentation discusses architectural changes that enable optimizing the API for each Netflix device, as opposed to a generic one-size-fits-all API that treats all devices the same.
Netflix supports over 800 unique devices that fall into several dozen classes, each with unique user experiences, different calling patterns, and different capabilities and needs from the data, and thus from the API.
The one-size-fits-all API results in chatty clients, some requiring a dozen or so requests to render a page.
The client should make a single request and push the 'chatty' part to the server, where low-latency networks and multi-core servers can perform the work far more efficiently.
The client now extends over the network barrier and runs a portion of itself on the server. The client sends requests over HTTP to its other half running in the server, which can then access a Java API at a very granular level to fetch exactly what it needs and return an optimized response suited to the device's exact requirements and user experience.
Concurrency is abstracted away behind an asynchronous API, and data is retrieved, transformed, and composed using higher-order functions (such as map, mapMany, merge, zip, take, and toList). Groovy is used for its closure support, which lends itself well to the functional programming style.
The Netflix API is becoming a platform that empowers user-interface teams to build their own API endpoints, optimized for their client applications and devices.