2. Overview
What are the common causes of service outages?
Lessons learned from production incidents
Patterns that we use to make our services resilient
3. About me
Software engineer at OpenTable
Own >20 microservices in production
Currently on-call
4. About OpenTable
Connecting 40k restaurants to 21m diners each month
>2000 GitHub repositories
Hundreds of microservices
5. Developer looking at logs after a production outage
Sir Joseph Noel Paton
Oil on Canvas, 1861
Source: http://classicprogrammerpaintings.com/
7. Simple Testing Can Prevent Most Critical Failures
... we also found that for the most catastrophic failures, almost all of them are
caused by incorrect error handling, and 58% of them are trivial mistakes or can be
exposed by statement coverage testing.
http://dl.acm.org/citation.cfm?id=2685068
8.
Release It!: Design and Deploy Production-Ready Software. Michael T. Nygard
Drift into Failure: From Hunting Broken Components to Understanding Complex Systems. Sidney Dekker
Site Reliability Engineering: How Google Runs Production Systems. Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
Systems Performance: Enterprise and the Cloud. Brendan Gregg
9. Designing for Failure
Complex systems are rife with failure and are resistant to top-down control
Moving from eliminating failure to anticipating failure in every component
Software should be prepared for real-world production challenges and not
require constant life-support and human intervention
Build systems and organizations that improve over time, rather than merely not
degrading
Design for failure and operate to learn
10. Reliability vs Resilience
Reliability:
Stiff boundaries, layers
Defense in depth
Redundancy
Interference protection
Assurance
Accountability
Resilience:
Withstand transients
Recover swiftly and smoothly
Prioritize to serve high-level goals
Recognize and respond to anomalies
Adapt to change
Source: Cook 2012
11. Failure Modes
A failure is composed of a chain of cracks in the system: a failure mode
High levels of complexity provide more directions for the cracks to propagate
Tightly coupled architectures increase the chance of propagation
At each step in the chain, the crack can be accelerated, slowed, or stopped
Design failure modes that drive failures away from indispensable features
Source: Nygard 2007
12. Bulkheads
In a ship, bulkheads create watertight compartments, restrict fires, and separate cargo
Partition your systems to keep failure in one part from destroying everything
Requires more precise capacity planning
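One way to partition a service in code is to give each downstream dependency its own small thread pool, so a hung dependency can exhaust only its own compartment. The sketch below assumes Python's standard `concurrent.futures`; the `Bulkheads` class and any compartment names passed to it are hypothetical, chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Bulkheads:
    """One bounded thread pool per compartment (hypothetical helper)."""

    def __init__(self, pool_sizes):
        # pool_sizes maps a compartment name to its maximum worker count.
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, compartment, fn, *args):
        # Work queues up inside its own compartment only; a saturated
        # compartment cannot take workers away from another one.
        return self._pools[compartment].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

With `Bulkheads({"search": 1, "booking": 1})`, a call stuck in the `search` compartment delays later `search` work but leaves `booking` free. Each pool needs its own sizing, which is where the more precise capacity planning comes in.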
13. Waiting for the server response
Victor Vasnetsov, 1898
Oil, canvas
Source: http://classicprogrammerpaintings.com/
14. Timeouts
Never ever block forever
Set a timeout on any operation that can block threads
Prefer queue-and-retry to synchronous retries and use circuit breakers
Don't forget to clean up resources after a timeout
How high should timeouts be? Try the 99.9th percentile of response time
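A minimal sketch of both ideas, assuming an `asyncio`-based service: `percentile` derives a timeout from observed response times using the nearest-rank method, and `call_with_timeout` bounds an operation and always runs a cleanup callback. Both function names and their parameters are hypothetical.

```python
import asyncio
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * pct / 100)
    return ordered[max(rank, 1) - 1]


async def call_with_timeout(operation, timeout_s, cleanup, fallback=None):
    """Await operation() for at most timeout_s seconds.

    cleanup() runs whether the call succeeded, failed, or timed out.
    """
    try:
        # wait_for cancels the pending operation when the timeout expires.
        return await asyncio.wait_for(operation(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback
    finally:
        cleanup()
```

A caller might compute `timeout_s = percentile(recent_latencies_s, 99.9)` from a window of recent measurements and pass a `cleanup` that returns the connection to its pool.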
15. Resource Pools
Pool and reuse resources whenever possible to increase efficiency, isolate
failure, limit concurrency, separate workloads
Prefer several smaller pools to one large pool
Keep pool size as small as possible
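As an illustration, a fixed-size pool can be built on the standard `queue.Queue`. The `ResourcePool` class below is a hypothetical sketch, not a production pool: it creates all resources up front, and a caller that cannot get one within `timeout_s` receives `queue.Empty` instead of blocking forever.

```python
import queue
from contextlib import contextmanager


class ResourcePool:
    """Fixed-size pool of reusable resources (hypothetical helper)."""

    def __init__(self, factory, size):
        self._items = queue.Queue(maxsize=size)
        for _ in range(size):
            self._items.put(factory())

    @contextmanager
    def acquire(self, timeout_s):
        # Raises queue.Empty if no resource frees up within timeout_s.
        item = self._items.get(timeout=timeout_s)
        try:
            yield item
        finally:
            # Always hand the resource back, even if the caller raised.
            self._items.put(item)
```

Creating one such pool per workload, each as small as that workload allows, limits concurrency and keeps a slow workload from starving the others.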
16. Fail Fast
Slow failure responses tie up capacity, waste system resources, and cascade
If the system can determine in advance that it will fail at an operation, it's
always better to fail fast
Check that all resources are available and healthy before beginning a
transaction
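A small sketch of the pre-flight check, assuming each dependency exposes a cheap health probe. `run_transaction`, `FailFastError`, and the shape of `health_checks` are hypothetical names used only for illustration.

```python
class FailFastError(Exception):
    """Raised before any work starts when the operation cannot succeed."""


def run_transaction(health_checks, transaction):
    """Run transaction() only if every required resource reports healthy.

    health_checks maps a resource name to a zero-argument callable
    that returns True when the resource is usable.
    """
    unhealthy = [name for name, is_healthy in health_checks.items() if not is_healthy()]
    if unhealthy:
        # Reject immediately instead of failing slowly halfway through.
        raise FailFastError("unavailable: " + ", ".join(sorted(unhealthy)))
    return transaction()
```

The probes should themselves be fast, for example reading the state of a circuit breaker rather than making a fresh remote call.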
17. Load Shedding
Define operational limits of your system and withstand excessive load spikes
Shed load by rejecting excessive requests, executing a fallback method,
returning static data, or applying backpressure to the caller
Explicit backpressure with handshaking to signal to callers that a service is
overloaded
Implicit backpressure using blocking synchronous calls, semaphores, TCP
protocol
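One simple form of load shedding caps the number of in-flight requests and answers the excess with a fallback. The sketch below uses a non-blocking acquire on the standard `threading.BoundedSemaphore`; the `LoadShedder` class is hypothetical, and the limit would come from measuring the service's operational limits.

```python
import threading


class LoadShedder:
    """Reject work beyond a fixed concurrency limit (hypothetical helper)."""

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def handle(self, request_handler, fallback):
        # Non-blocking acquire: when the service is full, shed the request
        # right away rather than queueing it.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return request_handler()
        finally:
            self._slots.release()
```

In an HTTP service the fallback might return static data or a 503 response, which also acts as an explicit backpressure signal to the caller.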
18. Circuit Breakers
Electrical fuses: detect excess usage, fail first, and open the circuit
Wrap dangerous operations with a component that can circumvent calls when
the system is not healthy (the opposite of retries)
Automatically open the circuit when the error threshold is exceeded, provide a
fallback mechanism, and auto-recover when the system heals
Degrade application functionality in response to failure
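The behaviour described above can be sketched as a small state machine. This is a simplified, single-threaded illustration, not any particular library's implementation; `CircuitBreaker`, its parameters, and the injectable `clock` are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch (not thread-safe)."""

    def __init__(self, failure_threshold, reset_timeout_s, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                return fallback()  # open: skip the dangerous call entirely
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open the protected operation is never invoked, which gives the failing dependency room to recover and keeps the caller's threads free.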
22. Queuing Effects
In every system, exactly one constraint determines the system's capacity
Once it is reached, all other parts of the system will queue up or drop work
Response time = processing time + latency (time spent in the queue)
In practice queues are only found in two states: empty or full
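To make the response-time formula concrete, here is a deliberately simplified model: a single server, evenly spaced arrivals, and a constant service time. The function name and parameters are hypothetical, and real traffic is burstier than this.

```python
def response_times(arrival_interval, service_time, n_requests):
    """Response time of each request in a single-server FIFO queue.

    Response time = time spent waiting in the queue + processing time.
    """
    server_free_at = 0
    times = []
    for i in range(n_requests):
        arrived = i * arrival_interval
        started = max(arrived, server_free_at)  # wait if the server is busy
        server_free_at = started + service_time
        times.append(server_free_at - arrived)
    return times
```

When requests arrive more slowly than they are served, the queue stays empty and every response time equals the processing time. Once arrivals outpace the server, each request waits longer than the one before it, so the queue fills until work is dropped.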
23. Graceful Degradation
Define features that your service absolutely needs to provide
Route failure modes away from the critical path of these features
Feature flags to shut down parts of your service
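A hedged sketch of the idea: the critical feature is allowed to fail the request, while an optional feature sits behind a flag and degrades to an empty result. The function `restaurant_page`, the `recommendations` flag, and both fetchers are hypothetical examples.

```python
def restaurant_page(fetch_availability, fetch_recommendations, flags):
    """Build a page where only availability is on the critical path."""
    # Critical feature: errors propagate to the caller.
    page = {"availability": fetch_availability(), "recommendations": []}

    # Optional feature: can be switched off, and its failures are absorbed.
    if flags.get("recommendations", False):
        try:
            page["recommendations"] = fetch_recommendations()
        except Exception:
            pass  # degrade: serve the page without recommendations
    return page
```

Turning the flag off during an incident removes the optional dependency from the request path without a deploy.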
24. Steady State
The system should be able to run indefinitely without human intervention
Typical interventions: manual disk cleanups, nightly restarts
For every mechanism that accumulates a resource, some other mechanism
must recycle that resource
Purge old DB data, rotate log files, expire cache, decommission infrastructure
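As one example of pairing accumulation with recycling, the cache sketch below purges expired entries on every write, so it cannot grow without bound from stale data. `ExpiringCache` and its injectable `clock` are hypothetical.

```python
import time


class ExpiringCache:
    """Cache whose entries expire after ttl_s seconds (hypothetical helper)."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def put(self, key, value):
        self.purge()  # every write also recycles stale entries
        self._entries[key] = (self._clock() + self._ttl_s, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            self._entries.pop(key, None)
            return default
        return entry[1]

    def purge(self):
        now = self._clock()
        expired = [k for k, (expires_at, _) in self._entries.items() if expires_at <= now]
        for key in expired:
            del self._entries[key]

    def __len__(self):
        return len(self._entries)
```

The same pairing applies to log files and database rows: whatever writes them should ship together with the job that rotates or purges them.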
25. Service Autonomy
Expose yourself to latency as rarely as possible
Use asynchronous communication to reduce temporal coupling
No synchronous calls to other services on the request path
No transactions that span multiple services
Prefetch and cache reads, queue writes
Event sourcing and CQRS
SOA Saga Pattern and Service Choreography
Source: Dahan 2006
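"Prefetch and cache reads, queue writes" can be sketched as a client that never calls the remote service on the request path. The class `AutonomousClient` and the `remote_read` / `remote_write` callables are assumptions; `refresh` and `flush` are meant to run from a background worker.

```python
import queue


class AutonomousClient:
    """Serve reads from a local cache and defer writes (hypothetical helper)."""

    def __init__(self, remote_read, remote_write):
        self._remote_read = remote_read
        self._remote_write = remote_write
        self._cache = {}
        self._pending_writes = queue.Queue()

    def refresh(self, keys):
        # Background: prefetch values off the request path.
        for key in keys:
            self._cache[key] = self._remote_read(key)

    def read(self, key, default=None):
        # Request path: local lookup only, no remote latency.
        return self._cache.get(key, default)

    def write(self, key, value):
        # Request path: record locally and enqueue for later delivery.
        self._cache[key] = value
        self._pending_writes.put((key, value))

    def flush(self):
        # Background: deliver queued writes, returning how many were sent.
        sent = 0
        while True:
            try:
                key, value = self._pending_writes.get_nowait()
            except queue.Empty:
                return sent
            try:
                self._remote_write(key, value)
            except Exception:
                # Keep the write for the next flush. This sketch does not
                # preserve ordering across retries.
                self._pending_writes.put((key, value))
                return sent
            sent += 1
```

The trade-off is staleness: reads may lag the remote service until the next refresh, which the calling feature has to tolerate.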
26. Separation of Concerns
Keep gateways to third parties separate from the main transaction services
API gateways can be used to implement load shedding, timeouts, circuit breakers,
handshaking, failure-injection
Also a good place for security, metrics, logging, and other cross-cutting concerns
Particularly valuable for legacy systems
27. Understand your Platform
You don't have to be an engineer to be a racing driver, but you do have to have
Mechanical Sympathy.
Understand how hardware, OS, and VM work in order to create efficient software
Abstractions are leaky: CPU cores, caches, RAM, HDD, network, JVM, GC, thread
affinity, Docker, virtualization, data structures
Source: Mechanical Sympathy
28. Test for Failure
Build test harnesses that can provoke socket, protocol, and application errors
Run longevity tests using realistic data volumes to find steady state violations
Use traffic bursts to measure latency variance and queuing
Make failure a first-class citizen: game days, chaos monkey, failure injection
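A very small failure-injection harness might wrap a dependency call and raise an error for a configurable fraction of invocations. `FailureInjector` is a hypothetical helper for tests and game days, not the API of any existing chaos tool.

```python
import random


class FailureInjector:
    """Wrap callables so that a fraction of calls raise an injected error."""

    def __init__(self, failure_rate, make_error=None, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self._failure_rate = failure_rate
        self._make_error = make_error or (lambda: ConnectionError("injected failure"))
        self._rng = random.Random(seed)  # seed it for reproducible test runs

    def wrap(self, fn):
        def wrapper(*args, **kwargs):
            # random() is in [0, 1): rate 0.0 never fails, rate 1.0 always fails.
            if self._rng.random() < self._failure_rate:
                raise self._make_error()
            return fn(*args, **kwargs)
        return wrapper
```

Wrapping a client call with a rate of `1.0` in a test shows directly whether the caller's timeout, fallback, and circuit-breaker paths behave as intended.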
29. Incident Response
Define an incident management framework
Clear understanding of responsibilities: command, operational work,
communication, planning
Follow a systematic troubleshooting process
Blameless postmortems
30. Antifragile Organization
Banishing error also banishes innovation and adaptation
Trade the precise robustness of complicated systems for the sloppy resilience
of complex systems
Remove organizational scar tissue, clean out, automate, reduce handoffs
Diversity, loose coupling, slack, decentralized anticipation, communication
Operational discretion at lower levels in the organization
Regulation, compliance, oversight, inspection are mismatched to complexity
Source: Dekker 2011
31. GitHub Major Service Outage
Georges Seurat, 1884
Oil on canvas
Source: http://classicprogrammerpaintings.com/
32. References
Nygard, M. T. (2007). Release It!: Design and Deploy Production-Ready Software (Pragmatic Programmers)
Dekker, S. (2011). Drift into Failure: From Hunting Broken Components to Understanding Complex Systems
Beyer, B., Jones, C., Petoff, J., Murphy, N.R. (2016). Site Reliability Engineering: How Google Runs Production Systems
Cook, R. (2012). How Complex Systems Fail
Holtman, J., & Gunther, N. J. (2008). Getting in the Zone for Successful Scalability
Gunther, N. (2010). Quantifying Scalability FTW
Schwartz, B. (2015). Everything You Need To Know About Queueing Theory
Herbert, F. (2014). Planning for Overload
Thompson, M. (2012). Applying Back Pressure When Overloaded
Dean, J. (2012). Achieving Rapid Response Times in Large Online Services
Dahan, U. (2006). Autonomous Services and Enterprise Entity Aggregation
Thompson, M. Mechanical Sympathy
Rasmussen, J. (1997). Risk management in a dynamic society: a modelling problem
Andrus, K. (2015). Breaking Bad at Netflix: Building Failure as a Service
Tarjan, P. (2017). Scaling your API with rate limiters