5. COMPLEXITY
Systems will grow in complexity, which will raise the likelihood of failure
unless steps are taken to mitigate and manage complexity.
6. COMPLEXITY
Q. Do more professional, larger and better-funded teams have fewer failures?
(i.e. imposter syndrome?)
A. Let's find out at https://outage.report
(*Screenshot 8/22/19)
7. COMPLEXITY
Q. Wow, all that failure must be expensive! Is there any way we can be more
confident we are not going to fail?
A. Let's run controlled experiments where we try to identify weaknesses
before they crash the system.
10. CHAOS ENGINEERING
The harder it is to disrupt the steady state, the more
confidence we have in the behavior of the system. If
a weakness is uncovered, we now have a target for
improvement before that behavior manifests in the
system at large.
-https://principlesofchaos.org
11. CHAOS ENGINEERING EXPECTED BENEFITS
Less downtime, better user experience
Fewer alarms and alerts (i.e. less burnout) for Operations/SRE/Development teams
More productivity from fewer unplanned outages
Spreading knowledge of the application across the team
12. CHAOS ENGINEERING
1. Define the normal/steady state of the system (monitor system and business metrics)
Hypothesize that the steady state will continue in both the control and experiment groups
2. Pseudo-randomly inject faults (kill containers, network, etc.) simulating real-world events
Try to disprove the hypothesis by looking for differences between the control and experiment groups
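The steps above can be sketched in a few lines of Python. This is a toy simulation, not a real tool: the service, the latency range (80-120 ms), the 400 ms injected delay, and the 10% tolerance are all assumptions chosen for illustration. It measures a steady-state metric (mean latency) for a control group, injects faults pseudo-randomly into an experiment group, and checks whether the hypothesis survives.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class ExperimentResult:
    control_mean_ms: float
    experiment_mean_ms: float
    hypothesis_holds: bool


def simulated_request(rng: random.Random, fault_rate: float) -> float:
    """Hypothetical service call: returns a latency in milliseconds."""
    latency_ms = rng.uniform(80.0, 120.0)  # assumed normal latency range
    if rng.random() < fault_rate:          # pseudo-random fault injection
        latency_ms += 400.0                # assumed injected network delay
    return latency_ms


def steady_state(rng: random.Random, fault_rate: float, requests: int) -> float:
    """Steady-state metric for one group: mean latency over many requests."""
    return statistics.mean(
        [simulated_request(rng, fault_rate) for _ in range(requests)]
    )


def run_experiment(fault_rate: float, requests: int = 1000,
                   tolerance: float = 0.10, seed: int = 42) -> ExperimentResult:
    rng = random.Random(seed)
    control = steady_state(rng, 0.0, requests)            # no faults injected
    experiment = steady_state(rng, fault_rate, requests)  # faults injected
    # Hypothesis: the experiment group stays within `tolerance` of control.
    holds = abs(experiment - control) <= tolerance * control
    return ExperimentResult(control, experiment, holds)


if __name__ == "__main__":
    print(run_experiment(fault_rate=0.2))
```

In this toy model nothing absorbs the injected delay, so a 20% fault rate disproves the hypothesis. In a real system the interesting question is whether timeouts, retries, or fallbacks keep the experiment group close to the control group.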
13. GAMEDAY EXAMPLE PER SERVICE
SAMPLE:
Failure Scenario: Application server latency
Experiment Scoping: Latency 100ms - 1000ms, 2500ms
Signals/Metrics: Service availability, on-call paging
Abort Conditions: SLA breach, RPS threshold reached
Expected Outcome: Application should still be available, but slower
Actual Outcome: Total app failure
Bugs: Fallback did not occur
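Abort conditions are easiest to honor when they are checked mechanically during the gameday. The following is a minimal sketch, assuming the signals are sampled as simple numbers. The field names, the thresholds, and the reading of "RPS threshold" as an upper bound are all hypothetical and not taken from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    availability: float  # fraction of successful requests, 0.0 to 1.0
    rps: float           # observed requests per second


@dataclass
class AbortConditions:
    min_availability: float  # assumed SLA floor
    max_rps: float           # assumed RPS ceiling


def abort_reasons(signals: Signals, conditions: AbortConditions) -> List[str]:
    """Return the abort conditions that are met (empty list: keep going)."""
    reasons = []
    if signals.availability < conditions.min_availability:
        reasons.append("SLA breach")
    if signals.rps >= conditions.max_rps:
        reasons.append("RPS threshold reached")
    return reasons


if __name__ == "__main__":
    limits = AbortConditions(min_availability=0.999, max_rps=5000.0)
    print(abort_reasons(Signals(availability=0.95, rps=1200.0), limits))
```

An experiment runner would call a check like this on every sampling interval and roll back the injected fault as soon as the list is non-empty.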
15. LIVE DEMO - SIMPLE CONTAINER EXPERIMENTS
Container Chaos experiment with Pumba
* Network delays!
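For readers without the live demo, here is a small sketch of how a network-delay experiment might be launched. It only builds the command line and does not run it. The command shape follows the `netem ... delay` example in the Pumba README. The container name `mydb`, the 3000 ms delay, and the 5 minute duration are assumptions, and the exact commands used in the demo are not recorded in the slides.

```python
import shlex
from typing import List


def pumba_delay_command(container: str, delay_ms: int, duration: str) -> List[str]:
    """Build (but do not run) a Pumba command that delays a container's
    outgoing packets by `delay_ms` milliseconds for `duration` (e.g. "5m")."""
    if delay_ms <= 0:
        raise ValueError("delay_ms must be positive")
    return [
        "pumba", "netem",
        "--duration", duration,
        "delay", "--time", str(delay_ms),
        container,
    ]


if __name__ == "__main__":
    # Hypothetical target container; pass the list to subprocess.run to execute.
    print(shlex.join(pumba_delay_command("mydb", 3000, "5m")))
```

Keeping the command as a list of arguments makes it easy to log, review, and limit to a single named container before anything is actually disrupted.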
16. LIVE DEMO - ADVANCED CONTAINER EXPERIMENTS
More advanced Chaos with ChaosBlade
* More Network Delays!
17. CHAOS ENGINEERING SIMPLIFIED REVIEW
01 Have a hypothesis and identify control and experimental groups
02 Use real-world events & limit scope
03 Make it as real as possible, ideally Production
04 Look for differences in steady state between control and experimental groups
18. EXAMPLE EXPERIMENTS
* Test slowdown of key services to identify dependency slowdown/intermittent failure
* Test failure for an Availability Zone/Region (Rack if on premise) to identify group component failover resiliency
* Test failure of a stack component to identify resiliency of individual components
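The first experiment, slowing down a key dependency, can be rehearsed in miniature before touching real infrastructure. The sketch below assumes a caller that guards a dependency with a timeout and a cached fallback. The function names, the 0.2 s timeout, and the return values are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for dependency calls (size is an arbitrary assumption).
_pool = ThreadPoolExecutor(max_workers=4)


def slow_dependency(injected_delay_s: float) -> str:
    """Hypothetical downstream service with an injected delay."""
    time.sleep(injected_delay_s)
    return "fresh"


def fetch_with_fallback(injected_delay_s: float, timeout_s: float = 0.2,
                        fallback: str = "cached") -> str:
    """Call the dependency, falling back if it exceeds the timeout."""
    future = _pool.submit(slow_dependency, injected_delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # degraded but still available


if __name__ == "__main__":
    print(fetch_with_fallback(0.0))  # dependency healthy
    print(fetch_with_fallback(0.6))  # dependency slowed past the timeout
```

If the second call hung or raised instead of returning the fallback, that would be the same kind of finding as "Fallback did not occur" in the gameday sample.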