THE CASE FOR CHAOS TESTING
Peter Lamar
VP Cloud & Developer Relations
COMPLEXITY
COMPLEXITY
It's easy to build software that becomes complex
- Fight complexity one line at a time
COMPLEXITY
However, even the best systems
grow in complexity over time
COMPLEXITY
Systems will grow in
complexity, which will raise
the likelihood of failure
unless steps are taken to
mitigate and manage
complexity
COMPLEXITY
Q. Do more professional, larger
and better funded teams have
fewer failures? i.e. Imposter
syndrome?
A. Let's find out at
https://outage.report
(*Screenshot 8/22/19)
COMPLEXITY
Q. Wow, all that failure must be
expensive! Is there any way we
can be more confident we are not
going to fail?
A. Let's run controlled
experiments where we try to
identify weaknesses before they
crash the system.
COMPLEXITY
Sure, but 'controlled experiments' is a
lame name. Let's call it 'Chaos
Engineering'! Way cooler.
COMPLEXITY
Chaos engineering is like a vaccine, which injects
a small amount of virus to build immunity
CHAOS
ENGINEERING
The harder it is to disrupt the steady state, the more
confidence we have in the behavior of the system. If
a weakness is uncovered, we now have a target for
improvement before that behavior manifests in the
system at large.
-https://principlesofchaos.org
CHAOS ENGINEERING - EXPECTED BENEFITS
Less downtime, better user experience
Fewer alarms and alerts (i.e. less burnout) for Operations/SRE/Development teams
More productivity from fewer unplanned outages
Knowledge of the application spread across the team
CHAOS ENGINEERING
1. Define the normal/steady state of the system (monitor system and business metrics)
2. Hypothesize that the steady state will continue in both the control and experiment groups
3. Pseudo-randomly inject faults (kill containers, degrade the network, etc.) simulating real-world events
4. Try to disprove the hypothesis by looking for differences between the control and experiment groups
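The steps above can be sketched as a small simulation. This is only an illustration, not the tooling used in the talk: the two toy services, the success-rate metric, the 50% fault probability, and the 2% tolerance are all assumptions made up for the example.

```python
import random


def success_rate(handler, fault_probability, requests, rng):
    """Steady-state metric: fraction of requests that succeed."""
    ok = 0
    for _ in range(requests):
        # Pseudo-randomly decide whether this request gets a fault injected.
        fault_active = rng.random() < fault_probability
        if handler(fault_active):
            ok += 1
    return ok / requests


def run_experiment(handler, requests=1000, fault_probability=0.5,
                   tolerance=0.02, seed=42):
    """Compare a control group (no faults) with an experiment group (faults)."""
    rng = random.Random(seed)
    control = success_rate(handler, 0.0, requests, rng)
    experiment = success_rate(handler, fault_probability, requests, rng)
    return {
        "control": control,
        "experiment": experiment,
        # Hypothesis: steady state is the same in both groups.
        "hypothesis_holds": abs(control - experiment) <= tolerance,
    }


def resilient_service(fault_active):
    # Hypothetical service whose fallback absorbs the fault.
    return True


def fragile_service(fault_active):
    # Hypothetical service that fails whenever the fault is active.
    return not fault_active


if __name__ == "__main__":
    print("resilient:", run_experiment(resilient_service))
    print("fragile:  ", run_experiment(fragile_service))
```

A disproved hypothesis (as with the fragile service here) is the useful result: it points at a weakness to fix before it shows up at large.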
GAMEDAY EXAMPLE - PER SERVICE
SAMPLE:
Failure Scenario: Application server latency
Experiment Scoping: Latency 100ms - 1000ms, 2500ms
Signals/Metrics: Service availability, on-call paging
Abort Conditions: SLA breach, RPS threshold reached
Expected Outcome: Application should still be available, but slower
Actual Outcome: Total app failure
Bugs: Fallback did not occur
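One way to make the abort conditions in a gameday plan like this explicit is to check them after every latency step. The sketch below is an assumption-heavy illustration: the threshold values, the reading of "RPS threshold" as a minimum requests-per-second floor, and the `read_metrics` callable are all hypothetical and stand in for a real monitoring system.

```python
from dataclasses import dataclass


@dataclass
class AbortConditions:
    min_availability: float  # hypothetical SLA floor, e.g. 0.999
    min_rps: float           # hypothetical requests-per-second floor


def abort_reasons(availability, rps, limits):
    """Return the list of abort conditions that are currently met."""
    reasons = []
    if availability < limits.min_availability:
        reasons.append("SLA breach")
    if rps < limits.min_rps:
        reasons.append("RPS threshold reached")
    return reasons


def run_gameday(latency_steps_ms, read_metrics, limits):
    """Step through injected latencies; stop at the first abort condition."""
    log = []
    for latency_ms in latency_steps_ms:
        # read_metrics is a stand-in for querying real monitoring.
        availability, rps = read_metrics(latency_ms)
        reasons = abort_reasons(availability, rps, limits)
        log.append((latency_ms, reasons))
        if reasons:
            break  # abort: do not escalate the experiment further
    return log


def fake_metrics(latency_ms):
    # Made-up numbers for illustration only.
    table = {100: (0.9995, 500.0), 1000: (0.9992, 450.0), 2500: (0.95, 80.0)}
    return table[latency_ms]


if __name__ == "__main__":
    limits = AbortConditions(min_availability=0.999, min_rps=100.0)
    for step in run_gameday([100, 1000, 2500], fake_metrics, limits):
        print(step)
```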
CNCF LANDSCAPE
LIVE DEMO - SIMPLE CONTAINER EXPERIMENTS
Container Chaos experiment with Pumba
* Network delays!
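The demo itself is not captured in the slides. As a rough idea of what a Pumba network-delay experiment can look like, the sketch below only builds the command line and prints it. The container name `my-app` and the delay values are assumptions, and the flags should be checked against `pumba netem delay --help` for the installed version before running anything.

```python
import shlex


def pumba_delay_command(container, delay_ms=3000, duration="1m", jitter_ms=0):
    """Build a `pumba netem ... delay` command as an argument list."""
    cmd = ["pumba", "netem", "--duration", duration,
           "delay", "--time", str(delay_ms)]
    if jitter_ms:
        cmd += ["--jitter", str(jitter_ms)]
    cmd.append(container)  # hypothetical container name
    return cmd


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    # To actually run it: subprocess.run(cmd, check=True)
    print(shlex.join(pumba_delay_command("my-app")))
```

The `--duration` flag doubles as a safety net: the delay is removed when the time is up, which keeps the experiment bounded.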
LIVE DEMO - ADVANCED CONTAINER EXPERIMENTS
More advanced Chaos with ChaosBlade
* More Network Delays!
CHAOS ENGINEERING - SIMPLIFIED REVIEW
01 Have a hypothesis and identify control and experimental groups
02 Use real-world events & limit scope
03 Make it as real as possible, ideally Production
04 Look for differences in steady state between control and experimental groups
EXAMPLE EXPERIMENTS
Test slowdown of key services to identify dependency slowdown/intermittent failure
Test failure of an Availability Zone/Region (rack if on-premises) to identify group component failover resiliency
Test failure of a stack component to identify resiliency of individual components
Test cluster failure with Litmus
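The first experiment type, slowing down a key dependency, can be tried in miniature before touching real infrastructure. The sketch below is a hypothetical example: `slow_dependency`, the timeout values, and the "cached data" fallback are invented for illustration. It checks whether a caller falls back within its timeout when the dependency is artificially slowed.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_dependency(delay_s):
    """Stand-in for a downstream service with injected latency."""
    time.sleep(delay_s)
    return "live data"


def call_with_fallback(dependency, delay_s, timeout_s, fallback="cached data"):
    """Call the dependency, returning the fallback if it exceeds the timeout."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(dependency, delay_s)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback
    finally:
        # Do not block the caller on the slow call finishing.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(call_with_fallback(slow_dependency, delay_s=0.0, timeout_s=0.5))
    print(call_with_fallback(slow_dependency, delay_s=0.3, timeout_s=0.05))
```

If the second call hung or raised instead of returning the fallback, that would be the same class of bug as a fallback that never triggers under real latency.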
ADDITIONAL TIPS
Start small
Production if possible, or close to it
Minimize blast radius
Have an emergency stop
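Two of these tips, minimizing blast radius and having an emergency stop, translate fairly directly into code. The sketch below is one possible shape, with hypothetical instance names and a made-up 25% fraction: faults go to a small pseudo-random subset of instances, and a shared stop flag is checked before every injection.

```python
import random
import threading


def pick_targets(instances, fraction, seed=None):
    """Limit blast radius: choose a small pseudo-random subset of instances."""
    rng = random.Random(seed)
    count = max(1, int(len(instances) * fraction))
    return rng.sample(instances, count)


def run_with_emergency_stop(targets, inject, stop_event):
    """Inject faults one target at a time, checking the stop flag each time."""
    hit = []
    for target in targets:
        if stop_event.is_set():
            break  # emergency stop: leave remaining targets untouched
        inject(target)
        hit.append(target)
    return hit


if __name__ == "__main__":
    instances = [f"app-{i}" for i in range(20)]  # hypothetical instance names
    targets = pick_targets(instances, fraction=0.25, seed=1)
    stop = threading.Event()
    affected = run_with_emergency_stop(
        targets, lambda t: print("injecting fault into", t), stop)
    print(len(affected), "of", len(instances), "instances affected")
```

In practice the stop flag would be set by an operator or by an automated abort condition rather than from inside the script.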
QUESTIONS?
