Embracing Failure
(not my life story)
Setting the Mood
Understand that failures WILL happen
Failures are not binary
Impact determines importance
Deadlines for fixes are variable
Terminology
Website
Production
Downtime
Monitor Failures
What is Monitoring?
Graphs. Everywhere.
Alerts on failures
phone calls
texts
Answers: Are we failing?
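A minimal sketch of how "failures are not binary" could feed into alerting: the error rate over a window decides whether nobody is paged, a text goes out, or a phone call is made. The `Window` type, the `choose_alert` function, and both thresholds are illustrative assumptions, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed during one monitoring interval."""
    requests: int
    errors: int


def error_rate(window: Window) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if window.requests == 0:
        return 0.0
    return window.errors / window.requests


def choose_alert(window: Window, text_at: float = 0.01, call_at: float = 0.05) -> str:
    """Map impact to an alert channel: 'none', 'text', or 'phone_call'.

    The thresholds are hypothetical defaults, not recommendations.
    """
    rate = error_rate(window)
    if rate >= call_at:
        return "phone_call"
    if rate >= text_at:
        return "text"
    return "none"


if __name__ == "__main__":
    print(choose_alert(Window(requests=1000, errors=2)))   # low impact
    print(choose_alert(Window(requests=1000, errors=80)))  # high impact
```

In practice the counts would come from whatever feeds the graphs, and the thresholds would be tuned per service.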
healthcare.gov
Know when you're down before CNN does
Postmortems
(Fool me once, shame on you. Fool me twice, shame on me.)
Postmortems
1. Reconstruct the factual timeline
2. Root cause analysis
3. Remediation items
Postmortems
Why did we fail?
Blameless
Moderated
Gamedays
(You wouldn't wing a talk. Don't wing a hot fix.)
Gameday
Best defense is a good offense
Simulate possible failures
Do it in production
kill -9
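As a rough illustration of what `kill -9` does to a process, the sketch below starts a throwaway child process and sends it SIGKILL. It assumes a POSIX system; the sleeping child is a stand-in for a real service, and `kill_dash_nine` is a hypothetical helper name.

```python
import signal
import subprocess
import sys


def kill_dash_nine(proc: subprocess.Popen) -> int:
    """Send SIGKILL (the signal behind `kill -9`) and return the exit code."""
    proc.send_signal(signal.SIGKILL)
    return proc.wait(timeout=10)


if __name__ == "__main__":
    # Stand-in for a real service: a child process that just sleeps.
    victim = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    code = kill_dash_nine(victim)
    # On POSIX, a negative return code means "terminated by that signal".
    print("exit code:", code)
```

SIGKILL cannot be caught, so the process gets no chance to clean up. That is the kind of abrupt failure a gameday is meant to rehearse.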
1. Draw a block diagram
2. Cut every connection
3. Watch the fireworks
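The three steps above can be rehearsed on paper before touching real systems. The sketch below treats the block diagram as a small graph and lists, for each connection, which components the entry point can no longer reach once that connection is cut. The `DIAGRAM` contents and the function names are hypothetical. A real gameday would cut the actual connections and watch the monitoring.

```python
from collections import deque

# Hypothetical block diagram: each key calls the components in its list.
DIAGRAM = {
    "web": ["api"],
    "api": ["db", "cache"],
    "db": [],
    "cache": [],
}


def reachable(diagram, start, cut=None):
    """Components reachable from `start`, ignoring the one `cut` edge."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in diagram[node]:
            if (node, dep) == cut or dep in seen:
                continue
            seen.add(dep)
            queue.append(dep)
    return seen


def gameday_plan(diagram, entry):
    """For every connection, list what the entry point loses when it is cut."""
    everything = reachable(diagram, entry)
    plan = {}
    for node, deps in diagram.items():
        for dep in deps:
            lost = everything - reachable(diagram, entry, cut=(node, dep))
            plan[(node, dep)] = sorted(lost)
    return plan


if __name__ == "__main__":
    for edge, lost in gameday_plan(DIAGRAM, "web").items():
        print(edge, "->", lost)
```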
SafeMachine
(like a state machine, but safer)
Try, Try, Try again
What if we could just retry failures?
Side effects are the root of all evil
Safe failures vs Unsafe failures
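One way to read the safe/unsafe distinction: a failure is safe to retry when the action has no side effects, and unsafe otherwise. The sketch below is a guess at that idea, not the talk's implementation. `run_with_retry`, `UnsafeFailure`, and the `safe` flag are hypothetical names.

```python
import time


class UnsafeFailure(Exception):
    """An action with side effects failed and must not be blindly retried."""


def run_with_retry(action, safe, attempts=3, delay=0.0):
    """Call `action`; retry on failure only if it is marked safe."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as err:
            if not safe:
                # Side effects may have partially happened: stop and escalate.
                raise UnsafeFailure(f"unsafe action failed on attempt {attempt}") from err
            last_error = err
            time.sleep(delay)
    # Every attempt failed; surface the most recent error.
    raise last_error
```

A pure computation would be called with `safe=True`. Something like charging a card or sending an email would be called with `safe=False`, so a human or a more careful recovery path decides what happens next.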
What's in a SafeMachine?
Actions: compute, upload, record successful
States: START, Computed File, Uploaded File, END
Status names: initialize_succeeded, initialize_failed, initialize_inprogress, computed_succeeded
[Diagram: START, then actions a1, a2, a3, then END]
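The slides name the parts of a SafeMachine (actions, states, and statuses such as `initialize_inprogress`) but not its code. The sketch below is one guess at the shape: run actions in order and journal a status before and after each one. The class body and the `<action>_<status>` journal format are assumptions for illustration.

```python
class SafeMachine:
    """Runs named actions in order and journals a status around each one."""

    def __init__(self, actions):
        # `actions` is a list of (name, callable) pairs, e.g. ("compute", fn).
        self.actions = actions
        self.journal = ["START"]

    def run(self):
        """Return True if every action succeeded, False at the first failure."""
        for name, action in self.actions:
            self.journal.append(f"{name}_inprogress")
            try:
                action()
            except Exception:
                self.journal.append(f"{name}_failed")
                return False
            self.journal.append(f"{name}_succeeded")
        self.journal.append("END")
        return True


if __name__ == "__main__":
    machine = SafeMachine([("compute", lambda: None), ("upload", lambda: None)])
    machine.run()
    print(machine.journal)
```

Writing the `_inprogress` entry before the action runs means the journal can show how far the machine got if it stops partway through.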
The Pipeline
[Diagram: START -> Computed File -> Uploaded File -> END, with the three actions labeled Safe, Unsafe, Safe]
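A sketch of how the Safe / Unsafe / Safe labels could drive recovery after a crash. It assumes the three labels belong to actions named compute, upload, and record, in that order, and that each action leaves `<name>_inprogress` and `<name>_succeeded` entries in a journal. Both are assumptions for illustration, as is the `next_step` function.

```python
# Safety labels for the three pipeline actions, in order.
# Assumption: compute is safe, upload is unsafe, record is safe.
PIPELINE = [("compute", True), ("upload", False), ("record", True)]


def next_step(journal, pipeline=PIPELINE):
    """Decide what to do after a crash, given the recorded statuses."""
    for name, safe in pipeline:
        if f"{name}_succeeded" in journal:
            continue  # already done, skip it on resume
        started = f"{name}_inprogress" in journal
        if started and not safe:
            # Side effects may have happened; do not blindly repeat.
            return ("investigate", name)
        return ("run", name)  # never started, or safe to repeat
    return ("done", None)


if __name__ == "__main__":
    crashed = ["START", "compute_inprogress", "compute_succeeded", "upload_inprogress"]
    print(next_step(crashed))
```

Safe actions are rerun without ceremony. An unsafe action that was interrupted needs a check of what actually happened before anything is repeated.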
Embracing Failure
Monitor
Postmortems
Gamedays - you wouldn't wing a talk
SafeMachine
@chriswu_
Additional resources
- Postmortems - https://codeascraft.com/2012/05/22/blameless-postmortems/
- Gamedays - https://stripe.com/blog/game-day-exercises-at-stripe
  - The links at the bottom of this post are also great
- Error Tracking - https://getsentry.com/welcome/
