4. Humans are part of any system
Initial design, ongoing improvements
Maintenance
Upgrades
Issues, Incident response
Humans in DevOps
5. System issues = error rates + SLA + ...
Human issues = alerts out of hours + interruptions + ...
System issues = Human issues
Human issues = System issues
6. System health impacts human health
Human health impacts system health
Humans impact systems
7. Downtime = loss of users, reputation, revenue
Downtime caused by unreliable systems
Unhealthy teams reduce reliability
Unhealthy teams = loss of users, reputation, revenue
Humans impact business
8. Slip
Lapse
Mistake
Violation
(Always, again, again)
Human risk
9. Prepare and practice
Respond
Postmortem
Expect downtime
11. Power failure to half of our servers
Automated failover unavailable
(known failure condition)
Manual DNS switch required
Expected impact: 20 min
Actual impact: 43 min
Incident example
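The gap between expected and actual impact in this example can be worked out directly. A minimal sketch: the two impact figures come from the slide, while the 30-day reporting window is an assumption added only for illustration.

```python
# Figures from the incident example above.
EXPECTED_MIN = 20
ACTUAL_MIN = 43

# Assumption: a 30-day reporting window, chosen only for illustration.
WINDOW_MIN = 30 * 24 * 60  # 43,200 minutes

overrun_min = ACTUAL_MIN - EXPECTED_MIN
overrun_ratio = ACTUAL_MIN / EXPECTED_MIN
availability_pct = 100 * (1 - ACTUAL_MIN / WINDOW_MIN)

print(f"Overrun: {overrun_min} min ({overrun_ratio:.2f}x the estimate)")
print(f"Availability over the window: {availability_pct:.2f}%")
```

The manual step took more than twice the planned time, which is the kind of gap that preparation and practice are meant to close.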
16. First responder, acknowledge alert
Load incident response checklist
Log into #ops-war-room in Slack
Log incident into JIRA
Begin investigation
General response process
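The response steps above could be tracked with a small checklist helper. This is a sketch only: the `Checklist` class and its method names are hypothetical, and it deliberately does not force an order, so the responder can still apply judgement.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Checklist:
    """A checklist that records when each step was completed."""
    steps: list
    completed: dict = field(default_factory=dict)

    def complete(self, step):
        """Mark a known step as done and timestamp it (UTC)."""
        if step not in self.steps:
            raise ValueError(f"Unknown step: {step!r}")
        self.completed[step] = datetime.now(timezone.utc)

    def remaining(self):
        """Return outstanding steps in their original order."""
        return [s for s in self.steps if s not in self.completed]


# Steps taken from the general response process above.
response = Checklist(steps=[
    "Acknowledge alert",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
])

response.complete("Acknowledge alert")
for step in response.remaining():
    print("TODO:", step)
```

The timestamps give the postmortem a record of when each step actually happened.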
18. The limits of human memory and attention
Complexity
Stress and fatigue
Ego
Pilots, doctors, divers:
Bruce Willis Ruins All Films
(BCD, weights, releases, air, final)
Pre-flight checklists
19. 1. Extended use of checklists
2. Not to be followed blindly; use knowledge and experience
3. Independent system
4. Searchable
5. List of known issues and documented workarounds/fixes
Documented procedures
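A searchable list of known issues can be very simple. The sketch below assumes a hypothetical `KnownIssue` record and `search` helper; the single entry is based on the incident example in this deck, and its field wording is illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownIssue:
    title: str
    symptoms: str
    workaround: str


def search(issues, query):
    """Return issues whose text contains every word of the query (case-insensitive)."""
    words = query.lower().split()
    results = []
    for issue in issues:
        text = f"{issue.title} {issue.symptoms} {issue.workaround}".lower()
        if all(word in text for word in words):
            results.append(issue)
    return results


# Entry based on the incident example in this deck; wording is illustrative.
KNOWN_ISSUES = [
    KnownIssue(
        title="Power failure to half of our servers",
        symptoms="Automated failover unavailable",
        workaround="Manual DNS switch",
    ),
]

for hit in search(KNOWN_ISSUES, "failover dns"):
    print(f"{hit.title}: {hit.workaround}")
```

In keeping with the "independent system" point, such a list would presumably be kept somewhere that stays reachable when the main systems are down.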
20. Replica environment or mock command line
Record actions and timing
Multiple failures
Unexpected results
Realistic scenarios: War Games
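Recording actions and timing during a drill could look like the sketch below. `DrillLog` is a hypothetical helper, not part of any tool named in the slides; the clock is injectable so a drill can be replayed or tested without waiting.

```python
import time


class DrillLog:
    """Records each action taken during a drill with its elapsed time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.entries = []  # list of (elapsed_seconds, action)

    def record(self, action):
        """Append an action with the seconds elapsed since the drill began."""
        elapsed = self._clock() - self._start
        self.entries.append((elapsed, action))

    def total_minutes(self):
        """Minutes from drill start to the last recorded action."""
        if not self.entries:
            return 0.0
        return self.entries[-1][0] / 60

    def report(self, expected_minutes):
        """Return a timeline plus an expected-versus-actual summary line."""
        lines = [f"{elapsed / 60:6.1f} min  {action}" for elapsed, action in self.entries]
        lines.append(
            f"Expected {expected_minutes} min, actual {self.total_minutes():.1f} min"
        )
        return "\n".join(lines)


log = DrillLog()
log.record("Acknowledge alert")
log.record("Manual DNS switch")
print(log.report(expected_minutes=20))
```

Comparing the recorded total with the expected impact after each war game shows whether the estimates in the runbook are realistic.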