Jorge Salamero Sanz <jsalamero@serverdensity.com>
IncontroDevOps 1 April 2016
War Games - Flight training for DevOps
 Infrastructure automation
 Configuration automation
 Continuous testing
 Continuous deployment / delivery
 Monitoring
 Logs, error handling
 Feedback
 Human Ops
DevOps lifecycle
 Humans are part of any system
 Initial design, ongoing improvements
 Maintenance
 Upgrades
 Issues, Incident response
Humans in DevOps
 System issues = error rates + SLA + ...
 Human issues = alerts out of hours + interruptions + ...
 System issues = Human issues
Human issues = system issues
 System health impacts human health
 Human health impacts system health
Humans impact systems
 Downtime = loss of users, reputation, revenue
 Downtime caused by unreliable systems
 Unhealthy teams reduce reliability
 Unhealthy teams = loss of users, reputation, revenue
Humans impact business
 Slip
 Lapse
 Mistake
 Violation
 (Always, again, again)
Human risk
 Prepare and practice
 Respond
 Postmortem
Expect downtime
Real example
(small war story, won't be long)
 Power failure to half of our servers
 Automated failover unavailable
(known failure condition)
 Manual DNS switch required
 Expected impact: 20 min
 Actual impact: 43 min
Incident example
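
The gap between expected and actual impact can be made explicit with a little arithmetic. The sketch below uses only the two figures from the slide (20 min and 43 min); the function name `overrun` is made up for illustration.

```python
from datetime import timedelta

# Figures taken from the incident slide.
expected_impact = timedelta(minutes=20)
actual_impact = timedelta(minutes=43)


def overrun(expected, actual):
    """Return (extra downtime, actual/expected ratio) for one incident."""
    extra = actual - expected
    ratio = actual / expected  # dividing two timedeltas yields a float
    return extra, ratio


extra, ratio = overrun(expected_impact, actual_impact)
print(f"Overrun: {extra} ({ratio:.2f}x the expected impact)")
```

Tracking this ratio per incident is one possible way to see whether practice is closing the gap over time.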
Flight training for DevOps & HumanOps - IncontroDevOps 2016
Lesson learned?
 Unfamiliarity with the process
 Pressure of time sensitive event
(panic effect)
 Escalation introduces delays
The Human Factor
Handling the Human factor
 First responder, acknowledge alert
 Load incident response checklist
 Log into #ops-war-room in Slack
 Log incident into JIRA
 Begin investigation
General response process
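
One way to keep responders from skipping steps under pressure is to encode the process above as an ordered checklist that refuses out-of-order completion. This is a minimal sketch, not the tooling used in the talk: the `ChecklistRun` class and its methods are hypothetical, and only the step names come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Steps copied from the "General response process" slide.
RESPONSE_STEPS = [
    "First responder, acknowledge alert",
    "Load incident response checklist",
    "Log into #ops-war-room in Slack",
    "Log incident into JIRA",
    "Begin investigation",
]


@dataclass
class ChecklistRun:
    """Hypothetical helper: walks one responder through the steps in order."""
    steps: list
    done: list = field(default_factory=list)  # (step, completion time) pairs

    def next_step(self):
        """Return the next pending step, or None when the checklist is finished."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def complete(self, step, now=None):
        """Mark a step done; reject anything other than the next pending step."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.done.append((step, now or datetime.now(timezone.utc)))


run = ChecklistRun(RESPONSE_STEPS)
run.complete("First responder, acknowledge alert")
print(run.next_step())
```

The timestamps stored in `done` double as raw material for the postmortem.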
1. Extended use of checklists
Documented procedures
 The limits of human memory and attention
 Complexity
 Stress and fatigue
 Ego
 Pilots, doctors, divers:
Bruce Willis Ruins All Films
(BCD, weights, releases, air, final)
Pre-flight checklists
1. Extended use of checklists
2. Not to follow blindly, use knowledge and experience
3. Independent system
4. Searchable
5. List of known issues and documented workarounds/fixes
Documented procedures
 Replica environment
 or mock command line
 Record actions and timing
 Multiple failures
 Unexpected results
Realistic scenarios: War Games
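
"Record actions and timing" could be as simple as a timestamped log kept during the drill. The sketch below is an assumption about how that might look, not the speaker's tooling: `DrillLog`, `record` and `gaps` are hypothetical names, and the clock is injectable so the log can be replayed or tested.

```python
import time


class DrillLog:
    """Hypothetical war-game log: stores each action with its elapsed time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()
        self.entries = []  # (seconds since drill start, action) pairs

    def record(self, action):
        """Append an action stamped with seconds elapsed since the drill began."""
        self.entries.append((self._clock() - self._start, action))

    def gaps(self):
        """Return the seconds between consecutive recorded actions."""
        times = [t for t, _ in self.entries]
        return [later - earlier for earlier, later in zip(times, times[1:])]


log = DrillLog()
log.record("acknowledged alert")
log.record("switched DNS manually")
print(log.gaps())
```

Long gaps in the review afterwards tend to point at the steps where the team hesitated or had to escalate.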
Results
 Team and individual test of response
 Run real commands
 Training the people
 Training the procedures
 Training the tools
Results
 Increase confidence
 Reduce panic
 Better coordination
 Trust relationships
 Improves time to resolution
Human results
 Review
 Suggestions for improvements
 Do it again
 Scenario evolves
 People forget
loop(): review and repeat
 On call rotation design
 Alert prioritization
 Notification optimization
What else?
Human Ops
1. Humans are part of the system
2. Humans impact systems
3. Humans impact business
4. Human issues count as system issues
Human Ops principles
meetup.com/humanops-london/
Human Ops Meetup
www.CloudStatusApp.com
Jorge Salamero Sanz
Chief Developer Advocate
@bencerillo
@serverdensity
our DevOps stories
blog.serverdensity.com
