This document discusses the human processes Google uses to increase site reliability. It describes the roles of software engineers (SWEs) and site reliability engineers (SREs): SWEs focus on designing, coding, and provisioning systems, while SREs focus on preventing outages in the short term through on-call response and system administration. It outlines several key human processes used at Google to spread knowledge and improve reliability: design reviews, code reviews, knowledge externalization, manual response protocols, production checklists, and post-mortem analyses of outages. The overall goal is to establish reliable systems and processes that scale through widespread knowledge sharing and predictability.