From the course: DevOps Foundations: Effective Postmortems

Blamelessness

Blamelessness

- [Instructor] We learned from resilience engineering that people are part of what generates safety, not a threat to it. A 1999 US Institute of Medicine report called "To Err is Human: Building a Safer Health System" investigated the large number of deaths per year from preventable medical errors in the United States. But the report doesn't just blame the healthcare practitioners making the errors and move on. It states in its preface: Human beings, in all lines of work, make errors. Errors can be prevented by designing systems that make it hard for people to do the wrong thing and easy for people to do the right thing. It recognizes that creating a resilient system, not blaming those who are around when there's a problem, is the key to fixing a safety issue. Professor Sidney Dekker, in his book "The Field Guide to Understanding 'Human Error,'" explains that people take actions they believe are reasonable at the time. Judging those decisions or actions as faulty in retrospect, if your thought process stops there, deliberately allows the conditions that led a reasonable person to make a major error to persist. Research has found that postmortem methodologies centered on individual blame end up fostering inertia and do not effectively eliminate risk, while those based on analyzing the entire organization or system do generate improvement. So how does this inform how we do our postmortems? The operations team at Etsy first brought blameless postmortems into the mainstream of DevOps. Etsy's own John Allspaw argued that the best method for their organization was for all the engineers who contributed to an incident to be able to give a full account of all their thoughts, expectations, assumptions, and actions without fear of punishment or retribution. Because basically, if postmortems are witch hunts for blame, then engineers will clam up and not share vital information about the incident to avoid being reprimanded.
This prevents the organization from learning and makes it more likely that the same or similar problems will recur. As a result, blamelessness is a keystone of how DevOps teams conduct postmortems. For example, the Google Site Reliability Engineering book describes Google's postmortem philosophy as rooted in blamelessness. It says: For a postmortem to be truly blameless, it must focus on identifying the contributing causes of the incident without indicting any individual or team for bad or inappropriate behavior. A blamelessly written postmortem assumes that everyone involved in an incident had good intentions and did the right thing with the information they had. Google's SRE Workbook goes on to say that a truly blameless postmortem culture results in more reliable systems. Remember, postmortems are a valuable opportunity. These mistakes and problems give us insight into how our systems really work and provide us a chance to improve them. You can't make people perfect, but you can help them work more safely by tuning your systems and processes to create safe behavior. You can make a postmortem blameless by focusing on how an incident could have happened instead of why a particular person made a specific mistake. Focus on how they normally do their jobs, what they see as normal, and how they experienced and interpreted things during the incident. The details underneath that are what you can change to improve your system's resilience. As we move forward into the details of conducting postmortems, we'll assume a blameless approach to our analysis so that we fix the real underlying problems and don't just offer up a scapegoat in the name of accountability and let an unsafe system continue.