Communication is the key to collaboration and solving problems when the stakes are high.
We talk about the importance of communication, but sometimes that seems like a fuzzy topic. So in this section, we're going to home in on two very specific aspects of communication as it relates to DevOps: blameless postmortems and transparent uptime. Now, these two activities require even better communication skills than normal, as they often occur when the stakes are really high.
Let's get right to the details of running a blameless postmortem. Postmortems should be run within 24 to 48 hours after customers are affected by a major outage. This helps keep everything fresh. In preparing for the postmortem, it's good to have the team collaborate and build up a timeline, but assign one person to run the meeting. Oh yeah, and you're going to want to avoid having the person who caused the outage be the one who runs the meeting. I mean, otherwise it'll just get kind of awkward.
When starting the process, I like to ask all the team members to fill in as much of the timeline as they can before the meeting, with the goal of having objective events in the order that they occurred. I suggest putting everything in UTC time. And I mean everything. We want to correlate human events like, "The CEO called and asked us why the site was down," with monitoring and logging events like, "We saw a spike in HTTP 500 events."
When starting the postmortem meeting, it's important to lay down the rules of order. The goal here is to cover the entire process, and you should let the team know that we aren't here to assign blame to a certain person. Instead, we're trying to prevent this incident or a similar outage from occurring in the future. You want to let everybody know this is a blameless exercise. It's important to say this at every postmortem to set everyone's expectations.
Start by making sure the description of the incident is discussed and agreed upon, as well as any supposed root cause. You want to allow any inputs or alternate theories of the root cause.
Also, make sure that everyone's on the same page about what fixes were made to stabilize the situation. You know, it can be really tempting to stop here. But this is where postmortems turn into learning. Review the entire timeline and make sure that everyone has put all the needed items into it. Be sure to also discuss how customers were affected by the incident. And at the end, you want to ask: how can we detect this sooner in the future? Failures are unavoidable, and we have to optimize for detection and recovery rather than just trying to prevent failures.
The last step is to create tickets or action items. These can range from new monitors and tests to actual fixes. Record these and get everybody in the meeting to publicly agree on a deadline.
Okay, that more or less wraps up blameless postmortems. Let's shift gears and talk about the opposite of postmortems, transparent uptime. In a world of SaaS products, when your site or your app or your service is down, it has a huge impact on the users. Transparent uptime means that as we interact with our customers, we communicate with them as much as possible during an outage.
In the Transparent Uptime blog, Lenny Rachitsky gives four points he recommends as prerequisites for doing transparent uptime.
First, admit failure. You want to really own it. I mean, customers aren't surprised that you have downtime. And really, it's okay to have failure, but it's not okay to hide it.
Second, sound like a human. It's really common to get corporate-sounding responses that aren't real apologies. Talk real to your real customers and avoid using any doublespeak when apologizing.
Third, have a communication channel. Make sure your customers know about it. Make sure it's updated. And most importantly, make sure it's out of band. You want to use a different cloud provider or host so that it doesn't go down when the rest of your site does. That could be kind of bad too.
All right, fourth.
Above all else, be authentic. This can be hard to do in the heat of the moment when there's an outage. It's good practice to take one individual out of the team to have them manage communication during the outage. Their primary job is to figure out who the stakeholders are and be authentic with them about the issues at hand.
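The timeline advice from the postmortem discussion (put everything in UTC, then correlate human events with monitoring events) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` class, the `to_utc`, `build_timeline`, and `format_timeline` helpers, and the sample events and dates are all hypothetical and do not come from the course.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    """One timeline entry (hypothetical structure for illustration)."""
    when: datetime      # timezone-aware timestamp, already in UTC
    source: str         # e.g. "human" or "monitoring"
    description: str


def to_utc(local_text: str, utc_offset_hours: float) -> datetime:
    """Convert a 'YYYY-MM-DD HH:MM' local time with a known offset to UTC."""
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    naive = datetime.strptime(local_text, "%Y-%m-%d %H:%M")
    return naive.replace(tzinfo=local_zone).astimezone(timezone.utc)


def build_timeline(events):
    """Order events from all sources by their UTC timestamp."""
    return sorted(events, key=lambda e: e.when)


def format_timeline(events):
    """Render the merged timeline as one line per event, marked with Z for UTC."""
    return [
        f"{e.when:%Y-%m-%d %H:%M}Z [{e.source}] {e.description}"
        for e in build_timeline(events)
    ]


if __name__ == "__main__":
    # Made-up sample data: a phone call noted in local time (UTC-6)
    # and a monitoring alert already recorded in UTC.
    sample = [
        Event(to_utc("2020-01-15 08:20", -6), "human",
              "CEO called asking why the site was down"),
        Event(to_utc("2020-01-15 14:05", 0), "monitoring",
              "Spike in HTTP 500 responses"),
    ]
    for line in format_timeline(sample):
        print(line)
```

Converting each entry to UTC as it is recorded means that notes taken in different time zones sort correctly against logs, which is the point of the "everything in UTC" rule.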
Updated 10/28/2020
Released 11/22/2016
In this course, well-known DevOps practitioners Ernest Mueller and James Wickett provide an overview of the DevOps movement, focusing on the core value of CAMS (culture, automation, measurement, and sharing). They cover the various methodologies and tools an organization can adopt to transition into DevOps, looking at both agile and lean project management principles and how old-school principles like ITIL, ITSM, and SDLC fit within DevOps.
The course concludes with a discussion of the three main tenets of DevOps (infrastructure automation, continuous delivery, and reliability engineering) as well as some additional resources and a brief look into what the future holds as organizations transition from the cloud to serverless architectures.
- What is DevOps?
- Understanding DevOps core values and principles
- Choosing DevOps tools
- Creating a positive DevOps culture
- Understanding agile and lean
- Building a continuous delivery pipeline
- Building reliable systems
- Looking into the future of DevOps
Video: Use your words