- Site reliability engineering basics
- Release engineering
- Change management
- Incident management
- Distributed design
Skill Level: Advanced
- There's that 2 AM moment when you realize you've made some bad life decisions. - You roll over and you hear this ringing in your ears. - After questioning how it got to this point in the relationship, with dread, you look at your phone. - Oh no. Not again. Two nodes of the cluster are down. And the others look to be failing as well. - You're not surprised, though. This is the fourth night in a row this has been going on. - Now, if this story sounds too real, or you want to make sure your life doesn't end up like this, then this is the course for you.
- Howdy. I'm Ernest Mueller. - Hi. And I'm James Wickett. Welcome to our course on another DevOps foundation, site reliability engineering. - We met while implementing DevOps in a large enterprise. Together we've run the DevOps Days Austin Conference and blog at theagileadmin.com. - I'm the head of research at Signal Sciences, which provides application security defense solutions for web applications, microservices, and APIs. At Signal Sciences, we've implemented DevOps and SRE practices from the very beginning.
- And I'm director of engineering operations at AlienVault, a maker of cybersecurity management and threat intelligence solutions, where I optimize our infrastructure and software delivery pipeline. - Site reliability engineering, or SRE, is central to delivering software. - Since the term SRE was coined by Google, it's grown in popularity. While SRE and DevOps aren't exactly the same, they fit together as complementary approaches. - [James] In this course, you'll learn the basics of reliability engineering, including self-service automation and dealing with releases.
- [Ernest] And handling crisis situations through incident response. - [James] We also cover how to perform post-incident evaluations. - [Ernest] SRE's core tenet is reliability. And we dissect how to define SLAs and SLOs, as well as how to handle performance engineering and troubleshooting. - [James] We discuss adding adversity and chaos to your system, as well as how to design for distributed systems. - [Ernest] And finally, we'll explore concepts on scaling systems and your team. - All right. Let's get started.