Monitoring, troubleshooting, and metrics are a vital space in your tooling strategy.
- Welcome back to your SRE toolchain.
- This is an area with a huge amount of tooling, because it covers most of the historical key areas of operations.
- So, instead of talking about every tool in this space, we're going to try to cover some that are representative of DevOps, along with the best practices and how to use them.
- Usually the most successful tools are the tools that multiple groups agree to use. Look for sharing and integrations as important features of any tool.
- Okay, let's start with monitoring. This is the bread and butter of any operations team. There are solutions like Nagios or Zabbix that have been around for more than a decade, but what's new in the monitoring world?
- Well, first there's the rise of SaaS. Many new monitoring offerings are provided as a service, from simple endpoint monitoring like Pingdom, to system and metric monitoring like Datadog, Netuitive, Ruxit and Librato, to full application performance management tools like New Relic and AppDynamics. These provide extremely fast onboarding and come with integrations for many modern technologies.
- [James] And there's a whole category of open-source tools like StatsD, Ganglia, Graphite and Grafana. You can use these to collect large-scale distributed custom metrics.
- [Ernest] That's right, and you can pull those and put them into a time series database like InfluxDB or OpenTSDB to process them.
- There are application libraries specifically designed to emit metrics into these, like the excellent Metrics library from Codahale.
- There are plenty of new open-source monitoring solutions designed with more dynamic architectures in mind. Icinga and Sensu are two solutions somewhat similar to Nagios in concept, and can use the large existing set of Nagios plugins, but have more modern UIs and are easier to update in an ephemeral infrastructure world.
- You know, and containers have brought their own set of monitoring tools with them.
Stuff like the open-source tools Prometheus and Sysdig.
- Log management has become a first-order part of the monitoring landscape. This started in earnest with Splunk, the first log management tool anyone ever wanted to use.
- Yeah, and then it moved to SaaS with Sumo Logic, Logentries and similar offerings, but it's come back around full circle as an excellent open-source log management system has emerged. Composed of Elasticsearch, Logstash and Kibana, it's often referred to as the ELK stack.
- PagerDuty and VictorOps are two sterling examples of SaaS incident management tools that help you holistically manage your alerting and on-call burden.
- Yeah, this means you don't have to rely on the scheduling and routing functionality built into the monitoring tools themselves.
- And those don't always work very well.
- Yeah, yeah. Well, there's even an open-source project called Flapjack at flapjack.io that can help you do that yourself if you wish.
- Statuspage.io provides status pages as a service. You may have seen some of these in use by some of your SaaS providers. In accordance with transparent uptime principles, services can gateway their status and metrics to external or internal customers, and allow them to subscribe to updates of these pages.
- Okay, we mentioned orchestration back in the configuration management section, but a command dispatcher like Rundeck, SaltStack or Ansible is a good part of your operational environment for purposes of runbook automation. This means running canned procedures across systems for convenience and reduction in manual error.
- [Ernest] In my current job, we use a bash framework called Rerun that's available on GitHub. It allows you to write bash with proper language support: it handles option parsing and script structure, and has a unit testing facility built in.
All of our ad hoc automation gets built as Rerun scripts, which can then be exported to Rundeck as Rundeck jobs and run across our infrastructure.
- And don't forget about security. There are a variety of security monitoring tools out there. You want to look for ones that can be integrated more closely with your systems and be run when you deploy new applications. You know, a quarterly scan is good for compliance, but it's not really good for real security.
- No.
- You know, there are a million tools you might use as an SRE. Some of the best you're probably going to write yourself, but these are some of the ones that Ernest and I have seen used. All right, well, that's the end of this third major practice area of DevOps, reliability engineering. In our next video, we'll build on this foundation by giving you resources to go out and learn more on your own.
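As a rough illustration of the custom metrics mentioned in the transcript, here is a minimal sketch of how an application might emit a metric to a StatsD-style collector. It assumes the plain-text StatsD line format (`name:value|type`) sent over UDP to the conventional port 8125. The metric names, host, and helper functions are hypothetical and are not taken from the course.

```python
import socket


def format_metric(name: str, value: float, metric_type: str) -> str:
    """Build one StatsD-style line: <name>:<value>|<type>.

    Common types are "c" (counter), "ms" (timer) and "g" (gauge).
    """
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send the line as a single UDP datagram; no reply is expected.

    The host and port are assumptions for a collector running locally.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names, shown only to illustrate the line format.
counter_line = format_metric("checkout.requests", 1, "c")
timer_line = format_metric("checkout.latency", 320, "ms")
```

Passing `counter_line` to `send_metric` would hand the datagram to whatever collector is listening on that port. Because UDP is fire-and-forget, a missing collector typically does not slow down or break the application, which is one reason this style of metric emission is popular.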
In this course, well-known DevOps practitioners Ernest Mueller and James Wickett provide an overview of the DevOps movement, focusing on the core values of CAMS (culture, automation, measurement, and sharing). They cover the various methodologies and tools an organization can adopt to transition into DevOps, looking at both agile and lean project management principles and how old-school principles like ITIL, ITSM, and SDLC fit within DevOps.
The course concludes with a discussion of the three main tenets of DevOps (infrastructure automation, continuous delivery, and reliability engineering), as well as some additional resources and a brief look into what the future holds as organizations transition from the cloud to serverless architectures.
- What is DevOps?
- Understanding DevOps core values and principles
- Choosing DevOps tools
- Creating a positive DevOps culture
- Understanding agile and lean
- Building a continuous delivery pipeline
- Building reliable systems
- Looking into the future of DevOps