Operational support isn't just about keeping the systems up; it provides crucial feedback into the development cycle.
- Now we're going to shift gears a little and discuss how to create feedback loops from the actual production runtime back to development. We'll be covering monitoring and instrumentation basics, and how to take a lean approach to this topic. Before diving in, let's start with a little bit of theory. Dr. Richard Cook, a medical doctor, wrote an excellent paper called "How Complex Systems Fail." He was primarily concerned with medical process failures when he wrote it, but when you read it, you'll swear he was talking about the IT systems that you and I use. Here are a few of my favorite quotes from the paper: "Change introduces new forms of failure." "Complex systems contain changing mixtures of failures latent within them." "All complex systems are always running in a degraded mode." Each one of these resonates with me, and if you've spent any time working around complex systems, I bet they resonate with you as well. It's this task of managing complex systems that makes reliability engineering a really interesting topic.

Our approach to reliability engineering is to complete the operations feedback loop to development, and this works best in a lean fashion. I bet you probably knew we were going to mention lean again. Let's apply a build-measure-learn approach to monitoring. Ernest presented a lean approach to monitoring at a Velocity conference, and I found it incredibly useful as a framework for how to do monitoring. First, build: create a minimum viable monitoring stack, not a fully baked solution, but just enough to accomplish our goals. Second, measure: get a metric from each area of monitoring. We'll get into the monitoring areas shortly, but your goal is to go wide with your instrumentation first. Third, learn: analyze the application stack and the monitoring in place, and then reframe your goals after learning. Then repeat, cycling back and going deeper as needed.

Let's take a look at the six areas of monitoring that we suggest measuring: service performance and uptime, software component metrics, system metrics, application metrics, performance, and finally security. Service performance and uptime monitoring is implemented at the very highest level of a service or application. These are often referred to as synthetic checks, and they're synthetic because they're not real customers or real traffic. It's the simplest form of monitoring, answering the question, is it working? The next area of monitoring is software component metrics. This is monitoring done on ports or processes, usually located on the host. This moves down a layer: instead of answering "is my service working?" it asks "is this particular host working?" The next area is a layer deeper still, system metrics. These can be anything like CPU or memory. They are time series metrics that get stored and graphed so you can look at them and answer the question: is this service, host, or process functioning normally?

Next, we get into application metrics. Application metrics are telemetry from your application that give you a sense of what your application is actually doing. A couple of examples: emitting how long a certain function call is taking, the number of logins in the last hour, or a count of all the error events that have happened. These are all really useful, but they're also really custom, as the short sketch below illustrates.
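As an illustration, here is a minimal sketch of emitting a few custom application metrics like these, using the StatsD plain-text line format over UDP. The metric names, the function being timed, and the localhost:8125 endpoint are all made up for this example rather than anything from the course; in practice you'd use whatever metrics client your stack already provides.

```python
import socket
import time

# Where a StatsD-style collector would be listening (an assumption for this sketch).
STATSD_ADDR = ("127.0.0.1", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(metric: str) -> None:
    """Send one metric line, e.g. 'myapp.logins:1|c', over UDP."""
    sock.sendto(metric.encode("utf-8"), STATSD_ADDR)

def do_expensive_work() -> None:
    """Stand-in for the real function we want to instrument."""
    time.sleep(0.1)

# A count of error events, emitted as each one happens.
emit("myapp.errors:1|c")

# A login counter; summed per interval, this gives logins-per-hour style numbers.
emit("myapp.logins:1|c")

# How long a certain function call is taking, reported in milliseconds.
start = time.monotonic()
do_expensive_work()
elapsed_ms = (time.monotonic() - start) * 1000
emit(f"myapp.expensive_work.duration:{elapsed_ms:.0f}|ms")
```

The line formats here ("name:value|c" for counters, "name:value|ms" for timings) are the standard StatsD conventions; the point is simply that each emit is one small, application-specific measurement, which is why this kind of telemetry doesn't come for free the way host-level metrics do.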
Start with your questions about your app and instrument those key pieces, because unlike Pokemon, you can't collect them all. One of the last areas is performance metrics. Threaded through all the previous types of metrics are hints of performance, but I want to call out real user monitoring (RUM) and application performance monitoring (APM) specifically. Ernest already talked about APM, but just as a reminder, it's an instrumentation framework that isolates function performance at the code level. Real user monitoring, RUM, usually uses front-end instrumentation, for example a JavaScript page tag. It captures the performance observed by the users of the actual system, so it's able to tell you what your customers are actually experiencing. This is opposed to synthetic checks, which tell you what customers are probably experiencing.

The last area is security monitoring. Attackers don't hack systems magically by emitting a special packet that just takes everything down. It's a process, and there's enough digital exhaust created as an attack progresses that monitoring is possible, though sadly it's often rare. Security monitoring includes four key areas. First, the system: think of things like bad TLS/SSL settings, open ports and services, or other system configuration problems. Second, application security: knowing when XSS or SQL injection is attempted against your site. Third, custom events in the application: things like password resets, invalid logins, or new account creations; what are the flows in your system that would be indicators of compromise, or that people would want to abuse if they could? And fourth, anomalies: when you're seeing HTTP 401s or access attempts from irregular IP segments (there's a small sketch of this idea at the end of this section).

All right, let's move on. In several areas of monitoring, we mentioned metrics. Metrics are numerical data points, often in a time series format, that give an indication of system usage. In and of themselves, they're not really monitors, but they feed into monitors. There's sometimes a tendency to turn everything into a metric and then sort it out later. While some people are in favor of gathering every metric in the universe, there is such a thing as over-monitoring. I worked on one team where we were so vigorous about our monitoring that, once we did some analysis, we realized a full 30% of our system load was coming from our various monitoring tools. That was not so great. Now let's transition from talking about monitoring into another area of feedback from operations to design: logging.
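Before moving on to logging, here is a small sketch of the "anomalies" idea from the security monitoring discussion above: scanning a web server access log for HTTP 401 responses and flagging client IPs that rack up an unusual number of them. The log path, the common/combined log format assumption, and the threshold of 20 are all invented for the example; a real setup would feed this into an alerting or metrics pipeline rather than a script.

```python
import re
from collections import Counter

# Assumed for this sketch: a web access log in common/combined log format, e.g.
# 203.0.113.9 - - [12/Oct/2020:10:01:32 +0000] "POST /login HTTP/1.1" 401 243
LOG_PATH = "/var/log/nginx/access.log"   # hypothetical path
THRESHOLD = 20                           # 401s per client IP before we flag it

# Capture the client IP and the HTTP status code from each log line.
LINE_RE = re.compile(r'^(\S+) .*" (\d{3}) ')

def count_401s(path: str) -> Counter:
    """Count HTTP 401 responses per client IP address."""
    counts = Counter()
    with open(path) as log:
        for line in log:
            match = LINE_RE.match(line)
            if match and match.group(2) == "401":
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for ip, hits in count_401s(LOG_PATH).most_common():
        if hits >= THRESHOLD:
            # In a real system this would raise an alert or emit a metric,
            # not just print to the console.
            print(f"possible credential abuse: {ip} received {hits} HTTP 401 responses")
```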