Ops has learned hard lessons about resiliency over the years. In this video, learn how to take those lessons into account when building your applications.
- [Instructor] Last time we talked about the theory of designing resilient systems from the development perspective. Now let's bring our hard-won operations experience to the table. You are now about to witness the strength of street knowledge.

First, the hard truth is that all systems fail. Code is often written with the assumption that failure of underlying systems is, if not impossible, at least very unusual, and should probably just result in some manual intervention. But failures and slowdowns are common. Throwing more and more money at highly available systems is a losing game. I've spent a lot of my career listening to excuses about why the system with no single point of failure still went down. Don't fall into that trap.

In fact, you can take the opposite approach, with what we call deliberate adversity. One of the most celebrated bits of software in DevOps is another gem from the Netflix crew, called the Chaos Monkey. We talked about it in a previous video, but in short, it monitors the Netflix production environment, which all runs in Amazon Web Services, and every so often it reaches out and kills a random server. This is simple, but at the same time profound and revolutionary. It tells everyone that not only can you not count on a given server, but we're going to actively jack it up. This drives fundamentally different approaches to system design, like the Hystrix library I mentioned earlier, and the benefits are clear. Netflix has consistently weathered major AWS outages with little to no loss of service, especially when compared to other web properties on the same platform. A quote you'll hear a lot is, "If it hurts, do it more often." The best way to avoid failure is to fail constantly.

It sounds silly to have to say it, but your system is, by definition, in a degraded state when things are going wrong, and in a modern distributed system there are a lot of things that can be broken. Operational processes must depend on as few integration points as possible. If you decide, "Oh, I'll just call the AWS API to get that information," what happens when your access to that API is being throttled, or the API is down? You need to be able to draw an X through any dependency of your system and understand what its behavior will be in that state.

Many new architectures avoid the trap of single points of failure by replicating everything. Cassandra, for example, is a popular big data solution where every piece of data is replicated three times across a ring of servers. Losing one server then isn't a big deal, whereas in a lot of traditional RDBMS architectures, losing a server is often an occasion for weeping and gnashing of teeth.
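To make the Chaos Monkey idea a little more concrete, here is a minimal sketch of the "kill a random server" loop in Python with boto3. This is not Netflix's actual tool; the region and the opt-in tag are assumptions made purely for illustration.

```python
# Minimal chaos sketch (assumed setup, not the real Chaos Monkey):
# pick one running, opted-in instance at random and terminate it,
# so teams are forced to design for server loss.
import random

import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Only consider running instances explicitly tagged for chaos testing (assumed tag).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos", "Values": ["enabled"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])
```

Run something like this on a schedule against systems that have opted in, and every hidden assumption about a server always being there gets surfaced quickly.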
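And for the "draw an X through any dependency" exercise, the simplest useful answer is usually a timeout plus a fallback on every remote call. Here is a hedged sketch of that idea in Python; the service URL, the function name, and the fallback payload are all hypothetical, and real circuit breakers such as Hystrix add failure tracking and automatic tripping on top of this.

```python
import requests


def get_recommendations(user_id):
    """Call a (hypothetical) recommendations service without letting it
    take the caller down: bound the call with a timeout and fall back to
    a canned response when the dependency is slow or unreachable."""
    try:
        resp = requests.get(
            f"https://recs.internal.example.com/users/{user_id}",  # assumed URL
            timeout=0.25,  # fail fast instead of tying up threads on a slow dependency
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # The dependency is "X-ed out": degrade gracefully instead of erroring.
        return {"items": [], "source": "fallback"}
```

The important design choice is that the degraded behavior is decided up front, in code, rather than discovered during an outage.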
Next, let's talk about performance. Again, this is something that your code affects to a much higher degree than the hardware it runs on. I remember one time a dev team I was working with rolled out a new version of their search-driven web application and performance degraded by ten times. "Well, can we get more hardware?" they asked. "No," I said, "fix your code." One hotfix later, performance was back close to the previous baseline. The power of software is that, depending on how you decide to perform an operation, you can spend ten, a hundred, or more times the system resources and time with a stroke of your virtual pen.

There are a variety of tools you can use to validate and improve the performance of your system. The first is, of course, performance testing during the build and deploy pipeline, but this doesn't tell you what's wrong, it just tells you whether you got faster or slower. If you're a developer, you need to understand how to use a code profiler (there's a small sketch of what that looks like at the end of this section). Profilers exist for everything from the C code I used to sling back in the day to the newest languages like Golang. For some reason, it seems that the ability to use a profiler has been dying out, and I see developers relying on very black-box techniques to understand why their app is slow.

Understanding performance system-wide can be more difficult, but luckily there's a class of tools for what's called application performance management. APM tools do distributed, lightweight profiling across a whole architecture and let you bring together timings and metrics to identify bottlenecks and slowdowns. You can run APM tools in production, and you may have to if you can't reproduce a problem in staging, but I've also found them to be even more useful in the development process. Find the problems before you roll them out.

In a distributed system, performance issues are often worse than straight-up failures. A slow-responding service both fails to fulfill its contract and wastes resources on calling systems. It's like wounding a soldier in combat: you end up taking more people out of the fight than if they had just been killed outright. Keeping a handle on performance issues, including running baselines on every single build in your CI pipeline, is critical to your service's health.

There's way more to say about these topics. The general approach, however, is to make sure you have operational expertise incorporated into the development phase of your product, and that you design in performance and availability from the beginning. Finally, you want to implement things that make maintenance easier. In Operate for Design, James will discuss monitoring, metrics, and logging. These aren't afterthoughts. They're things to design and implement upfront, and they are first-order requirements of your system. Please plan them out.
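As one concrete illustration of what "use a code profiler" means in practice, here is a minimal Python example using the standard library's cProfile and pstats modules. The handle_search function is just a hypothetical stand-in for whatever slow code path you are investigating.

```python
import cProfile
import pstats


def handle_search(query):
    # Hypothetical stand-in for the slow code path being investigated.
    return sorted(query * 1000)


# Profile one representative request, then print the ten most
# expensive calls ranked by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
handle_search("devops")
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

The point isn't this particular tool; it's that a profiler tells you where the time actually goes, which a pass/fail performance test in the pipeline never will.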
Updated: 10/28/2020
Released: 11/22/2016
In this course, well-known DevOps practitioners Ernest Mueller and James Wickett provide an overview of the DevOps movement, focusing on the core value of CAMS (culture, automation, measurement, and sharing). They cover the various methodologies and tools an organization can adopt to transition into DevOps, looking at both agile and lean project management principles and how old-school principles like ITIL, ITSM, and SDLC fit within DevOps.
The course concludes with a discussion of the three main tenets of DevOps (infrastructure automation, continuous delivery, and reliability engineering), as well as some additional resources and a brief look into what the future holds as organizations transition from the cloud to serverless architectures.
- What is DevOps?
- Understanding DevOps core values and principles
- Choosing DevOps tools
- Creating a positive DevOps culture
- Understanding agile and lean
- Building a continuous delivery pipeline
- Building reliable systems
- Looking into the future of DevOps