From the course: AWS Well-Architected Framework: Operational Excellence Pillar

Design principles

- [Instructor] The operational excellence pillar of the Well-Architected Framework highlights several design principles that you need to consider following if you want to achieve operational excellence for your applications hosted at AWS.

The first one is to perform your operations as code. What does that mean? Well, our infrastructure is all software: virtual applications and virtual components, and they can all be automated. They can all be created using a script, that is, using code. Networks, EC2 instances, monitoring: they can all be defined with automation. A network can have logging. That logging can be captured and sent to the monitoring service, which alerts you if there are network issues. Again, all code operations. Instances can be built automatically. Monitoring and alerts can be defined using code. So, long term, our operational procedures can all be scripted and automated. Yes, this will take some time, but it's a goal worth working toward.

Examples of AWS services that you can use for automation include CloudFormation, Systems Manager, and CloudWatch. CloudFormation allows you to automate your entire application stack. Systems Manager helps you automate, for example, the updates of your EC2 instances. And CloudWatch is the monitoring service that can let you know when there are issues in any part of your stack, for the AWS services that you use.

Documentation, our favorite topic. Well, documentation is now annotated. It's automated, because when you run a managed service, it reports what it's doing. It creates its own logs, its reports, its alerts. So in a sense, that documentation is created by operations. Another example of annotated information after a process has completed is CloudFormation. Let's say you create a script that builds your network. At the end of a successful build of your network using code, there will be outputs that detail exactly what happened.
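To make the "script that builds your network" idea concrete, here is a minimal sketch, not the instructor's own example. It builds a small CloudFormation template in Python and prints it as JSON. The resource names `DemoVpc` and `DemoSubnet`, the CIDR ranges, and the function name are assumptions chosen for illustration. The `Outputs` section is what reports the created IDs after a successful build.

```python
import json


def build_network_template(vpc_cidr: str = "10.0.0.0/16",
                           subnet_cidr: str = "10.0.1.0/24") -> dict:
    """Return a CloudFormation template (as a dict) for one VPC and one subnet.

    Logical names and CIDR defaults are hypothetical, for illustration only.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative network stack: one VPC, one subnet",
        "Resources": {
            "DemoVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": vpc_cidr},
            },
            "DemoSubnet": {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    # Ref resolves to the ID of the VPC defined above.
                    "VpcId": {"Ref": "DemoVpc"},
                    "CidrBlock": subnet_cidr,
                },
            },
        },
        # Outputs are reported once the stack has been built successfully.
        "Outputs": {
            "VpcId": {
                "Description": "ID of the created VPC",
                "Value": {"Ref": "DemoVpc"},
            },
            "SubnetId": {
                "Description": "ID of the created subnet",
                "Value": {"Ref": "DemoSubnet"},
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_network_template(), indent=2))
```

If the printed JSON were saved to a file such as `network.json`, it could be deployed with `aws cloudformation deploy --template-file network.json --stack-name demo-network`, where the file and stack names are again placeholders.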
If I use CloudTrail, a service provided by Amazon that records AWS account activity, that's again more documentation created by operations. We can also store our EC2 instance system log data in the monitoring service, that is, CloudWatch Logs. So we have lots of ways of gathering information, gathering documentation that's being built in almost real time. Finally, if I want information about my network, I can use VPC flow logs. I can tell Amazon to record the traffic flows for the entire VPC, or maybe just at the subnet level, or maybe just for a single instance's network interface. I grab that traffic flow information so I can analyze it and see if there's an issue.

Over time, there will be changes to your application stack. Maybe they're part of the design. Maybe there'll be frequent changes. So you have to consider how you handle these changes. Are they daily, weekly, monthly, or yearly? It depends on the component we're talking about. Maybe it has to do with snapshots. Snapshots at Amazon are backups. So perhaps I want to back up the data in my database on a frequent schedule, or at least once a day; we have to make that determination. Weekly, there are probably going to be some updates. I want to test these updates in my test environment first, for sure. Maybe the changes are monthly: maybe I'm archiving logs, or maybe there are changes to my AMIs, the images that are used to create my EC2 instances. So there will be some frequency. How do you handle those changes? Can you automate those changes?

And of course there will be yearly changes at the very least. Amazon makes major changes every year; in fact, Amazon has changes every week. So you're probably going to want to look at how you are doing things at Amazon and see what has changed. What is new? What could be added to your design to make it, quote unquote, more excellent?
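Going back to the VPC flow logs mentioned above, the following is a rough sketch of how captured records might be analyzed. It assumes the default (version 2) flow log record format, which is a space-separated line of 14 fields. The sample records, account ID, and interface ID are illustrative values only, and the helper names are hypothetical.

```python
# Field order of the default (version 2) VPC flow log record format.
FIELDS = (
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
)

# Illustrative sample records, not real traffic.
SAMPLE_RECORDS = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 "
    "20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 "
    "49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]


def parse_record(line: str) -> dict:
    """Split one default-format flow log line into a field-name -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    # Values are kept as strings because some records use "-" for missing data.
    return dict(zip(FIELDS, values))


def rejected_flows(lines) -> list:
    """Return the parsed records whose action is REJECT."""
    records = [parse_record(line) for line in lines]
    return [record for record in records if record["action"] == "REJECT"]


if __name__ == "__main__":
    for flow in rejected_flows(SAMPLE_RECORDS):
        print(flow["srcaddr"], "->", flow["dstaddr"], "port", flow["dstport"])
```

Filtering on rejected traffic is just one possible starting point; the same parsed records could be grouped by interface, port, or source address.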
So we have to refine our operations procedures when operating in the cloud, because Amazon is making changes and almost forcing us to look at what has improved. After all, you have probably asked Amazon for some feature updates. When they deliver, you're going to want to make a change. So as your workload evolves over time, so must your processes and procedures.

Now, why might my workload evolve over time? Well, perhaps I use auto scaling. I decide to scale my applications up and down automatically. That's going to be a big change based on my workload and on the EC2 instances I choose to use. So there could be some really interesting changes as you adopt some of the features at AWS. Maybe your storage needs change. You need more storage. You decide to archive. Maybe, as mentioned, your compute needs are going to change. Maybe there's a new EC2 instance type that you want to move toward. There are always new database features, so over time you'll probably make some changes to your database. At the very least, you'll need more storage and probably more speed. And of course, as you learn more about your application, you're going to want to monitor more, and as the software changes, your needs change. You might decide to do something differently.

We also have to anticipate failure. We have to try to be proactive about how we react: the reaction has to be automated, and we have to have the ability to fail over or make a change when the application runs into potential problems. So we have to test our failure scenarios before we put our workloads online in production. And we have to understand the impact of every failure at every single layer, which means I have to do even more testing. And then I have to test my responses to those failures to make sure that they're effective.

And finally, we have to learn from our operational failures. Communication with all key members is absolutely key. The improvements that you make are going to be driven by lessons learned.
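Stepping back to the point about anticipating failure and testing the response, here is a small sketch of an automated reaction together with a simulated failure. It is an illustration under simple assumptions, not an AWS API: the endpoints are plain Python callables, and all names such as `call_with_failover` and `failing_primary` are hypothetical.

```python
from typing import Callable, Sequence


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the failover chain has failed."""


def call_with_failover(endpoints: Sequence[Callable[[], str]]) -> str:
    """Try each endpoint in order and return the first successful result."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint()
        except Exception as exc:  # the automated reaction: move to the next one
            errors.append(exc)
    raise AllEndpointsFailed(errors)


def failing_primary() -> str:
    """Simulated outage, used to test the failure scenario before production."""
    raise ConnectionError("simulated primary outage")


def healthy_standby() -> str:
    """Simulated standby that responds normally."""
    return "served by standby"


if __name__ == "__main__":
    # Inject the failure on purpose and check that the response is effective.
    print(call_with_failover([failing_primary, healthy_standby]))
```

The useful part is the deliberate failure injection: the failover path is exercised in a test, rather than for the first time during a real outage.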
And then I want to share that information across the teams and throughout the entire organization, right up to the executive level.

So when we're designing for operations, we have to look at the workload, the deployment of that workload, the updates, and the overall daily operation. We have to capture a wide range of information: what's going on with that workload, the user activity, the changes in state, the overall utilization. The logging is going to show you what is happening and what has happened. If you don't log, you won't know.
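As one possible way to capture user activity, state changes, and utilization in a form that is easy to search later, the sketch below writes each event as a JSON line using Python's standard `logging` module. The event names, field names, and helper functions are assumptions made for illustration. In practice the stream would typically be a log file or agent that ships to a service such as CloudWatch Logs.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "event": record.getMessage()}
        # Extra key/value pairs are attached by log_event() below.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("workload-operations")
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def log_event(logger: logging.Logger, event: str, **fields) -> None:
    """Log one operational event with arbitrary structured fields."""
    logger.info(event, extra={"fields": fields})


def demo() -> list:
    """Log three hypothetical events and return them parsed back from JSON."""
    stream = io.StringIO()
    logger = build_logger(stream)
    log_event(logger, "user_login", user="alice")
    log_event(logger, "state_change", resource="web-asg", old=2, new=4)
    log_event(logger, "utilization", cpu_percent=71.5)
    return [json.loads(line) for line in stream.getvalue().splitlines()]


if __name__ == "__main__":
    for entry in demo():
        print(entry)
```

Keeping every event as structured data means the same log can answer both "what is happening" and "what has happened" without custom parsing.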
