Learn how to calculate the number of 9s for determining availability goals.
- [Instructor] So what is service availability? It's defined as the percentage of time that an application is operating normally. And a way to calculate it is to say that the availability is equal to the normal operating time divided by the total time. And this is often expressed as a percentage of uptime over a period of time, for example, 99.9% per year. This is often called the number of nines. For example, five nines means that your system is planned to be 99.999% available.
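As a minimal sketch of that ratio, the helper below divides normal operating time by total time. The function name `availability_percent` and the downtime figure are hypothetical, chosen only to illustrate the arithmetic.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(uptime_hours: float, total_hours: float) -> float:
    """Availability = normal operating time / total time, as a percentage."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return 100.0 * uptime_hours / total_hours


# Hypothetical year in which the application was down for 8.76 hours in total.
downtime_hours = 8.76
uptime_hours = HOURS_PER_YEAR - downtime_hours
print(round(availability_percent(uptime_hours, HOURS_PER_YEAR), 3))  # 99.9
```

In other words, a year with a little under nine hours of total downtime works out to three nines.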
So you wanna consider dependencies when you're planning availability goals. You wanna understand hard dependencies on other systems, whether it's the public internet, or other systems that you don't own or control. You wanna consider the cost of redundant components. I mentioned this early on. We'll be talking about it quite a lot throughout this course. So server instances for EC2 or RDS or EMR, or other availability zones and placing services there. So there's a formula. The availability is the mean time between failures divided by the sum of the mean time between failures and the mean time to recover.
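The formula just stated can be sketched in a few lines. The function name `availability_from_mtbf` is hypothetical, and the inputs are the 150 days and one hour used in the example that follows, converted to hours so both terms share a unit.

```python
def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimated availability (%) = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return 100.0 * mtbf_hours / (mtbf_hours + mttr_hours)


mtbf_hours = 150 * 24  # mean time between failures: 150 days
mttr_hours = 1         # mean time to recover: one hour
estimate = availability_from_mtbf(mtbf_hours, mttr_hours)
print(f"{estimate:.2f}%")  # 99.97%
```

Note that the estimate is only as good as the mean time to recover you feed into it, which is why that number has to come from tested recovery procedures rather than a guess.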
For example, if the mean time between failures is 150 days and the mean time to recover is one hour, then your availability estimate is 99.97%. Now in order to calculate this, of course, you're going to have to know what your mean time to recover actually is and that means you're gonna have to have known and tested backup and restore or backup and recover strategies for all of your key services. So here's an example chart, just to get you started thinking about this.
We have availability percentage on the left, so the number of nines, from two to five. We have downtime, and this is availability over a one-year period. And then we have example types of applications for the level of availability. Now before I go into this more deeply, you might say, well what do you mean? Everything should be 100% available. Well you'll be talking like a business person. There are costs associated with availability. Not only service costs, in terms of redundant servers, for example, but more importantly personnel costs. Who is going to maintain those redundant servers? Who's gonna test the backups? So the availability goals for particular applications need to be business justified and that's where these different levels of nines come into play.
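The downtime column of a chart like this can be derived from the availability percentage alone. The following is a rough sketch, assuming a 365-day year; the helper names `allowed_downtime_hours` and `format_duration` are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours per year a service may be down at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100.0)


def format_duration(hours: float) -> str:
    """Render hours as days/hours/minutes, dropping leftover seconds."""
    # Round to whole seconds first to absorb floating-point noise,
    # then truncate to whole minutes.
    total_minutes = round(hours * 3600) // 60
    days, remainder = divmod(total_minutes, 24 * 60)
    hrs, mins = divmod(remainder, 60)
    return f"{days}d {hrs}h {mins}m"


for pct in (99.0, 99.9, 99.95, 99.99, 99.999):
    print(f"{pct}% -> {format_duration(allowed_downtime_hours(pct))} per year")
```

For 99.9%, for instance, this gives 8 hours and 45 minutes per year, and for 99.99% it gives 52 minutes.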
So for example, you might look at batch processing or data extraction at an availability level of two nines. So if you were not able to do batch processing for a couple of days, up to three days and 15 hours per year, that would be okay. And defining that goal for the broader business is a very important aspect of availability. The next example is three nines. So down for up to eight hours and 45 minutes in a year. Internal tools, such as for project management. Now while that might sound horrible, that you couldn't do project management for a whole day, be aware that this estimation is not continuous.
So it's not eight hours continuously. It could be eight one-hour stints, depending on your mean time to recover. The next is 99.95%, so three and a half nines if you will. That's four hours and some minutes. Online commerce, point of sale systems. Again, four hours is not continuous here. You wanna look at your mean time to recover. Then four nines, 52 minutes. Video delivery, broadcast systems. And then five nines, which every manager says they want, but when you show them the costs, they definitely do not wanna pay for it.
ATM transactions, telecommunications. It's really important to be able to use metrics to communicate so that expectations are realistic and so that people who are funding the availability approaches can understand what it is they're funding. I find a lot of tension in shops around availability situations where there have been outages and there isn't this level of communication. Speaking of communication, if you're working with servers that you own, most probably you will have scheduled maintenance, so for database servers, for application servers.
And an important consideration that I see often overlooked, so I wanted to call that out here, is to have a discussion about, does planned service maintenance count in the availability calculations? And the reason I'm saying count is because what I actually advocate for is that once you establish an availability goal, you make it a key business metric and possibly even associate some financial gain to the teams that are responsible for achieving that metric, if they do. So again, I like to say what gets measured matters and the top performing teams that I see are very clear and crisp in their communication and they have discussed all aspects of availability, including maintenance, when they make their goals each year and then update them periodically.
Again, just to underscore, it's important to understand the cost of high availability. As the scenarios become more complex, and we'll be seeing scenarios in depth later on in this course, you'll need more testing and validation for more types of failures. Can the network traffic, or lack of it, cause a failure? Can one server being down cause a failure? Can an external dependency cause a failure? You're defining the possibilities for failure and then figuring out how to test and validate, because for anything that could fail, in order to have measurable metrics, you have to test the mean time to recover.
That takes time and money. You'll also need more automated recovery for more failure scenario types. As you're testing and figuring out how to recover from various failures, as a best practice, you'll want to add automation to automatically recover from possible failures. Of course, that's not magic. There's a cost to creating that automated recovery. And although it is in some ways a one-time cost, because once you've set it up, it's good to go, it can be substantial. And one that I think's really subtle, if you are adding more testing and validation for more types of failure, you're also potentially slowing innovation because you have overhead if you're gonna change the system.
You can't just push new code or push new features because you're going to have to add this testing and validation for potential failure to continue to meet your availability goals.
- Understanding high availability (HA)
- Preparing for HA
- Designing for HA
- Understanding continuous deployment (CD)
- Types of verification tests used in CD
- Server mutability and CD
- Implementing CD
- Advanced CD pipeline techniques
- CD pipeline with Step Functions