Review and understand fault tolerance and high availability concepts and how the cost of availability increases exponentially.
- [Instructor] You've decided to adopt Amazon Web Services or AWS as your infrastructure as a service partner. It's well established that fault-tolerant, highly available systems can be created using AWS tools. Before we get into AWS specifics, let's go over some concepts related to fault tolerance and high availability. To start off, let's talk a bit about the difference between fault tolerance and high availability. The applications and services you operate consist of many components. Let's say you have a classic three-tiered application consisting of load-balanced web servers talking to load-balanced application servers talking to a database server with a hot standby.
Suppose you experience a failure in your web and application tiers. With appropriate load balancing and application state management, your application will continue to operate. Since you've lost some capacity, your application performance may be degraded. Relational databases are a bit trickier. It may take a bit longer for your database to fail over to its hot standby. In that failure scenario, your users may experience a brief service interruption. Your application will still work, but if you then lose the standby database as well, you'll have big problems.
Again, you're operating in a degraded state. A fault-tolerant application continues to operate when one of its components fails. Depending on the type of failure, the application may or may not be running in a degraded state. This brings us to high availability. High availability is all about how long a given system stays up over a specified period of time. You've probably had this conversation in terms of the number of nines. For example, AWS has an object storage offering called Simple Storage Service, or S3.
The service level agreement for S3 starts offering service credits if monthly availability falls below 99%, or two nines. That comes out to 7.2 hours of downtime in a 30-day month. Your business requirements will dictate the number of nines you need to target from an availability perspective. Of course, the greater the number of nines, the more expensive it is. The cost increase isn't linear. Getting to five nines is incredibly expensive and hard to do. Inside a physical server, there are just so many components that can break.
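To make the "number of nines" arithmetic concrete, here is a minimal Python sketch. It assumes a 30-day month (720 hours), which is the assumption behind the 7.2-hour figure for two nines. The function name `allowed_downtime_hours` and the list of targets are illustrative, not part of any AWS tool or SLA.

```python
# Assumption: a "month" is 30 days, matching the 7.2-hour figure for 99%.
HOURS_PER_30_DAY_MONTH = 30 * 24  # 720 hours


def allowed_downtime_hours(availability_percent: float,
                           period_hours: float = HOURS_PER_30_DAY_MONTH) -> float:
    """Return the downtime (in hours) permitted by an availability target."""
    return period_hours * (1 - availability_percent / 100)


# Illustrative targets; each extra nine cuts the downtime budget by 10x.
targets = [
    ("two nines", 99.0),
    ("three nines", 99.9),
    ("four nines", 99.99),
    ("five nines", 99.999),
]

for label, percent in targets:
    minutes = allowed_downtime_hours(percent) * 60
    print(f"{label} ({percent}%): {minutes:.2f} minutes of downtime per month")
```

Each additional nine shrinks the monthly downtime budget by a factor of ten, so at five nines there is well under a minute per month to detect and recover from any failure. That shrinking budget is one way to see why the cost does not grow linearly.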
Individual component failures can include network cards, disk drives, CPUs, memory, and more. If you virtualize your servers, you still have the underlying hardware to worry about. Zoom out to the data center, and you have an entirely new set of concerns. You have to worry about redundant networking switches, routers, environmental controls, internet connectivity, and more. This brings me to one of my favorite quotes of all time. Werner Vogels, VP and CTO at Amazon.com, once said, "Everything fails all the time." He's acknowledging that every physical component is going to fail at some point.
With AWS, you're freeing yourself from the headaches, cost, and complexity associated with physical failures. Instead, you get to use the AWS tools to make your application as available and fault-tolerant as your business demands. Before we continue, remember you don't have to be exactly like Netflix. Frequently and deservedly, Netflix is referenced as a poster child of what is possible in AWS. This is because, as a company, it has engineered remarkably resilient and available applications worldwide using AWS tools.
You have to stop and consider what your recovery time objectives or RTOs are. You also need to define your recovery point objectives, or RPOs. Service RTOs and RPOs are just a couple of the business requirements you need to consider as you proceed with designing a highly available system in AWS.
This course is also part of a series designed to help you prepare for the AWS Certified SysOps Administrator – Associate certification exam.
- What is high availability?
- Designing for failure
- Exploring Route 53
- Working with machine images
- Setting up load balancing
- Creating groups
- Working with auto scaling
- Improving relational database availability
- Setting up an elastic file system