From the course: vSphere 6.7 Foundations: Administer Availability

Fault tolerance in vSphere 6.7

- [Rick] In this video, I'll explain how you can use fault tolerance to provide zero-downtime protection of virtual machines. So fault tolerance is different from high availability. With fault tolerance, we're looking at protecting critical virtual machines, so that in the event of a failure, there's no loss of data, transactions, or connections, and there's zero downtime.
And the way that fault tolerance works is that the VM that we are protecting is going to be mirrored to a second virtual machine on a different ESXi host. The state of the primary VM will continuously be synchronized to the secondary VM through a mechanism called checkpointing. So, if the ESXi host that the primary VM is running on fails, that will result in an immediate failover to the secondary. The secondary will become the primary, and a new secondary will be spawned to protect the new primary.
So one of the features of fault tolerance that's relatively new is that the secondary VM will actually be stored on a separate datastore, in order to protect the data of the virtual machine. So it's not just keeping the virtual machine up and running. It's keeping a secondary copy of all of the virtual machine data on a different datastore.
Fault tolerance is not meant to protect an entire cluster the way that HA does. With fault tolerance, we're going to protect specific mission-critical VMs. And by default, we can only protect a maximum of eight virtual machines per host with fault tolerance. Now, I can change that by modifying advanced settings on the ESXi host. I can also only protect virtual machines with a maximum of eight vCPUs, depending on my licensing edition. Now if I have vSphere Standard or vSphere Enterprise, the maximum size of a VM that I could protect with fault tolerance is two vCPUs.
If I have vSphere Enterprise Plus, I can protect a virtual machine with up to eight vCPUs with fault tolerance. And I can migrate these virtual machines with vMotion. I cannot migrate them with Storage vMotion. But I can use vMotion to move virtual machines from host to host, even if they're protected with fault tolerance. Fault tolerance also supports automatic migration of these VMs using DRS. So DRS and fault tolerance are compatible with one another.
Another relatively new enhancement of fault tolerance is that it now supports multiple disk formats. So, an old requirement of fault tolerance was to have a thick-provisioned disk. Now, fault tolerance supports thin-provisioned, thick-provisioned eager zeroed, and thick-provisioned lazy zeroed disks.
A prerequisite for fault tolerance is that I have to enable high availability on my host cluster. I cannot configure fault tolerance if high availability has not been enabled.
So let's take a look at a diagram, and better understand exactly how fault tolerance works. So on the left here, we see the primary virtual machine. This is the VM that we want to protect with fault tolerance. So we've got this primary VM running on a certain ESXi host. And this VM has a set of files, including its VMDK, its VMX, and other critical files that are stored on one particular datastore. And so, we're going to actually have a secondary copy of this virtual machine running.
Now before I can do that, I have to establish some of the prerequisites for fault tolerance. So here in our diagram, you can see there's a VMkernel port configured on each of these hosts. That VMkernel port is specifically tagged for fault tolerance logging traffic. It could be carrying other types of traffic, like management traffic or vMotion traffic, but I have to choose that option: it has to have fault tolerance logging enabled. This is the network that's going to be used to synchronize the state of the primary to the state of the secondary virtual machine.
So this has to be a 10-gigabit-per-second network as well. And VMware recommends a dedicated network, strictly for fault tolerance. So, once I've got my fault tolerance logging network created, then I'll enable fault tolerance on my primary virtual machine, and a secondary virtual machine will be automatically created. Not only that, but a secondary copy of the files for the primary virtual machine will be created on a second datastore.
So now, I have a copy of the primary VM running on one ESXi host, and a secondary copy on a different ESXi host, and I have all of the files for that VM on two different datastores. And ideally, these would be two completely separate storage devices. So if I can put them on two different datastores, on different storage arrays, that'll maximize my availability.
So, as any data changes, or any changes occur at all in the primary VM, those changes are copied over this fault tolerance logging network to the secondary virtual machine. And so in that manner, the secondary VM always stays exactly the same as the primary VM. As a matter of fact, if you open consoles to both virtual machines, and you move the mouse around, you can see the mouse moving in the secondary VM. They are exactly the same.
So, what happens if we have a failure? Let's break down a host failure. So here's my primary VM, running on host ESXi01. And here's all the files for that VM. Here's my secondary VM running on ESXi02, and all of the same files are being maintained on a secondary datastore. So now, if ESXi01 fails, that's going to take down my primary VM. So there goes my primary virtual machine. What's going to happen is I'm going to have an immediate failover to the secondary VM. That one's going to become the primary virtual machine. And so I'll have a zero-downtime failover to that secondary VM.
And then, once that secondary VM is the primary, a new secondary VM will be created on some other ESXi host, and another copy of those files will be created on another datastore to re-protect the new primary. So now if this new primary fails, there's another secondary out there, ready to take over. So, fault tolerance gives me zero-downtime protection in the event of a host failure.
What about a failure of the guest operating system? So, for example, let's say in my primary virtual machine, something bad happens, right? Let's say Windows goes to the blue screen of death, right? Or I delete some data that shouldn't be deleted, or a virus appears on this primary VM and takes it down. Well, any changes that are occurring in that primary operating system are going to be immediately replicated to that secondary VM, so fault tolerance doesn't give me any protection from operating system-level failures. Basically, fault tolerance is just replicating whatever happens in the OS. So if something bad happens in the OS here, that same bad thing is going to happen over here.
So, which features are supported and not supported when you enable fault tolerance on a virtual machine? Storage vMotion is not supported. Virtual Volumes are not supported either. So if you're using a VVOL storage system, that's not a possibility with fault tolerance. We also can't configure raw device mappings or USB devices on our fault-tolerant virtual machines. DRS, however, is supported. So we can place fault-tolerant VMs on a DRS cluster, and DRS can be used to automatically migrate those virtual machines.
So in review, unlike high availability, fault tolerance provides 100% uptime. When I enable high availability, I enable it on an entire cluster, and it protects all of the virtual machines in that entire cluster. But, if a host fails in that cluster, all of the VMs on that host are going to fail and eventually reboot on other hosts. That's different from fault tolerance.
With HA, I have to wait for VMs to restart in the event of a failure. With fault tolerance, I don't configure it on the entire cluster. I configure it on individual VMs. And if something happens and a host fails, those VMs still have 100% uptime. Fault tolerance is going to have a primary and a secondary VM running on separate hosts and on separate datastores. And if the primary host fails, the secondary VM immediately takes over, and a new secondary is spawned to re-protect that fault-tolerant virtual machine.
Fault tolerance supports virtual machines with a maximum of eight vCPUs, depending on my licensing edition. So the eight-vCPU feature is supported with Enterprise Plus. With any other licensing edition, it's two vCPUs maximum per virtual machine. And a maximum of eight fault-tolerant VMs can run on any single host, but remember we can override that using advanced configuration settings on the host.
Fault tolerance supports thin-provisioned, thick-provisioned eager zeroed, and thick-provisioned lazy zeroed disks. So I can choose whatever disk format I want when I'm creating a fault-tolerant virtual machine.
And then finally, I have to configure high availability on my cluster in order to configure fault tolerance. So I can't configure fault tolerance on standalone hosts. The hosts have to be part of a cluster, and that cluster has to have HA enabled, in order for me to enable fault tolerance on a virtual machine.
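
As a rough illustration of the rules reviewed in this video, the sketch below models them in plain Python. It is not the vSphere API or anything VMware ships. The names `VM`, `FTPair`, `ft_blockers`, and `handle_host_failure` are hypothetical, the limits are simply the numbers quoted in the video, and the behavior when the secondary's host fails or when no spare host exists is an assumption the video does not cover.

```python
from dataclasses import dataclass, field
from typing import Optional

# Limits as quoted in the video for vSphere 6.7 (assumptions, not read from vSphere).
MAX_FT_VCPUS = {"standard": 2, "enterprise": 2, "enterprise_plus": 8}
MAX_FT_VMS_PER_HOST = 8  # default per the video; adjustable through advanced settings
UNSUPPORTED_DEVICES = {"rdm", "usb"}  # raw device mappings and USB devices


@dataclass
class VM:
    name: str
    vcpus: int
    devices: set = field(default_factory=set)


def ft_blockers(vm, edition, ha_enabled, ft_vms_on_host):
    """Return the reasons FT could not be enabled; an empty list means eligible."""
    reasons = []
    if not ha_enabled:
        reasons.append("HA is not enabled on the cluster")
    if vm.vcpus > MAX_FT_VCPUS[edition]:
        reasons.append(f"{vm.vcpus} vCPUs exceeds the {edition} limit")
    if ft_vms_on_host >= MAX_FT_VMS_PER_HOST:
        reasons.append("host already runs the maximum number of FT VMs")
    bad = sorted(vm.devices & UNSUPPORTED_DEVICES)
    if bad:
        reasons.append("unsupported devices: " + ", ".join(bad))
    return reasons


@dataclass
class FTPair:
    primary_host: str
    secondary_host: Optional[str]  # None means the VM is currently unprotected


def handle_host_failure(pair, failed_host, cluster_hosts):
    """Promote the surviving copy, then pick a host for a new secondary."""
    if failed_host == pair.primary_host:
        survivor = pair.secondary_host  # secondary takes over as primary
    elif failed_host == pair.secondary_host:
        survivor = pair.primary_host  # assumption: primary keeps running
    else:
        return pair  # the failure did not touch this pair
    # The new secondary must live on a different host than the new primary.
    spares = [h for h in cluster_hosts if h not in (failed_host, survivor)]
    return FTPair(primary_host=survivor, secondary_host=spares[0] if spares else None)
```

For example, calling `ft_blockers(VM("db01", vcpus=4), "standard", ha_enabled=True, ft_vms_on_host=0)` reports the vCPU limit as the only blocker, and calling `handle_host_failure(FTPair("esxi01", "esxi02"), "esxi01", ["esxi01", "esxi02", "esxi03"])` promotes the copy on `esxi02` and places the new secondary on `esxi03`.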
