Implement and manage Windows Server High Availability

In this sample chapter from Exam Ref AZ-801 Configuring Windows Server Hybrid Advanced Services, you will explore Windows Server high availability, including the requirements for configuring a failover cluster and how to manage that cluster.

The primary high-availability technology available in Windows Server is failover clustering. Clustering allows you to ensure that your workloads remain available if server hardware or even a site fails. You can configure Windows Server failover clustering for on-premises, hybrid, and Azure hosted workloads. Understanding Windows Server high availability involves knowing the requirements for configuring a failover cluster and how to manage that cluster. This chapter also addresses the deployment, configuration, and management of the highly available Windows Server storage technology named Storage Spaces Direct.

Skills covered in this chapter:

  • Skill 2.1: Implement a Windows Server failover cluster

  • Skill 2.2: Manage failover clustering

  • Skill 2.3: Implement and manage Storage Spaces Direct

Skill 2.1: Implement a Windows Server failover cluster

This objective deals with managing Windows Server failover clusters that are deployed on-premises, in a hybrid configuration, and in Azure. To master this objective, you’ll need to understand Windows Server failover clustering requirements, possible cluster configurations, and the types of cluster workloads that you can deploy.

Implement a failover cluster on-premises, hybrid, or cloud-only

The primary high-availability technology available in Windows Server is failover clustering. Failover clustering is a stateful, high-availability solution that allows an application or service to remain available to clients if a host server fails. You can use failover clustering to provide high availability to applications such as SQL Server and to scale out file servers and virtual machines (VMs). With clustering, you can ensure that your workloads remain available if server hardware or even a site fails.

Failover clustering is supported in both the Standard and Datacenter editions of Windows Server. In some earlier versions of the Windows Server operating system, you gained access to failover clustering only if you used the Enterprise edition. Windows Server supports up to 64 nodes in a failover cluster.

Generally, all servers in a cluster should have a similar hardware configuration or be similarly provisioned virtual machines. You should also use the same edition and installation option on every node. For example, cluster nodes should all run either the full GUI or the Server Core version of Windows Server; avoid a mix of Server Core and full GUI nodes in the same cluster. Avoiding this mix ensures that you can use a similar update routine, which is more difficult to maintain when you use different versions of Windows Server.

You should use the Datacenter edition of Windows Server when building clusters that host Hyper-V virtual machines because the virtual machine licensing scheme available with this edition provides the most VM licenses.

To be fully supported by Microsoft, cluster hardware should meet the Certified for Windows Server logo requirement. An easy way of accomplishing this is to purchase and deploy Azure Stack HCI, a prebuilt hyper-converged Windows Server installation available from select vendors. Even though it is called Azure Stack HCI and sounds as though it is far more of a cloud-based solution, it’s primarily just an optimized Windows Server deployment on a certified configuration with all the relevant clustering and “Software-Defined Datacenter” features lit up.

Create a Windows failover cluster

Windows Server failover clusters have the following prerequisites:

  • All cluster nodes should be running the same version and edition of Windows Server.

  • You can add clustered storage during or after cluster creation.

  • All servers in the cluster that are located in the same site should be members of the same Active Directory (AD) domain. If configuring a stretch cluster, nodes must be members of the same forest.

  • The account used to create the cluster must be a domain user who has local administrator rights on all servers that will function as cluster nodes.

  • The account used to create the cluster requires the Create Computer Objects permission in the organizational unit (OU) or container that will host the cluster-related Active Directory objects.

Recommended practice is to place the computer accounts for cluster nodes in the same OU and to use separate OUs for each cluster. Some organizations create child OUs for each separate cluster in a specially created parent Cluster OU.

You install failover clustering by installing the Failover Clustering feature, performing initial cluster configuration, running the cluster validation process, and then performing cluster creation. You can use Windows Admin Center, PowerShell, or the Server Manager console to perform these tasks. Once the cluster is deployed, you can manage your clusters using the Failover Clustering Remote Server Administration Tools (RSAT), PowerShell, or Windows Admin Center.

You can install the Failover Clustering feature and its associated PowerShell cmdlets on a node using the following PowerShell command:

Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
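
If you are working from a management workstation, you can install the feature on every prospective node remotely and then create the cluster. The following is a minimal sketch only; the node names node1 and node2, the cluster name CLUSTER1, and the address 192.168.10.50 are hypothetical:

# Install the Failover Clustering feature on each prospective node (hypothetical node names)
Invoke-Command -ComputerName node1,node2 -ScriptBlock {
    Install-WindowsFeature -Name Failover-Clustering -IncludeManagementTools
}
# After validation succeeds, create the cluster with a static address (hypothetical values)
New-Cluster -Name CLUSTER1 -Node node1,node2 -StaticAddress 192.168.10.50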

Validating cluster configuration

Cluster validation performs a check of a cluster's current or proposed configuration and allows you to determine whether you have the necessary pieces in place before attempting to create the cluster. Although you can skip validation, recommended practice is to go through the process: even if you have created numerous clusters in the past, that's no guarantee you won't overlook some small but critical detail the next time you create one.

The period prior to cluster deployment is not the only time that you can perform cluster validation. You should rerun cluster validation whenever you change or update a significant component of the cluster. This includes adding nodes, modifying storage hardware, updating network adapters, updating firmware or drivers for network adapters, and updating multipathing software. Cluster validation performs tests in six categories:

  • Inventory Inventory tests determine if the hardware, software, networking, and storage configuration support the deployment of a cluster.

  • Network A detailed set of tests to validate cluster network settings.

  • Storage A detailed set of tests to analyze shared cluster storage.

  • Storage Spaces Direct (S2D) A detailed set of tests to analyze S2D configuration.

  • System Configuration A set of tests on the current system configuration.

  • Cluster Configuration This category of test only executes on deployed clusters to verify that best practices are being followed (for example, using multiple network adapters connected to different networks).

You can perform cluster validation from the Failover Cluster Management Tools that are part of the Remote Server Administration Tools, using Windows Admin Center to connect to an existing cluster, or by running the Test-Cluster PowerShell cmdlet.
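
For example, a minimal sketch of validating two prospective nodes, and of rerunning only selected test categories against an existing cluster, might look like the following (the node and cluster names are hypothetical):

# Validate prospective nodes before creating the cluster (hypothetical node names)
Test-Cluster -Node node1,node2
# Rerun selected categories against an existing cluster after a configuration change
Test-Cluster -Cluster CLUSTER1 -Include "Inventory","Network","System Configuration"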

Prestage cluster computer objects

During the cluster creation process, a computer object is created in Active Directory Domain Services (AD DS) that matches the cluster name. This AD DS object is called the cluster name object. As mentioned earlier in the chapter, the domain user account used to create the cluster must have the Create Computer Objects permission in order to create this object. It’s possible to have an appropriately permissioned account pre-create a cluster name object. When this is done, the account used to then create the cluster using the constituent nodes does not require the Create Computer Objects permission.
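
A minimal sketch of pre-creating a cluster name object with the Active Directory PowerShell module follows; the cluster name CLUSTER1 and the OU path are hypothetical, and you would still need to grant the account that will create the cluster Full Control of this object:

# Pre-create a disabled computer object that matches the planned cluster name (hypothetical names)
New-ADComputer -Name "CLUSTER1" -Path "OU=Clusters,DC=tailwindtraders,DC=com" -Enabled $false `
    -Description "Pre-staged cluster name object"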

Workgroup clusters

Workgroup clusters are a special type of cluster where cluster nodes are not members of an Active Directory domain. Workgroup clusters are also known as Active Directory detached clusters. The following workloads are supported for workgroup clusters:

  • SQL Server When deploying SQL Server on a workgroup cluster, you should use SQL Server Authentication for databases and SQL Server Always On Availability Groups.

  • File Server A supported but not recommended configuration as Kerberos will not be available as an authentication protocol for SMB traffic.

  • Hyper-V A supported but not recommended configuration. Hyper-V live migration is not supported, though it is possible to perform quick migration.

When creating a workgroup cluster, you first need to create a special account on all nodes that will participate in the cluster and prepare each node as follows (a sketch of the account setup appears after this list):

  • The special account must have the same username and password on all cluster nodes.

  • The special account must be added to the local Administrators group on each cluster node.

  • The primary DNS suffix on each cluster node must be configured with the same value.

  • When creating the cluster with the New-Cluster cmdlet, ensure that the AdministrativeAccessPoint parameter is set to DNS. Also ensure that the cluster name is present in the appropriate DNS zone, which depends on the primary DNS suffix, when running this command.

  • You will need to run the following PowerShell command on each node to configure the LocalAccountTokenFilterPolicy registry setting to 1:

    New-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System -Name LocalAccountTokenFilterPolicy -Value 1
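
A minimal sketch of the account setup on a single node follows; the account name clusteradmin is hypothetical, and the same commands are run on every node:

# Create the common local administrator account on this node (hypothetical account name)
$password = Read-Host -AsSecureString -Prompt "Cluster account password"
New-LocalUser -Name "clusteradmin" -Password $password -PasswordNeverExpires
Add-LocalGroupMember -Group "Administrators" -Member "clusteradmin"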

To create a workgroup cluster, use the New-Cluster cmdlet with the name parameter listing the cluster name, the node parameters listing the nodes that you wish to join to the cluster where the nodes have been configured according to the prerequisites, and the AdministrativeAccessPoint parameter configured for DNS. For example, to create a new workgroup cluster named workgrpclst with member nodes node1 and node2, run the following command on one of the nodes:

New-Cluster -name workgrpclst -node node1,node2 -AdministrativeAccessPoint DNS

Stretch cluster across datacenter or Azure regions

Failover clusters can span multiple sites. From the perspective of a hybrid environment, site spanning can include spanning two separate on-premises locations, an on-premises location and an Azure datacenter, or having cluster nodes hosted in different Azure regions. When configuring a cluster that spans two sites, you should consider the following:

  • Ensure that there are an equal number of nodes in each site.

  • Allow each node to have a vote.

  • Enable dynamic quorum. Dynamic quorum allows quorum to be recalculated when individual nodes leave the cluster one at a time. Dynamic quorum is enabled by default on Windows Server failover clusters.

  • Use a file share witness. You should host the file share witness on a third site that has separate connectivity to the two sites that host the cluster nodes. When configured in this manner, the cluster retains quorum if one of the sites is lost. An alternative to a file share witness is an Azure Cloud Witness.

If you only have two sites and are unable to place a file share witness in an independent third site, you can manually edit the cluster configuration to reassign votes so that the cluster recalculates quorum.

Manually reassigning votes is also useful to avoid split-brain scenarios. A split-brain scenario occurs when a failure in a multisite cluster leaves both sides of the cluster believing they have quorum. Split-brain scenarios cause challenges when connectivity is restored and make it necessary to restart servers on one side of the multisite cluster to resolve the issue. You can manually reassign votes so that one side always retains quorum if intersite connectivity is lost. For example, by assigning the Melbourne site two votes and the Sydney site one vote, the Melbourne site always retains quorum if intersite connectivity is lost.
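
Node votes are controlled through the NodeWeight property of each cluster node. A minimal sketch, assuming a two-node-per-site cluster with hypothetical node names, of giving Melbourne two votes and Sydney one vote:

# Remove the vote from one of the Sydney nodes so that Melbourne retains quorum (hypothetical node name)
(Get-ClusterNode -Name "SYD-NODE2").NodeWeight = 0
# Confirm the current vote assignment across the cluster
Get-ClusterNode | Select-Object Name, NodeWeight, DynamicWeight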

You can use Storage Replica to enable stretch clusters with shared storage. Stretch clusters that use shared storage have the following requirements:

  • Cluster nodes are all members of the same AD DS forest.

  • Firewall rules allow ICMP, SMB (ports 445 and 5445 for SMB direct), and WS-MAN (port 5985) bidirectional traffic between all nodes that participate in the cluster.

  • They must have two sets of shared storage that support persistent reservation. Each storage set must be able to support the creation of two virtual disks. One will be used for replicated data and the other for logs. These disks need to have the following properties:

    • They must be initialized as GUID Partition Table (GPT) and not Master Boot Record (MBR).

    • Data volumes must be of identical size and use the same sector size.

    • Log volumes must be of identical size and use the same sector size.

  • Replicated storage cannot be located on the drive containing the Windows Server operating system.

  • Premium SSD must be used for cluster nodes hosted as infrastructure-as-a-service (IaaS) VMs in Azure.

  • Ensure that there is less than 5-millisecond round-trip latency if synchronous replication is being used. If asynchronous replication is being used, this requirement does not need to be met.

  • Stretch clusters configured in this way use Storage Replica to replicate shared cluster storage between locations.
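
Before configuring replication, you can check whether two prospective nodes and their data and log volumes meet these requirements, including the latency requirement, with the Test-SRTopology cmdlet. The following is a sketch only; the server names, drive letters, and result path are hypothetical:

# Evaluate prospective Storage Replica partners and produce a requirements report (hypothetical names)
Test-SRTopology -SourceComputerName "MEL-NODE1" -SourceVolumeName "D:" -SourceLogVolumeName "L:" `
    -DestinationComputerName "SYD-NODE1" -DestinationVolumeName "D:" -DestinationLogVolumeName "L:" `
    -DurationInMinutes 30 -ResultPath "C:\Temp"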

Configure storage for failover clustering

Storage for Windows Server failover clusters needs to be accessible to each node in the cluster. You can use serial-attached SCSI (SAS), iSCSI, Fibre Channel, or Fibre Channel over Ethernet (FCoE) to host shared storage for a Windows Server failover cluster.

You should configure disks used for failover clustering as follows:

  • Volumes should be formatted using NTFS or ReFS.

  • Use Master Boot Record (MBR) or GUID Partition Table (GPT).

  • Avoid allowing different clusters access to the same storage device. This can be accomplished through LUN masking or zoning.

  • Any multipath solution must be based on Microsoft Multipath I/O (MPIO).

Cluster Shared Volumes (CSV) is a technology that allows multiple cluster nodes to have concurrent access to a single physical or virtual storage device, also termed a logical unit number (LUN). CSV allows you to have virtual machines on the same shared storage run on different cluster nodes. CSV also has the following benefits:

  • Support for scale-out file servers

  • Support for BitLocker volume encryption

  • SMB 3.0 and higher support

  • Integration with Storage Spaces

  • Online volume scan and repair

You can enable CSV only after you create the failover cluster and have added shared storage that is accessible to each node that will use the CSV.
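
A minimal sketch of adding available disks to cluster storage and then converting one of them to a CSV follows; the disk name shown is whatever name the cluster assigned when the disk was added:

# Add any disks that are visible to all nodes to the cluster's available storage
Get-ClusterAvailableDisk | Add-ClusterDisk
# Convert a clustered disk to a Cluster Shared Volume (disk name as assigned by the cluster)
Add-ClusterSharedVolume -Name "Cluster Disk 1"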

Modify quorum options

A cluster quorum mode determines how many nodes and witnesses must fail before the cluster is in a failed state. Nodes are servers that participate in the cluster. Witnesses can be stored on shared storage, on file shares, in Azure as a cloud witness, and even on a USB drive attached to a network switch; shared storage is the preferred method.

For unknown reasons, some people use Distributed File System (DFS) shares as file share witnesses when setting up their failover clusters. To stop this type of shenanigan from occurring in the future, Microsoft has configured Windows Server failover clustering so that it explicitly blocks the use of DFS namespaces when configuring a file share witness.

Microsoft recommends that you configure a cluster so that an odd number of total votes be spread across member nodes and the witness. This limits the chance of a tie during a quorum vote.

There are four cluster quorum modes:

  • Node Majority This cluster quorum mode is recommended for clusters that have an odd number of nodes. When this quorum type is set, the cluster retains quorum when the number of available nodes exceeds the number of failed nodes. For example, if a cluster has five nodes and three are available, quorum is retained.

  • Node and Disk Majority This cluster quorum mode is recommended when the cluster has an even number of nodes. A disk witness hosted on a shared storage disk, such as iSCSI or Fibre Channel, that is accessible to cluster nodes has a vote when determining quorum, as do the cluster nodes. The cluster retains quorum as long as the majority of voting entities remain online. For example, if you have a four-node cluster and a witness disk, a combination of three of those entities needs to remain online for the cluster to retain quorum. The cluster retains quorum if three nodes are online or if two nodes and the witness disk are online.

  • Node and File Share Majority This configuration is similar to the Node and Disk Majority configuration, but the quorum is stored on a network share rather than on a shared storage disk. It is suitable for similar configurations to Node and Disk Majority. This method is not as reliable as Node and Disk Majority because file shares generally do not have the redundancy features of shared storage.

  • No Majority: Disk Only This model can be used with clusters that have an odd number of nodes. It is only recommended for testing environments because the disk hosting the witness functions as a single point of failure. When you choose this model, as long as the disk hosting the witness and one node remain available, the cluster retains quorum. If the disk hosting the witness fails, quorum is lost, even if all the other nodes are available.

When you create a cluster, the cluster quorum is automatically configured for you. You might want to alter the quorum mode, however, if you change the number of nodes in your cluster. For example, you might want to alter the quorum mode if you change from a four-node to a five-node cluster. When you change the cluster quorum configuration, the Failover Cluster Manager provides you with a recommended configuration, but you can choose to override that configuration if you want.
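
You can also change the quorum configuration with the Set-ClusterQuorum cmdlet. The following sketch shows two common configurations; the file share path is hypothetical:

# Let the cluster vote using nodes only (suits an odd number of nodes)
Set-ClusterQuorum -NodeMajority
# Use a file share witness in addition to node votes (hypothetical share path)
Set-ClusterQuorum -NodeAndFileShareMajority "\\FS01\ClusterWitness"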

You can also perform advanced quorum configuration to specify what nodes can participate in the quorum vote, which you can set on the Select Voting Configuration page of the Configure Cluster Quorum Wizard. When you do this, only the selected nodes’ votes are used to calculate quorum. Also, it’s possible that fewer nodes would need to fail to cause a cluster to fail than would otherwise be the case if all nodes participated in the quorum vote. This can be useful when configuring how multisite clusters calculate quorum when the connection between sites fails.

Configure network adapters for failover clustering

While you can create failover clusters with nodes that have a single network adapter, best practice is to have separate networks and network adapters for the following:

  • A connection for cluster nodes to shared storage

  • A private network for internal cluster communication

  • A public network that clients use to access services hosted on the cluster

In scenarios where high availability is critical, you might have multiple redundant networks connected through several separate switches. If you have a cluster where everything is connected through one piece of network hardware, you can almost guarantee that piece of network hardware is the first thing that fails.

Failover clustering only supports IPv4- and IPv6-based protocols. You can use either IPv4 or IPv6 addresses that are dynamically or statically assigned, but you should not use a mix of dynamically and statically assigned IP addresses for nodes that are members of the same cluster. If you use a mixture of dynamically and statically assigned IP addresses, the Validate A Configuration Wizard generates an error.

Although the Cluster Validation Wizard only generates warnings about this when you perform the test, you cannot create a failover cluster unless each node is configured with a default gateway. The default gateway doesn't have to be a host that exists, so if you're having trouble creating a failover cluster in your virtual machine lab, go back and check whether you've configured a default gateway for each node.

Ensure that TCP and UDP port 3343 is open on firewalls between cluster nodes. This port is used for cluster communication, and if communication on this port is disrupted, a node may appear to be in a failed state. Although it's possible to carry cluster and client communication over a single adapter and network, production deployments should use separate adapters and networks for cluster communication.
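
When you do use separate networks, you can control how the cluster uses each one through the Role property of the cluster network objects. A minimal sketch, assuming the network names shown were assigned by the cluster:

# Reserve one network for cluster communication only and another for cluster and client traffic
(Get-ClusterNetwork -Name "Cluster Network 1").Role = 1   # 1 = cluster communication only
(Get-ClusterNetwork -Name "Cluster Network 2").Role = 3   # 3 = cluster and client traffic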

If there are multiple paths to physical storage on a server, you will need to enable multipath I/O (MPIO). Enabling MPIO aggregates multiple paths to a physical disk into a single logical path for data access. Enabling MPIO improves resiliency to failure. You should enable MPIO on all nodes that will participate in the cluster. You can enable MPIO with the following PowerShell command:

Install-WindowsFeature -ComputerName Node1 -Name MultiPath-IO

Configure cluster workload options

Cluster workload options include workload startup priority, preferred owners, and failover settings. Configuring startup priority determines when the workload becomes available after a cluster disruption. You can choose between High, Medium, Low, and No Auto Start.

Workload failover settings allow you to configure the following:

  • Maximum failures in a period that will trigger failover

  • Period over which these failures are measured

Cluster preference settings allow you to configure the preferred owner for a specific cluster role, and you can configure different preferred owners for different cluster roles. Where possible, the role is hosted on the preferred owner. You can configure a list of preferred owners so that if the most preferred owner isn't available, the next preferred owner hosts the role. Workloads will start on the preferred owner at the top of the list and will fail over to other nodes in list order unless specific failback settings are configured. For example, if you have configured the list as Node 4, Node 3, Node 2, and Node 1, and the workload is currently hosted on Node 3 and this node fails, the workload will fail over to Node 2 if it is available, and then to Node 1, before attempting to fail over to Node 4. You configure a role's preferred owners in the role's Properties dialog box. You can stop a workload from failing over or being moved to a specific node by ensuring that the node is not present on the workload's list of possible owners. You can configure this list using the Set-ClusterOwnerNode PowerShell cmdlet.
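
A minimal sketch of configuring preferred owners for a role and possible owners for one of its resources follows; the role, resource, and node names are hypothetical:

# Set the preferred owner order for a clustered role (hypothetical role and node names)
Set-ClusterOwnerNode -Group "SQLRole" -Owners node4,node3,node2,node1
# Restrict a resource so that it can never run on node1 (hypothetical resource name)
Set-ClusterOwnerNode -Resource "SQLRole-Disk" -Owners node4,node3,node2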

You configure whether the clustered role fails back to the preferred owner on the Failover tab of the cluster role’s Properties dialog box. When configuring failback, you need to:

  • Determine whether you want to prevent failback

  • Determine whether you want to have failback occur automatically as soon as the preferred owner is in a healthy state

  • Configure failback to occur within a certain number of hours of the preferred owner returning to a healthy state
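
These failover and failback settings are also exposed as properties of the cluster group. The sketch below assumes a hypothetical role name and shows the equivalent of allowing two failures in six hours and failing back only between 10 P.M. and 11 P.M.:

$role = Get-ClusterGroup -Name "SQLRole"      # hypothetical role name
$role.FailoverThreshold = 2                   # maximum failures in the period
$role.FailoverPeriod = 6                      # period (in hours) over which failures are counted
$role.AutoFailbackType = 1                    # 1 = allow failback to the preferred owner
$role.FailbackWindowStart = 22                # only fail back between 22:00
$role.FailbackWindowEnd = 23                  # and 23:00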

Node quarantine settings allow you to configure a node so that it is unable to rejoin the cluster if the node fails a certain number of times within a certain period. Configuring node quarantine blocks workloads from being placed back on a quarantined node until the reason for the repeated failure can be dealt with by a server administrator.
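
Quarantine behavior is controlled by cluster-wide properties. A minimal sketch of viewing and adjusting them follows; the values shown are illustrative, not recommendations:

# View the current quarantine settings for the cluster
Get-Cluster | Format-List QuarantineThreshold, QuarantineDuration
# Quarantine a node after three failures, and keep it quarantined for two hours (illustrative values)
(Get-Cluster).QuarantineThreshold = 3
(Get-Cluster).QuarantineDuration = 7200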

File Server failover clusters

The traditional File Server cluster role allows one node in the cluster to serve files from a highly available file share that is hosted on cluster storage. In the event that the node that services client requests fails, the role fails over to another node and clients accessing files will use the new node to perform those operations. Other than increasing resiliency of the workload, there is no performance benefit to increasing the number of nodes for a general-purpose file server workload.
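
A minimal sketch of deploying the traditional File Server role with PowerShell follows; the role name, storage name, and IP address are hypothetical:

# Create a general-purpose, highly available file server role (hypothetical values)
Add-ClusterFileServerRole -Name "FS01" -Storage "Cluster Disk 2" -StaticAddress 192.168.10.60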

Scale-Out File Servers

A Scale-Out File Server (SoFS) is a high-availability technology that allows you to share a single folder from multiple nodes of the same cluster. You can use SoFS to deploy file shares that are continuously available for file-based server application storage. This storage is suitable for hosting Hyper-V virtual machine files or Microsoft SQL Server databases with a level of reliability, availability, manageability, and performance that equates to what is possible with a storage area network.

Benefits of an SoFS deployment include:

  • Active-Active file share topology SoFS allows the same folder to be accessed from multiple nodes of the same cluster. An SoFS file share remains online should one or more cluster nodes fail or be taken down for planned maintenance.

  • Scalable bandwidth You can respond to a requirement for increased bandwidth by adding nodes.

  • Automatic rebalancing of clients SMB client connections are tracked on a per-file share basis, with clients being redirected to the cluster node that has the best access to the storage device used by the file share.

  • CHKDSK with zero downtime The Cluster Shared Volume File System, used with SoFS, allows CHKDSK operations to occur without affecting applications that have open handles on the file system.

You should consider SoFS file shares for the following scenarios:

  • Storing Hyper-V configuration and live virtual disks

  • Storing live SQL Server database files

  • Storing shared IIS configuration data

SoFS has the following requirements:

  • The storage configuration must be explicitly supported by failover clustering. This means that you must be able to successfully run the Cluster Validation Wizard before adding an SoFS.

  • SoFS requires Cluster Shared Volumes.
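
Assuming those requirements are met, a minimal sketch of deploying an SoFS and a continuously available share on a CSV follows; the role name, share name, path, and security group are all hypothetical:

# Create the Scale-Out File Server role (hypothetical name)
Add-ClusterScaleOutFileServerRole -Name "SOFS01"
# Create a continuously available SMB share on a Cluster Shared Volume (hypothetical path and group)
New-Item -Path "C:\ClusterStorage\Volume1\VMStore" -ItemType Directory
New-SmbShare -Name "VMStore" -Path "C:\ClusterStorage\Volume1\VMStore" -ScopeName "SOFS01" `
    -FullAccess "TAILWIND\Hyper-V-Hosts" -ContinuouslyAvailable $true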

Windows Server 2019 and Windows Server 2022 support an SoFS role called the Infrastructure File Server. An infrastructure SoFS uses a single namespace share for the Cluster Shared Volume drive. The benefit of the Infrastructure File Server role is that it allows Hyper-V hosts to communicate with the Infrastructure SoFS SMB server with guaranteed continuous availability. A failover cluster can only support a single infrastructure SoFS instance. To create an infrastructure SoFS, run the following PowerShell command:

Add-ClusterScaleOutFileServerRole -Cluster ClusterName -Infrastructure -Name InfrastructureSoFSName

Virtual machine failover clustering

One of the most common uses for failover clusters is hosting virtual machines. By deploying a workload such as SQL Server or Exchange on a highly available virtual machine, you can achieve high availability without the need for the application to be aware that it is now highly available. The virtual machine functions normally, provides services to clients on the network, and can switch between cluster nodes as necessary in the event that the individual cluster node hosting it requires maintenance or experiences some sort of failure. Building a Hyper-V failover cluster first involves creating a failover cluster and then adding the Hyper-V role to each node of the cluster.

You should use Cluster Shared Volumes to store virtual machines on a Hyper-V cluster because CSV allows multiple cluster nodes to manage a single shared storage device. This allows you to put multiple virtual machines on the same shared storage device but have those virtual machines hosted by different nodes in the failover cluster. Cluster Shared Volumes are mapped under the C:\ClusterStorage folder on cluster nodes.

When creating a new virtual machine on a failover cluster, first select which cluster node hosts the virtual machine. When creating a highly available virtual machine, specify the Cluster Shared Volume path as the location to store the virtual machine. If you have an existing machine that you want to make highly available, you can move the virtual machine to this path. As an alternative, you have the option to specify an SMB 3.0 file share as the storage location for the highly available virtual machine. Whether to select a Cluster Shared Volume or an SMB 3.0 file share depends on your organization’s storage configuration.

After the virtual machine is created, you can control it by using the Failover Cluster Manager console. The Move option in the Actions pane allows you to select the cluster node to which you want to move the virtual machine.
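
A minimal sketch of creating a VM on a CSV, making it highly available, and then live-migrating it to another node follows; the VM name, paths, sizes, and node name are hypothetical:

# Create the VM with its files on a Cluster Shared Volume (hypothetical names and sizes)
New-VM -Name "APP-VM01" -MemoryStartupBytes 4GB -Generation 2 `
    -Path "C:\ClusterStorage\Volume1" -NewVHDPath "C:\ClusterStorage\Volume1\APP-VM01\disk0.vhdx" `
    -NewVHDSizeBytes 60GB
# Make the VM highly available by adding it as a clustered role
Add-ClusterVirtualMachineRole -VMName "APP-VM01"
# Live-migrate the clustered VM to another node (hypothetical node name)
Move-ClusterVirtualMachineRole -Name "APP-VM01" -Node "HV-NODE2" -MigrationType Live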

In production environments, you should ensure that each Hyper-V host has an identical hardware configuration. However, in development environments, this is not always possible. If different processor types are used—for example, an Intel processor on one node and an AMD processor on another—you might have to perform a quick migration. Quick migration allows migration between nodes but does cause a disruption in client connectivity. You can allow migration between Hyper-V nodes with different processor types or versions by enabling the processor compatibility setting on the virtual machine.

VM Monitoring is a failover cluster feature that allows you to monitor the health state of applications that are hosted on a guest virtual machine. This monitoring allows the cluster to take remediation actions in the event that the application or service fails. You can configure VM Monitoring for any Windows service or event log event that occurs on a guest VM. To use VM Monitoring, you need to enable the Virtual Machine Monitoring rule on the guest VM. You can configure monitoring on a VM using the Failover Cluster Manager or the Add-ClusterVMMonitoredItem cmdlet. You can configure whether the failure of an application or service triggers a guest VM restart on the current node or the guest VM to fail over to another node by configuring the failover properties of the virtual machine.
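
For example, a sketch of monitoring the Print Spooler service in a guest VM might look like the following; the VM name is hypothetical, and the guest must already have the Virtual Machine Monitoring firewall rule enabled:

# Monitor the Spooler service inside the guest VM; a service failure triggers cluster remediation
Add-ClusterVMMonitoredItem -VirtualMachine "APP-VM01" -Service "Spooler"
# List the items currently being monitored for the VM
Get-ClusterVMMonitoredItem -VirtualMachine "APP-VM01"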

Configure cluster sets

Cluster sets are a feature available in Windows Server 2019 and later that allows clusters to be loosely federated so that you can migrate workloads between clusters with minimal downtime. When you implement cluster sets, you can combine smaller clusters into larger virtual clusters. These virtual clusters support virtual machine fluidity and a unified storage namespace, meaning that virtual machines can be moved between clusters in a cluster set as easily as they are moved between nodes in a traditional failover cluster. While virtual machines can be live-migrated between clusters, Windows Server cluster sets do not allow virtual machines to be configured to automatically fail over between clusters in a cluster set.

Only clusters running at the Windows Server cluster functional level 10 or higher can participate in cluster sets. This means that clusters can have nodes running the Windows Server 2019 and Windows Server 2022 operating systems but that you cannot join clusters running Windows Server 2016 or earlier Windows Server operating systems to a cluster set. All clusters in a cluster set must be members of the same Active Directory forest. If you are going to perform live migration of virtual machines between clusters in a cluster set, you must ensure that the nodes share the same processor architecture.

Cluster sets improve the failover cluster life cycle. Rather than adding and removing nodes from a single cluster when the node hardware is to be retired, you can add a new cluster to the cluster set, migrate workloads across from the original cluster to the new cluster, and then retire that original cluster. Cluster sets are currently supported by Microsoft for up to 64 cluster nodes, which is the same number of nodes supported in an individual failover cluster. That being said, there is no specific limit to the number of nodes that may exist within a cluster set, so going beyond 64 nodes in a cluster set is possible should you want to try it.

Cluster sets consist of a management cluster and member clusters. The management cluster is the cluster that holds the management role for the cluster set and also hosts the unified storage namespace for the cluster set Scale-Out File Server. The management cluster does not host workloads, such as virtual machines; its role is to manage the relationship between other clusters in the cluster set and to host the storage namespace. The role that the management cluster hosts is termed the Cluster Set Master (CS-Master). Member clusters hold the Cluster Set Worker, or CS-Worker, role. The namespace for the cluster set is hosted by a Scale-Out File Server cluster role running on the management cluster. When creating the Scale-Out File Server role for the management cluster, you use the Add-ClusterScaleoutFileServerRole PowerShell cmdlet with the Infrastructure parameter.
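
If you are experimenting with cluster sets, the general flow with the ClusterSet PowerShell cmdlets looks roughly like the following sketch; the cluster set name, management cluster name, namespace root, member cluster name, and infrastructure SoFS name are all hypothetical:

# Create the cluster set, specifying the management cluster and the unified namespace root (hypothetical names)
New-ClusterSet -Name "CSMASTER" -NamespaceRoot "SOFS-CLUSTERSET" -CimSession "MGMT-CLUSTER"
# Add a member (worker) cluster and its infrastructure Scale-Out File Server to the cluster set
Add-ClusterSetMember -ClusterName "CLUSTER1" -CimSession "CSMASTER" -InfraSOFSName "SOFS-CLUSTER1"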

Cluster sets support the configuration of fault domains and availability sets. Fault domains are collections of hardware and software where a single fault, such as a power outage, causes a set of cluster nodes to fail. If you are using stretch clusters to span datacenters, each datacenter could be configured as a separate fault domain. Availability sets allow you to configure workload redundancy across fault domains.

Create an Azure witness

A Cloud Witness has the Cluster Witness role hosted in Azure rather than at a third site. A Cloud Witness is suitable for multisite clusters. To configure a Cloud Witness, create a storage account in Azure, copy the storage access keys, note the endpoint URL, and then use these values in the Configure Cluster Quorum Settings Wizard when you specify a Cloud Witness.

A Cloud Witness should be blob storage in a general-purpose standard performance storage account. The Cloud Witness should use storage access keys for authentication rather than shared access signatures or Azure AD system–assigned managed identity. You should configure the Cloud Witness storage account to use locally redundant storage if the cluster is on-premises or hosted in Azure using a single availability zone. If the cluster uses multiple availability zones, you should choose zone-redundant storage.
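
You can also configure the Cloud Witness with PowerShell once the storage account exists; the storage account name and key below are placeholders:

# Configure a Cloud Witness using the storage account name and one of its access keys (placeholder values)
Set-ClusterQuorum -CloudWitness -AccountName "clusterwitnessstorage" -AccessKey "<primary-access-key>"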

Configure a floating IP address for the cluster

Windows Server clusters allow you to define an IP address as a cluster resource and have the IP address fail over between cluster nodes. To do this, you need support for the following:

  • Support for dynamic registration and deregistration of IP addresses (DHCP)

  • Ability to update the Address Resolution Protocol (ARP) caches of hosts attached to the subnet on which the IP address resides

DHCP and ARP are likely to automatically handle these elements of floating IP address configuration as this technology has been available since the release of Windows NT 4.
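
A minimal sketch of adding a static floating IP address resource to an existing clustered role follows; the role name, resource name, address, and subnet mask are hypothetical:

# Add an IP Address resource to a clustered role and give it a static address (hypothetical values)
Add-ClusterResource -Name "FS01-IP2" -ResourceType "IP Address" -Group "FS01"
Get-ClusterResource -Name "FS01-IP2" | Set-ClusterParameter -Multiple @{
    "Address"    = "192.168.20.50"
    "SubnetMask" = "255.255.255.0"
    "EnableDhcp" = 0
}
Start-ClusterResource -Name "FS01-IP2"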

Implement load balancing for the failover cluster

There are two primary methods for load-balancing cluster workloads. The first is used to ensure that a Hyper-V cluster that hosts VMs balances VM workloads equitably across the available nodes. This ensures that you don't end up with a situation where one node has a consistent 95 percent processor utilization due to VM workloads while other nodes in the cluster have relatively idle processors. In addition to VM load balancing, some workloads, such as Scale-Out File Servers, automatically balance incoming client requests across cluster nodes. Not all workloads support automatic balancing of client requests; the second type of load balancing, called network load balancing, distributes incoming client network traffic across a number of hosts.
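
On a Hyper-V cluster, the first method, VM load balancing, is controlled through cluster-wide properties. The following sketch shows those properties with illustrative values:

# 0 = disabled, 1 = balance when a node joins, 2 = balance when a node joins and every 30 minutes
(Get-Cluster).AutoBalancerMode = 2
# 1 = low aggressiveness (move VMs when a node exceeds 80% load), 2 = medium (70%), 3 = high (60%)
(Get-Cluster).AutoBalancerLevel = 1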