Designing High Availability with Microsoft Exchange Server 2010

  • 7/15/2010

Planning Cross-site Failovers

The high-availability improvements in Exchange 2010 make it easier to deploy cross-site failover solutions without third-party network and storage solutions. The secondary site can be used to handle primary site outages resulting from maintenance or from other, more serious failures. Even with these improvements, a multi-site deployment still requires careful planning to deploy and maintain successfully.

Cross-site DAG Considerations

The primary building block of a cross-site solution is the cross-site DAG. Extending a DAG between sites has several requirements, including the following:

  • Fewer than 250 milliseconds of latency between all DAG members. To ensure consistent DAG operations there should be minimal latency.

  • At least one domain controller in each site. Exchange requires a domain controller in each site where it is deployed; for redundancy at least two should be deployed.

  • At least one Client Access server in each site. To provide client connectivity to both sites at least one Client Access server must be deployed; for redundancy at least two should be deployed.

  • At least one Hub Transport server in each site. To provide e-mail transport to both sites at least one Hub Transport must be deployed; for redundancy at least two should be deployed.

  • Consider the impact of a failover on supporting services. The appropriate number and configuration of Client Access, Hub Transport, Edge Transport, and Unified Messaging servers, and of domain controllers, must be located at each site to support the maximum number of active mailboxes.

  • In the case of a complete datacenter failure:

    • Quorum must be reestablished. To mount databases, a quorum must be established within the cluster. If a majority of the members, including the file share witness, are unavailable, the DAG must be manually reconfigured to reestablish quorum.

    • Manual switchover process. To bring up the second site, the administrator must manually initiate the switchover. A complete datacenter switchover is not something to consider lightly from a business process standpoint. Manual intervention is required so that an administrator makes the decision to initiate a full datacenter switchover.
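The quorum requirement above can be illustrated with a small vote count. This is only a sketch, assuming each DAG member holds one vote and the file share witness adds a vote only when the member count is even; the function name and parameters are hypothetical and are not part of any Exchange tool.

```python
def has_quorum(members_up: int, total_members: int, witness_up: bool = False) -> bool:
    """Return True if the surviving voters form a strict majority.

    Assumption: one vote per DAG member, plus one vote for the file share
    witness when the DAG has an even number of members.
    """
    uses_witness = total_members % 2 == 0
    total_votes = total_members + (1 if uses_witness else 0)
    votes_up = members_up + (1 if uses_witness and witness_up else 0)
    # A strict majority means more than half of all possible votes.
    return votes_up > total_votes // 2
```

For example, in a four-member DAG with two members per site and the witness in the primary site, losing the primary site leaves two of five votes, so `has_quorum(2, 4, witness_up=False)` is false and the DAG must be manually reconfigured. Losing the secondary site instead leaves three of five votes, and quorum is kept.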

Cross-site Considerations for Client Access and Transport

When you deploy non-Mailbox servers to support a cross-site failover, you must plan for several issues. The Domain Name System (DNS) entries for Outlook Web App, Outlook Anywhere, and Autodiscover, as well as the inbound e-mail (MX) records, must be changed to point to the secondary site’s IP addresses. These record changes should be automated to provide the quickest return to service, because clients that connect to these services will fail until they receive the new addresses. You can speed up these changes by deploying DNS servers in multiple locations or by using third-party global-server load balancing. If you are using a hosted anti-spam or archiving service, that service must also be redirected to the new site.

Proper namespace planning is needed for the failover process to run smoothly. To do this, you must consider each datacenter as being active and choose a unique set of names for each Exchange service. This includes OWA, Post Office Protocol version 3 (POP3), Internet Message Access Protocol version 4 (IMAP4), Exchange Web Services, and Outlook Anywhere; however, it cannot include Autodiscover. Using this many names requires that you configure certificates to reflect the names that each site uses. Ensure that the certificates contain all required host names for services in both datacenters, or use a wildcard certificate. If you choose to use separate certificates for each datacenter, you must ensure that each certificate has the same certificate principal name. To reduce the impact on Outlook connections, you must run Set-OutlookProvider EXPR -CertPrincipalName msstd:<certificate principal name>. For more information on namespace planning see Chapter 4, “Client Access in Exchange 2010.”
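The certificate requirement can be checked on paper before a failover by comparing the names on a certificate with the host names planned for both datacenters. The following is a minimal sketch: the function names are hypothetical, the host names in the usage note are placeholders, and the wildcard rule assumed here is the common one in which `*` covers exactly one left-most label.

```python
def name_matches(cert_name: str, host: str) -> bool:
    """Return True if a certificate name (exact or wildcard) covers the host."""
    cert_name, host = cert_name.lower(), host.lower()
    if cert_name.startswith("*."):
        # Assumption: a wildcard covers exactly one left-most label.
        head, _, tail = host.partition(".")
        return bool(head) and tail == cert_name[2:]
    return cert_name == host


def missing_names(cert_names, required_hosts):
    """Return the required host names that no certificate name covers."""
    return [host for host in required_hosts
            if not any(name_matches(name, host) for name in cert_names)]
```

With placeholder names, `missing_names(["mail.contoso.com"], ["mail.contoso.com", "mail-dr.contoso.com"])` returns `["mail-dr.contoso.com"]`, which flags that the secondary datacenter's name is not yet on the certificate.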

Cross-site Switchover

Deploying a DAG across two sites can allow database copies to exist in two locations and provide site resiliency. This allows a single mailbox database to fail over and switch over to the secondary site. The client software will react to the changes in one of two possible ways when the active mailbox database is moved from one site to another. Understanding these reactions is important to ensuring that you perform the correct type of failover for your needs:

  • The Client Access server will directly connect to the Mailbox server.

  • The client will be redirected to connect to the second site, as shown in Figure 11-16.

FIGURE 11-16 Comparing cross-site connections and redirect

Exchange 2010 SP1 includes functionality to control the connection behavior of Outlook when a cross-site database failover or switchover occurs. By default, for temporary cross-site situations, Outlook stays connected to the Client Access server in the primary site, which connects across sites to the activated Mailbox server. Alternatively, the administrator can prevent all cross-site connections. A cross-site move is treated as permanent only when the administrator explicitly resets the database copy activation preference; otherwise it is treated as temporary.

In the initial release of Exchange 2010, the default behavior is to perform a direct connect from the Client Access server array in the first datacenter to the Mailbox server hosting the active copy in the second datacenter. Redirection occurs only when the RPCClientAccessServer property is changed on the mailbox database. In SP1, you can choose to enable or disable cross-site direct connect and define an activation preference for a database.

The new SP1 behavior is based on the following three properties:

  • Home server property in Outlook

  • Preferred database site (RPCClientAccessServer)

  • Active database site

Cross-site direct connect happens in the following scenarios:

  • If the Outlook profile home server value, preferred database site, and mounted database site are the same, Outlook will connect (or stay connected) to the Client Access server array, which will connect to the Mailbox server cross-site.

  • If the Outlook profile home server property value is the same as the preferred database site, the mounted database site is different, and cross-site connections are allowed, Outlook will connect (or stay connected) to the Client Access server array, which will connect to the Mailbox server cross-site.

  • If the Outlook profile home server property value is the same as the mounted database site, and different from the preferred database site, Outlook will connect (or stay connected) through the Client Access server array to the Mailbox server cross-site. This happens when you change the activation preference.

Redirection happens in the following scenarios:

  • If the Outlook profile home server property value differs from the preferred database site, and the preferred and mounted database sites are the same, the RPC Client Access service must redirect Outlook to the preferred and mounted database site and update the Outlook profile.

  • If the Outlook profile home server property value is the same as the preferred database site, and the mounted database site is different, the Client Access server will redirect Outlook to the mounted database site if cross-site connections are not allowed.
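The direct-connect and redirection scenarios listed above can be summarized as a small decision function. This is only a sketch of the rules as they are stated in this section, not the logic Exchange itself runs. The function name, parameter names, and return values are hypothetical, and the case in which all three sites differ is not described in the text, so the sketch reports it as not covered.

```python
def outlook_connection(home_site, preferred_site, mounted_site, allow_cross_site=True):
    """Return (action, site) for an Outlook client under the SP1 rules above.

    action is "direct" (stay on the Client Access server array in `site`),
    "redirect" (Outlook is sent to `site`), or "not covered".
    """
    if home_site == preferred_site:
        if mounted_site == preferred_site:
            return ("direct", home_site)      # all three values agree
        if allow_cross_site:
            return ("direct", home_site)      # temporary cross-site connection
        return ("redirect", mounted_site)     # cross-site connections not allowed
    if home_site == mounted_site:
        return ("direct", home_site)          # activation preference was changed
    if preferred_site == mounted_site:
        return ("redirect", mounted_site)     # profile points at the wrong site
    return ("not covered", None)              # scenario not described in the text
```

For instance, `outlook_connection("SiteA", "SiteA", "SiteB", allow_cross_site=False)` returns `("redirect", "SiteB")`, which matches the last redirection scenario.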

Using cross-site direct connect is often suitable when a single Mailbox server is undergoing maintenance or when there are other temporary issues that will be resolved in a short period of time. Redirection may be needed when multiple systems or the entire datacenter will undergo maintenance. Performing a redirection switchover forces the clients to reconnect to the secondary site and allows maintenance to be completed. If redirection is used to switch over, it must also be used for the switchback so that the clients reconnect to the primary site.

To enable cross-site direct connect, run Set-DatabaseAvailabilityGroup <DAG Name> -AllowCrossSiteRpcClientAccess:$true from the EMS. Conversely, to disable cross-site direct connect, run Set-DatabaseAvailabilityGroup <DAG Name> -AllowCrossSiteRpcClientAccess:$false from the EMS. To determine whether cross-site direct connect is enabled, run Get-DatabaseAvailabilityGroup <DAG Name> | Format-List as shown in Figure 11-17.

FIGURE 11-17 Retrieving the cross-site direct connect setting

Handling Datacenter Failures

To prepare for activating a secondary site if the primary site fails, you must enable datacenter activation coordination (DAC) mode on the DAG by running Set-DatabaseAvailabilityGroup <DAG Name> -DatacenterActivationMode:DagOnly. You should also set the alternate witness server and alternate witness directory to a server available in the second site. Together, these settings allow an administrator to activate the secondary site even if a majority of the DAG members remain unavailable in the failed site, and they prevent split-brain scenarios. The Active Directory site defines the datacenter boundaries; therefore, to enable DAC mode, the DAG must span at least two sites. A datacenter failure is a catastrophic event, and because the switchover process is not automatic, an administrator must make the decision to perform a full datacenter switchover. The datacenter switchover process includes the following steps:

  1. Evaluate the situation and then decide to perform a datacenter switchover.

  2. Configure the DAG to remove the primary site’s servers from the Windows Failover Cluster, but retain them in the DAG. This is done by running Stop-DatabaseAvailabilityGroup <DAG Name> -ActiveDirectorySite <Primary Site Name> -ConfigurationOnly in the primary site, if possible.

  3. Configure the DAG to use an alternate witness server and restore the functionality in the secondary site. To do this, first stop the cluster service on each DAG member in the secondary site, and then run Restore-DatabaseAvailabilityGroup <DAG Name> -ActiveDirectorySite <Secondary Site Name>.

  4. Start the cluster service on each of the servers in the DAG in the secondary site. The remaining Active Managers will then coordinate mounting databases in the secondary site.

  5. Adjust DNS records, if necessary, for Simple Mail Transfer Protocol (SMTP), OWA, Autodiscover, and Outlook Anywhere. These adjustments can be done manually or automatically using a third-party global-server load balancer.
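Because the order of these steps matters, it can help to generate a written runbook before an outage. The sketch below only assembles the steps and commands named above into an ordered list; it does not run anything. The function name is hypothetical, and the DAG, site, and server names in the usage note are placeholders.

```python
def switchover_runbook(dag, primary_site, secondary_site, secondary_members):
    """Return the datacenter switchover steps above as ordered text lines."""
    steps = ["Evaluate the situation and decide to perform a datacenter switchover"]
    steps.append(
        f"Run: Stop-DatabaseAvailabilityGroup {dag} "
        f"-ActiveDirectorySite {primary_site} -ConfigurationOnly"
    )
    # The cluster service is stopped on every secondary member before the restore.
    steps += [f"Stop the cluster service on {member}" for member in secondary_members]
    steps.append(
        f"Run: Restore-DatabaseAvailabilityGroup {dag} "
        f"-ActiveDirectorySite {secondary_site}"
    )
    steps += [f"Start the cluster service on {member}" for member in secondary_members]
    steps.append("Adjust DNS records for SMTP, OWA, Autodiscover, and Outlook Anywhere if necessary")
    return steps
```

Calling `switchover_runbook("DAG1", "SiteA", "SiteB", ["MBX3", "MBX4"])` yields eight lines that can be printed or pasted into an operations document and reviewed during switchover drills.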

After the primary site is recovered, you may choose to switch back to the primary site. This process includes the following steps:

  1. Evaluate the situation and decide to perform a datacenter failback. Verify that the primary datacenter is capable of hosting Exchange services.

  2. Reconfigure the DAG to add the DAG members in the primary datacenter back into the failover cluster by running Start-DatabaseAvailabilityGroup <DAG Name> -ActiveDirectorySite <Primary Site Name>.

  3. Configure the DAG to use the primary site’s witness server by running Set-DatabaseAvailabilityGroup <DAG Name> -WitnessServer <Primary Site Witness Server>.

  4. Manually reseed or allow replication to update the primary datacenter’s database copies, depending on the state of the primary site copy.

  5. Schedule downtime for the mailbox databases and then dismount them.

  6. Move databases back to the primary datacenter by running Move-ActiveMailboxDatabase <Database> -ActivateOnServer <Server in Primary Site>, and then mount the databases in the primary datacenter.

  7. Adjust DNS records, if necessary, for Simple Mail Transfer Protocol (SMTP), OWA, Autodiscover, and Outlook Anywhere. These adjustments can be done manually or automatically using a third-party global-server load balancer.

In Exchange Server 2010, DAC mode tasks are available to restore service in a standby datacenter while only a minority of the DAG members are available. Prior to SP1, DAC mode was limited to DAGs with at least three members, and in a three-member DAG, two members needed to be in the primary datacenter (Active Directory site). In SP1, DAC mode has been improved to support a two-member DAG with a member in each datacenter. As with all DAGs with an even number of members, this implementation requires a witness server to provide the additional vote to obtain quorum.

Cross-site Best Practices

You can use the best practices described in this section to ensure a successful, highly available, multiple-site configuration. First, you can reduce failover times by lowering the Time to Live (TTL) on DNS records for the Client Access server array, Client Access server URLs, and SMTP records. A low TTL reduces the time it takes DNS clients to discover the DNS entries that point to the secondary site. If any client computers use DNS services that are outside of your control, such as those of a regional ISP, be sure to verify that these services will honor the TTLs you set; otherwise, service availability for these users will be affected.

By default, a DAG compresses and encrypts transaction log shipping only between members on different subnets. If the DAG members in both sites share a subnet, or if you want log shipping compressed on every DAG network, you must explicitly enable compression and encryption on the DAG.
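The effect of the TTL recommendation can be estimated with simple arithmetic: a client that cached a record just before the change keeps the old address for up to one full TTL. The sketch below flags records whose worst-case cutover time exceeds a target. It assumes that resolvers honor the TTL; the function name, record names, and values are hypothetical.

```python
def slow_records(record_ttls, target_seconds, change_delay_seconds=0):
    """Return records whose worst-case cutover time exceeds the target.

    Worst case = time needed to change the record + one full TTL, because
    a client may have cached the old address just before the change.
    """
    return {
        name: ttl + change_delay_seconds
        for name, ttl in record_ttls.items()
        if ttl + change_delay_seconds > target_seconds
    }
```

With a 10-minute target and a 1-minute change delay, a record with a 3600-second TTL is flagged with a worst case of 3660 seconds, whereas a record with a 300-second TTL is not.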

Never wait until a failure occurs to ensure that everything works as designed. You should continually monitor and verify that all messaging-system components are functioning properly. This is done by monitoring all aspects of the Exchange Server environment to ensure that it is functioning normally, and that mailbox data is successfully replicating to the secondary site in a timely manner. You should also schedule periodic switchover tests to provide an additional level of preparation and to validate the configuration and operation of the cross-site switchover process. Switchover tests are usually coordinated events where the primary servers are shut down cleanly to reduce the possibility of data loss. When performing these drills be sure to verify that you are not missing steps that would be required in a real switchover scenario where the primary datacenter becomes unavailable.

You should also follow a change management process to ensure that each Mailbox server in the DAG, each Client Access server, and each Hub Transport server is configured identically with the same updates applied. Doing so reduces the possibility of incompatibilities and unexpected behavior if a failover or switchover occurs.

Provide adequate bandwidth for replication traffic. Replication always flows from the source copy to each target copy; therefore, multiple copies in the remote site mean that more bandwidth is required. To reduce the amount of bandwidth needed, be sure that compression is enabled on the log shipping traffic for the DAG. The Exchange 2010 Mailbox calculator can be used to help estimate the bandwidth required.
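A first-order estimate of the replication bandwidth can be worked out by hand before turning to the calculator. The sketch below assumes 1 MB transaction log files, one stream of logs per remote copy, and a compression saving that you supply; the function name and the figures in the usage note are hypothetical, and the result is a rough average rather than a sizing recommendation.

```python
LOG_FILE_MB = 1  # Exchange 2010 transaction log files are 1 MB each


def replication_mbps(logs_per_hour, remote_copies, compression_savings=0.0):
    """Estimate average cross-site log shipping bandwidth in megabits per second.

    compression_savings is the fraction removed by compression (0.0 to 1.0);
    it is an input you must measure or assume, not a value Exchange reports here.
    """
    megabytes_per_hour = logs_per_hour * LOG_FILE_MB * remote_copies
    compressed_megabytes = megabytes_per_hour * (1.0 - compression_savings)
    # 8 bits per byte, 3600 seconds per hour.
    return compressed_megabytes * 8 / 3600
```

For example, 3,600 logs generated in the peak hour with one remote copy and no compression works out to 8 Mbps. Adding a second remote copy doubles that figure unless compression recovers the difference.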

Finally, you should have each DAG node connected to multiple networks. These multiple networks provide communication redundancy between DAG nodes and segregate MAPI and replication communications. To reduce network congestion and potential communications problems, you should not allow the DAG networks to route between each other. For example, you would not allow the replication network to communicate with the MAPI network or vice versa. This communication should be blocked by the network equipment, with a router or a firewall.
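One simple planning check for network segregation is to confirm that the subnets assigned to the DAG networks do not overlap. The sketch below uses Python's standard `ipaddress` module; the function name, network names, and address ranges are hypothetical, and all networks are assumed to use the same IP version. It checks only the address plan; blocking routing between the networks remains the job of the network equipment.

```python
import ipaddress


def overlapping_dag_networks(networks):
    """Return pairs of network names whose subnets overlap.

    networks maps a DAG network name to its subnet in CIDR notation.
    """
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    names = sorted(parsed)
    return [
        (first, second)
        for index, first in enumerate(names)
        for second in names[index + 1:]
        if parsed[first].overlaps(parsed[second])
    ]
```

For example, a MAPI network of 10.0.0.0/16 and a replication network of 10.0.5.0/24 are reported as overlapping, whereas 10.0.1.0/24 and 192.168.50.0/24 are not.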

Multi-Site Storage Architecture

You must consider a number of factors when determining the hardware needed to support your highly available Exchange deployment, as discussed in detail in Chapter 13, “Hardware Planning for Exchange Server 2010.” Having multiple database copies requires storing data on multiple disks; this reduces the requirement for having RAID-protected storage because the data is redundantly stored. Deployment decisions for RAID or JBOD should be based on cost, performance, IT operational maturity, and required resilience. To provide for storage failures, redundancy is either provided by having additional database copies or by using RAID on the storage. Table 11-6 summarizes instances when RAID or JBOD should be considered.

TABLE 11-6 Choosing Between RAID and JBOD

| | 2 HIGH-AVAILABILITY COPIES | 3+ HIGH-AVAILABILITY COPIES | 2+ HIGH-AVAILABILITY COPIES / DATACENTER | 1 LAGGED COPY | 2+ HIGH-AVAILABILITY COPIES AND 1+ LAGGED COPIES / DATACENTER |
|---|---|---|---|---|---|
| Primary Datacenter | RAID | RAID or JBOD | RAID or JBOD | RAID | RAID or JBOD |
| Secondary Datacenter | RAID | RAID or JBOD | RAID or JBOD | RAID | RAID or JBOD |