Exam Ref 70-342 Advanced Solutions of Microsoft Exchange Server 2013 (MCSE): Design, Configure, and Manage Site Resiliency

  • 2/10/2015

Objective 2.3: Design, deploy, and manage site resilience for transport

So far in this chapter, you have considered site resiliency for mailbox databases by using database availability groups, and for client access servers by using consistent URLs and load balancers. The final service to consider is transport, or more specifically the SMTP protocol.

In Exchange 2010, the transport service was a role that could coexist with other roles or, though it was not recommended, be installed as a dedicated role. In Exchange 2013, because server hardware is now at a point where installing all the roles on the same server is the better option, the transport role has disappeared as an installation option, but the transport services still exist on both the CAS role and the mailbox role. Because transport remains part of Exchange Server 2013, you need to consider it for site resiliency.

This objective covers how to:

  • Configure MX records for failover scenarios
  • Manage resubmission and reroute queues
  • Plan and configure send/receive connectors for site resiliency
  • Perform steps for transport rollover

Configuring MX records for failover scenarios

Email delivery in Exchange Server uses a number of different configuration options such as send connectors to deliver email away from a server or receive connectors to accept email onto a server. For both of these connectors, and for inbound and outbound connectors in Office 365 (which is covered in a later chapter), mail exchanger (or MX) resource records play an important part.

When an email is being sent from the Internet to Exchange Server (or any other email system), you either need a configuration known as a smarthost or MX records in public DNS. A smarthost value is the name or IP address of the server that you want to send the email to. This is controlling mail flow directly. To avoid managing how everyone on the Internet wants to send you email on an individual basis, if you publish in your public DNS zone an MX record that ultimately resolves to the IP address of your inbound email server, users on the Internet can email your users easily.

When you create an MX record in DNS you need to provide the following:

  • An A record in DNS that resolves to the external IP address of the system that receives email for your domain from the Internet. If you have a spam and virus filtering device in front of your Exchange Server, the IP address will be the external IP address of this device. If you subscribe to a cloud hosted spam and virus filtering service, you do not need to create the A record because the filtering company will have created it already, but you will need to know the name of this A record.
  • An MX record created in your domain, typically with no host name, that points to the A record described above. If you use a cloud hosted spam and virus filtering service, the provider will give you the name of the A record to use.
  • A priority value. This is a number that you allocate to each MX record and that controls the order in which multiple records, if you have them, are used.

In Figure 2-19 you can see the MX record creation dialog box from a Windows DNS server. This dialog box shows that the MX record for contoso.com has a priority of 10 and points to the Microsoft cloud hosted spam and virus filtering service called Exchange Online Protection (EOP). The value for the MX record in this case is provided by EOP to the Contoso administrators.

Figure 2-19 Adding an MX record in DNS

Different public DNS providers will have different ways for you to add MX records, but they will all require these three pieces of information. If you are hosting your own mail server or SMTP filtering service on premises, it is important to note that the host name that the MX record refers to must be an A record (or an AAAA record for IPv6) and cannot be a CNAME record.

When a sending SMTP server wants to deliver to your domain, it looks up the MX record in public DNS for your domain, resolves the host name in that record to the IP address of the A record, and then connects to that IP address. You can test this from the command line with Nslookup. The command to type is nslookup -q=mx domain.com. Figure 2-20 shows you an example output.

Figure 2-20 Nslookup output for an MX record query

You can see from this figure that there is an MX record of priority 10 and that resolves to three A records each with a different IP address. This is only one way to do high availability of SMTP services, because an SMTP server will automatically pick one A record from the list (usually the first one) and connect to that server. If that server does not respond, the next record on the list will be used. Within an Active Directory site, Exchange Servers use the same technique to connect to other servers. They resolve the IP addresses of all the Exchange mailbox servers in the site and then connect to one of them, and if that fails, connect to another. You do not need MX records within the Active Directory site, but the principle of connection is the same.

Behind each of the multiple IP addresses that this example MX record points to could be an SMTP server, or it could be a load balancer and a considerable number of servers. Because SMTP manages its own load balancing, you can publish a single IP address per server on your external firewall direct to each inbound SMTP server that is able to receive from the Internet. If you are short of available IP addresses, you would use a load balancer. A load balancer can be used to remove connections from a server that is not responding, or to keep the number of connections across all of your servers about the same, but because SMTP has its own retry functionality built into the protocol, a load balancer is not always required.

In addition to having multiple A records behind a single MX record, you can have multiple MX records each pointing to a different SMTP host. If these records all have the same priority value, they will be used equally by the sending SMTP server. Imagine for example a domain with three MX records, all with priority 10, with the following hosts:

  • mail-us.contoso.com A 131.107.2.200
  • mail-gb.contoso.com A 131.107.6.150
  • mail-hk.contoso.com A 131.107.9.99

When this domain is queried for its MX records using Nslookup, this would result in the answer shown in Figure 2-21.

Figure 2-21 Nslookup response to an MX query with more than one MX record

You can see from Figure 2-21 that each MX record is shown, and each A record IP address is shown. Because the DNS server in this example supports round robin, the A records and their associated IP addresses will be returned in a different order each time. Therefore, each querying SMTP server would connect to the first returned IP address and send its email. As you can see from the example, this means that inbound emails would be distributed across the world irrespective of the sending server or the recipient, because the order of the returned IP addresses is decided by DNS and takes no account of the source or destination of the email.

Taking the above example, if the Hong Kong office was the primary office and the London and New York offices were to be used to receive email if the Hong Kong office went offline, you would either give mail-hk.contoso.com a higher priority than the other two records, or decrease the priority of the other two records. When talking of MX records, the lower the priority value, the higher the priority of the server. This means that an MX record with a priority of 10 will be connected to before any MX record with a priority of 20. The MX server with a priority of 20 will only be connected to when the 10 priority server does not respond. This can be seen in Figure 2-22. In this figure the Hong Kong office has a priority of 10 and the other two offices have decreasing priority (i.e. the numbers increase). Therefore, inbound email via MX record lookup will always go via mail-hk.contoso.com.

Figure 2-22 Nslookup showing different priority results for an MX record lookup

Therefore, for inbound site resilient email delivery, you should have multiple MX records each of different priorities with the highest priority/lowest value being the A record to the primary server.
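The ordering rule described above can be sketched in a few lines of Python. This is an illustration only, not how any particular SMTP server is implemented: the function name is hypothetical, and the preference values of 20 and 30 for the London and New York hosts are assumed, because the text only states that Hong Kong has a priority of 10.

```python
import random
from collections import defaultdict

def delivery_order(mx_records, rng=None):
    """Return host names in the order a sending server would try them.

    mx_records is a list of (preference, host) tuples. Lower preference
    values are tried first. Hosts that share a preference are shuffled,
    which spreads connections across equal-priority records.
    """
    rng = rng or random.Random()
    by_preference = defaultdict(list)
    for preference, host in mx_records:
        by_preference[preference].append(host)
    ordered = []
    for preference in sorted(by_preference):
        hosts = by_preference[preference][:]
        rng.shuffle(hosts)
        ordered.extend(hosts)
    return ordered

# Assumed preference values: only the value 10 for Hong Kong comes from the text.
records = [
    (20, "mail-gb.contoso.com"),
    (10, "mail-hk.contoso.com"),
    (30, "mail-us.contoso.com"),
]
print(delivery_order(records))
# ['mail-hk.contoso.com', 'mail-gb.contoso.com', 'mail-us.contoso.com']
```

With three distinct preference values the order is fixed. If all three records shared the value 10, any of the hosts could come first, which is the equal-priority behavior described earlier.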

When you use a spam and virus-filtering service, there are different techniques to direct email to your preferred server after it has been sent through the filter, and to automatically use a secondary server in the event that the first becomes unavailable. The exact configuration will depend on the vendor of the service, but adding multiple smarthosts or IP addresses with a priority similar to that used in MX records is a common implementation.

Microsoft Exchange Online Protection uses a different technique for emails that clear the filter and are due to be delivered onward to an on-premises server. In Exchange Online Protection the outbound connector is used, and its smarthosts value is used to determine the IP address of the target server. If the smarthosts value is a name (and not an IP address), this name will first be looked up as an MX record, and then resolved as an A record. This means that you can add a single smarthosts value that can be priority based. This is done by creating multiple MX records for inbound email that are different from the MX record for your domain (as that needs to point to Exchange Online Protection), either by creating an MX record for a host name in your domain, or for a separate domain. In Figure 2-23 you can see the output from two Nslookup commands, the first for the domain and the second for onprem.contoso.com. The MX for contoso.com goes to EOP and the MX for onprem.contoso.com goes to mail-hk.contoso.com with a priority of 10 and, if that host is offline, to mail-gb.contoso.com, because that record has a lower priority (a higher value). In EOP the smarthosts value for the outbound connector would be onprem.contoso.com.

Figure 2-23 MX records for EOP and a site resilient EOP smarthost

In this example, EOP will deliver all filtered email to 131.107.9.99 (mail-hk.contoso.com) as the smarthost value in the connector reads onprem.contoso.com. If mail-hk goes offline, it will automatically use mail-gb.contoso.com, but while the mail-hk host is online it will never use the mail-gb host.
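The MX-then-A lookup order described for the smarthosts value can be modeled with two small lookup tables. The sketch below is not EOP code and makes no DNS queries: the function name and the dictionaries are hypothetical, the IP addresses are the ones listed earlier for these hosts, and the preference value of 20 for mail-gb is assumed.

```python
def resolve_smarthost(name, mx_table, a_table):
    """Return candidate IP addresses for a smarthost name, MX lookup first.

    mx_table maps a domain to a list of (preference, host) tuples.
    a_table maps a host name to a list of IP addresses.
    """
    mx_records = mx_table.get(name)
    if mx_records:
        # MX records exist: try hosts from lowest preference value upward.
        hosts = [host for _preference, host in sorted(mx_records)]
    else:
        # No MX record: fall back to the A record of the name itself.
        hosts = [name]
    addresses = []
    for host in hosts:
        addresses.extend(a_table.get(host, []))
    return addresses

# Stand-in DNS data; the preference of 20 for mail-gb is an assumed value.
mx_table = {
    "onprem.contoso.com": [
        (20, "mail-gb.contoso.com"),
        (10, "mail-hk.contoso.com"),
    ],
}
a_table = {
    "mail-hk.contoso.com": ["131.107.9.99"],
    "mail-gb.contoso.com": ["131.107.6.150"],
}
print(resolve_smarthost("onprem.contoso.com", mx_table, a_table))
# ['131.107.9.99', '131.107.6.150']
```

The first address in the list is the Hong Kong host, and the London address is only reached if the first connection fails.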

Managing resubmission and reroute queues

Within Exchange Server 2013, MX records are not used to deliver email between DAGs, sites, and servers. Instead, a list of all the mailbox servers in the target DAG or, if the target is not in a DAG, the target Active Directory site is used in a round robin fashion to select a server to connect to. Each connection that a given transport server makes is logged in the connectivity log files. There is a connectivity log file for each transport service, including FrontEndTransport on the CAS role and Hub and Mailbox Transport on the mailbox role. An example of a connectivity log file can be found at C:\Program Files\Microsoft\Exchange Server\V15\TransportRoles\Logs\Hub\Connectivity for the transport service on the mailbox role. If a server is not responding and it is selected as a target for SMTP connections, this will be logged in the connectivity log as an attempt to connect. Because the connection will fail, the source server will pick another server, if there is another available, and connect to it. Therefore, for very simple and easy site and cross-site resiliency in Exchange Server, you should have more than one mailbox role (or multi-role) server in each Active Directory site that contains Exchange Servers.

In the event that all the target servers in a DAG or site are offline, or the target network is unavailable, Exchange Server will attempt delivery to the nearest server to the point of failure. This is done by taking the Active Directory site link costs to the target site, or to the site that contains the nearest member of the target DAG, and connecting to the first available server along that least cost path. An example of this is shown in Figure 2-24. In this figure you can see five sites for a European company where the faster network, and therefore the Active Directory replication links, go through Paris. For fault tolerance, there are slower backup links direct to some regional offices, but with more costly Active Directory site links, they are not used unless the lower cost link is unavailable. The Zurich site is down so the Exchange Server in Zurich is unreachable. A user in London sends an email to a recipient whose mailbox is on the Zurich server. The least cost path for this email to take is London to Paris to Zurich, which would have a cost of 20. All other possible routes would have costs of 50 or higher and so Exchange Server will not use them because it only uses the least cost route. Remember that though the least cost route is calculated, Exchange Server will still attempt to connect directly to the target server in the destination site or DAG before any server on the route.

Figure 2-24 Least cost routing when a site is unavailable

Therefore, with Zurich being offline, the connection from London direct to Zurich will fail and so the Exchange Server in London will connect to one of the two Exchange Servers in Paris. If the Zurich site remains offline for a while, further emails from London to Zurich will begin to queue approximately evenly across both servers in Paris. Emails from Madrid and Berlin will also take the least cost route, which for them is the route via Paris. This means that emails from Berlin and Madrid senders will also queue in Paris. The direct links from Berlin and also from Madrid to Zurich are more expensive than the links via Paris, and so are not used.

On the Paris servers, the messages will queue and the queue will be retried every minute for five minutes, and then every 10 minutes until the messages time out at two days. Once the Zurich site is back online and the Exchange Server in Zurich is able to receive connections from Paris and the other sites, mail flow will resume within the retry time of 10 minutes.

Using Exchange Management Shell and the Retry-Queue cmdlet, a retry can be forced rather than waiting for the next retry interval. The next retry interval can be determined by using the Get-Queue cmdlet.

Considering Figure 2-24, imagine a scenario where the link from Paris to Zurich is unavailable, but the site is up and the separate links from Berlin and Madrid are online. In this scenario, email from London will still queue in Paris, but email from Madrid and Berlin will connect successfully. This is because although the least cost route from, for example, Madrid is Madrid to Paris to Zurich at a cost of 20, the Exchange Server in Madrid will first attempt a direct connection to the server in Zurich and will connect successfully.

The only way to get the emails queued on the Paris servers (from Paris and London senders) to Zurich while the link is down would be either to change the cost of the Paris-Zurich link to more than 40 (so that the Paris to Berlin to Zurich route is less expensive), or to remove the Paris-Zurich link.

If you remove the link rather than increase the cost, note the following:

  • The Paris, Madrid, Zurich route will not be used, even though it also has a cost of 40, because the Paris, Berlin, Zurich route will be chosen as the least cost route. When more than one route has the same lowest cost, the hop count is used to choose between them. In this example both routes have two hops, so the hop count does not produce a single least cost route either. When the routes still tie on hop count, the route with the alphabetically lowest site name is chosen. Therefore, in this example Paris, Berlin, Zurich will always be the least cost route over Paris, Madrid, Zurich given the above costs and hop counts, because Berlin is lower alphabetically than Madrid.
  • When any link cost or other factor that is used to determine least cost route is changed, only new messages are automatically evaluated for these changes. Existing messages already queued on a server have passed through the routing stage of the server and are waiting to connect to the determined next server. Their route will not automatically be recalculated.
  • The IP networking and routing is the same as the Active Directory site links. London does not have a direct site link to Zurich and so cannot connect to Zurich directly. There is no valid route from London, but there is a valid route to the other sites, and so Paris can be connected to.
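The selection order described above (lowest total cost, then fewest hops, then alphabetical site name) can be illustrated with a small route search. This is a sketch under stated assumptions, not Exchange's routing code. The individual link costs are assumed values chosen to match the totals of 20 and 40 quoted in the text, so Figure 2-24 may show different numbers, and comparing the full list of site names is a simplification of the alphabetical tie-break.

```python
def least_cost_route(links, source, target):
    """Pick a route by lowest total cost, then fewest hops, then site names.

    links maps a (site_a, site_b) tuple to the cost of that site link.
    Links are treated as usable in both directions.
    """
    neighbours = {}
    for (site_a, site_b), cost in links.items():
        neighbours.setdefault(site_a, {})[site_b] = cost
        neighbours.setdefault(site_b, {})[site_a] = cost

    candidates = []

    def walk(site, path, cost):
        if site == target:
            # Tuple order gives the comparison order: cost, hops, names.
            candidates.append((cost, len(path) - 1, path))
            return
        for next_site, link_cost in neighbours.get(site, {}).items():
            if next_site not in path:
                walk(next_site, path + [next_site], cost + link_cost)

    walk(source, [source], 0)
    if not candidates:
        return None
    cost, _hops, path = min(candidates)
    return path, cost

# Assumed link costs, consistent with the route totals quoted in the text.
links = {
    ("London", "Paris"): 10,
    ("Paris", "Zurich"): 10,
    ("Paris", "Berlin"): 10,
    ("Paris", "Madrid"): 10,
    ("Berlin", "Zurich"): 30,
    ("Madrid", "Zurich"): 30,
}
print(least_cost_route(links, "London", "Zurich"))
# (['London', 'Paris', 'Zurich'], 20)

# Remove the Paris-Zurich link: two routes now tie on cost and hop count.
without_direct = {pair: c for pair, c in links.items() if pair != ("Paris", "Zurich")}
print(least_cost_route(without_direct, "Paris", "Zurich"))
# (['Paris', 'Berlin', 'Zurich'], 40)
```

With the direct link removed, the Berlin and Madrid routes both cost 40 over two hops, and the Berlin route wins only on the alphabetical comparison.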

To fix this issue without fixing the problem with the Paris-Zurich routers and link, the cost of the Paris-Zurich site link could be increased from 10 to 100 (Set-ADSiteLink Paris-Zurich -ExchangeCost 100). If this was to happen, new emails from London would go London to Berlin to Zurich at a cost of 60, and emails from Paris would go Paris to Berlin to Zurich at a cost of 40. The London server would still attempt to connect directly to the Zurich server, but as it does not have connectivity, it would now connect to Berlin as that is the hop before Zurich on the least cost route, unlike Paris which was the previous hop when the link costs were lower. Once queued at Berlin, the email would be delivered successfully to Zurich and bypass the broken connection between Paris and Zurich.

If London had IP connectivity direct to Zurich, there would be no need to change the costs as the messages would not queue in Paris.

In the above scenario where messages from London and Paris are queued on the Paris server and the cost of the Paris-Zurich link is changed to 100, you will also need to force the emails in the queue to be recalculated for routing so that they can be sent to Berlin. To do this, you would use the Retry-Queue cmdlet with the -Resubmit $true parameter. For example, if you ran Get-Queue you might see 100 emails queued for the Zurich Active Directory site with a queue identity of PARIS1\1234, where PARIS1 is the server name and 1234 is the queue ID. In this case, you would change the cost of the Paris-Zurich site link and then run Retry-Queue PARIS1\1234 -Resubmit $true. In the example shown in Figure 2-24, you would need to repeat this cmdlet with the correct queue ID on PARIS2 as well (for example Retry-Queue PARIS2\554 -Resubmit $true where 554 is the queue ID on server PARIS2). This can be seen in Figure 2-25.

Figure 2-25 Resubmitting messages that are queued after site link costs changed

In the event that there are messages queued on a server, and you need to take that server down for maintenance, in Exchange Server 2013 there is the Redirect-Message cmdlet. This cmdlet will actively move messages from one Mailbox server (that is where the transport queue lives) to another server. To use Redirect-Message, you need to stop the server receiving inbound messages; otherwise, when the redirection is complete, it will be able to accept new messages again. Once the redirection is complete, you can run the required maintenance on the server knowing that the server will not be a valid target for incoming messages.

There are two cmdlets needed to take a server into maintenance from a transport perspective. These are:

  • Set-ServerComponentState <SourceServerName> -Component HubTransport -State Draining -Requester Maintenance
  • Redirect-Message -Server <SourceServerName> -Target <TargetServerName>

Shadow and poison queues are never redirected to the other server. Therefore, for site resiliency, ensure that any server that is taken down for maintenance is back online as soon as possible. That server might hold shadow messages on behalf of other servers in the delivery group, and those shadow messages might be needed if the other servers are lost or a lagged copy is rolled back. Shadow messages are always stored on another server in the same delivery group as the receiving server. The delivery group is either the other servers in the Active Directory site (if the server is not a member of a DAG), or the other members of the DAG, or, if the DAG has members in more than one site, the members of the DAG in the other site. For more details on shadow redundancy, see later in this chapter.

Planning and configuring send/receive connectors for site resiliency

As you have seen in the previous sections, you add additional MX records or IP addresses for the same host that the MX record uses to provide site resiliency for inbound email from the Internet. Within Exchange Server, you just need to have more than one mailbox role server to receive email from other Exchange Servers.

On the Exchange Server itself there are numerous receive connectors to accept the inbound email. The default receive connectors are as follows:

  • Client Access Role:

    • Default Frontend Servername
    • Outbound Proxy Frontend Servername
    • Client Frontend Servername
  • Mailbox role:

    • Default Servername
    • Client Proxy Servername

Unlike Exchange 2010, there is no requirement to configure any settings to receive anonymous emails that are destined for your Exchange organization. No configuration is needed because the Default Frontend Servername receive connector accepts anonymous connections by default.

As each client access role server has a receive connector that accepts anonymous connections, configuring inbound mail flow for site failover scenarios comes down to load balancer or MX, server name, or IP address configuration. For inbound email from the Internet, you need a way to ensure that when a site goes offline, the standby site can take over inbound email easily. This is best done with two or more MX records of differing priority as discussed earlier, though geo-load balancers can be used as well for larger deployments.

For internal mail flow that starts outside the Exchange organization, for example application servers and devices that generate email notification and reports, these need to be configured with the IP address of an Exchange client access role server because these servers have the frontend transport service and anonymous submission should you need it. The problem with configuring applications and devices with an IP address is that you need to change it on all of the applications and devices when failover occurs, or when you upgrade to a newer version of Exchange, or add new servers. Therefore, the best way to control mail flow within the network inbound to Exchange is to use an MX record and an A record pointing to multiple IP addresses, and to use an FQDN that resolves to this MX or A record within the applications and devices.

This FQDN allows the IP to be changed in the event of a failure in the current target server, or to have multiple IP addresses and to make use of the native load balancing within the SMTP protocol. If you have applications or devices that can only take an IP address, or if you have multiple IP addresses in an A record and this negatively impacts these applications, you should use a load balancer to distribute the load and to allow simple failover to a different server or site in the event of an outage, or when it comes time to migrate to a new server.

Performing steps for transport rollover

In the event of an outage of either a CAS or mailbox role server, there may be impacts to message delivery that will need resolving, or impacts to messages that were in the queue on the server that failed.

In the event of server failure, any technology that directs connections to an alternative server will be sufficient. As discussed in previous sections of this book, this can include load balancers, or for inbound emails, more than one MX record or more than one IP for the MX or A record being used. In scenarios where you have one or more of these systems in place, new connections made after the time of failure will fail to connect to the box suffering outage, but will succeed in connecting to alternative servers.

If a message was in transit at the time of the failure, a different scenario needs to be looked at first. All mail flow into Exchange Server 2013 should go via a client access server, and receive connectors on mailbox servers should not be modified to receive external traffic. When an email is received by the frontend transport service on a CAS role server, listening on TCP port 25, the initial connection is made and the SMTP envelope commands are accepted. Upon receiving the RCPT TO command, the frontend transport service queries Active Directory to determine the mailbox database of these recipients. If these recipients are mailboxes, the frontend transport service makes a connection to the DAG or site (if the mailbox is not in a DAG) that contains the active mailbox copy. If there is more than one recipient, an evaluation of up to the first 20 recipients is made to determine which DAG or site should be used for the majority of these first 20 recipients. If the recipients are distribution lists or other mail objects (mail users, and so on), a connection is made to a mailbox role server in the same site as the CAS server.
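The majority evaluation over the first 20 recipients can be pictured as a simple count. The sketch below is illustrative only: the function name is hypothetical, each recipient is represented just by the name of the DAG or site that holds its active mailbox copy, and how a tie is broken is an assumption (the first group seen wins).

```python
from collections import Counter

def pick_delivery_group(recipient_groups, sample_size=20):
    """Return the DAG or site that holds most of the first sample_size recipients.

    recipient_groups lists, in recipient order, the delivery group that
    holds each recipient's active mailbox copy.
    """
    sample = recipient_groups[:sample_size]
    if not sample:
        return None
    # most_common keeps first-seen order for equal counts (assumed tie-break).
    group, _count = Counter(sample).most_common(1)[0]
    return group

# 25 recipients: DAG2 holds the most overall, but only the first 20 are counted.
recipients = ["DAG1"] * 11 + ["DAG2"] * 14
print(pick_delivery_group(recipients))
# DAG1
```

In this example DAG2 holds 14 of the 25 mailboxes, but DAG1 is chosen because it holds 11 of the first 20 recipients.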

Once this connection is made, the body of the message is passed through the CAS frontend transport service without further modification or inspection. It is passed to the selected mailbox role server and the transport service on that server.

Upon being received by this transport service on the mailbox role server, and before the sending server in front of the CAS role has received any acknowledgement of receipt, the transport service connects to another transport service in the same delivery group. That is, if this is a cross-site DAG member, it will attempt to connect to a member of the same DAG in a remote site. If the DAG is not cross-site, or there is no response from up to four remote DAG members, it will connect to a DAG member in the same DAG and same site. If the mailbox server is not a DAG member, it will attempt to connect to up to two servers in the local Active Directory site. These values for cross-site and local site retries can be configured via Set-TransportConfig and can be seen in Figure 2-26.

Figure 2-26 Get-TransportConfig showing shadow redundancy related settings

Once connected to a second transport service on another mailbox server in the same delivery group, a copy of the message, the shadow message, will be sent to this server to be kept for two days (or the ShadowMessageAutoDiscardInterval from Get-TransportConfig if this has been changed). The second transport server acknowledges the first for successful receipt of the shadow message. The first transport server acknowledges the frontend transport proxy service of successful receipt of the message and then, and only then (unless connections time out), does the frontend transport service acknowledge the sending SMTP server.

This sequence of events means that should a server fail during receipt of a message or receipt of the shadow, the preceding server will reconnect automatically to redeliver.

Once the transport service has the message in its queue, it will connect to the mailbox transport service on the server containing the active copy of the mailbox. The message is handed to the mailbox transport service over SMTP on TCP port 475 on the server holding the active mailbox copy. If this server should fail during delivery, the holding transport server can redeliver. If the holding transport server should fail while queuing this message, the shadow holder will promote its copy of the message to the primary copy and deliver it if it has not received acknowledgement of successful delivery from the primary transport service after three hours. Three hours is the ShadowHeartbeatTimeoutInterval multiplied by the ShadowHeartbeatRetryCount, or 15 minutes x 12 = 3 hours.
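The three hour figure is simple arithmetic, shown here with Python's timedelta type. The variable names mirror the two settings named in the text and the values are the ones the text quotes; check Get-TransportConfig on your own servers rather than treating these as guaranteed defaults.

```python
from datetime import timedelta

# Values as quoted in the text above.
shadow_heartbeat_timeout_interval = timedelta(minutes=15)
shadow_heartbeat_retry_count = 12

# Time before the shadow holder promotes its copy and delivers it.
promotion_delay = shadow_heartbeat_timeout_interval * shadow_heartbeat_retry_count
print(promotion_delay)
# 3:00:00
```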

Both the transport service holding the primary copy of the message and the transport service holding the shadow copy persist the message in the mail queue database for two days. This persistence of messages in the mail queue database is known as the Safety Net. In the event of delivery to the database failing, or the database failing over to a passive copy and suffering loss of log files, and therefore loss of messages, the copy of the message in the transport mail queue database (the Safety Net) can be redelivered automatically.

In the event of a lagged database copy being mounted without being replayed, for example in the case of logical database corruption such as an active deletion of data or virus outbreak, the mail queue database can replay up to its stored duration and so result in minimal data loss even though the database has been rolled back in time. Therefore, it is important that the duration of time that the mail queue database stores messages, which is two days by default, equals or exceeds the ReplayLagTime value on a mailbox database copy. The SafetyNetHoldTime parameter on Set-TransportConfig defaults to two days and can be increased as the ReplayLagTime can be up to 14 days.
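The relationship between the two durations can be checked with a short comparison. This sketch does not query Exchange: the function name is hypothetical and the durations are entered by hand, using the two day default and the 14 day maximum stated in the text.

```python
from datetime import timedelta

MAX_REPLAY_LAG = timedelta(days=14)  # upper limit quoted in the text

def safety_net_covers_lag(safety_net_hold_time, replay_lag_time):
    """Return True if Safety Net keeps messages at least as long as the lag."""
    if replay_lag_time > MAX_REPLAY_LAG:
        raise ValueError("ReplayLagTime cannot exceed 14 days")
    return safety_net_hold_time >= replay_lag_time

default_hold = timedelta(days=2)
seven_day_lag = timedelta(days=7)

# The default hold time is too short for a 7 day lagged copy.
print(safety_net_covers_lag(default_hold, seven_day_lag))
# False

# Raising the hold time to 7 days closes the gap.
print(safety_net_covers_lag(timedelta(days=7), seven_day_lag))
# True
```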

Objective summary

  • Use a single MX record and multiple A records if you have a single site and multiple servers. If you have more than a handful of servers, a load balancer is usually a better option and a single IP address on the MX record.
  • For smarthost values use a FQDN rather than an IP address because it is easier to manage change in the longer term.
  • For multiple sites with inbound SMTP connectivity, use a cloud hosted filtering service that can direct users' email to the correct site, for example by using conditional routing in EOP. If there is no such facility in the cloud filtering service, or a cloud filtering service is not an option, use the highest priority MX record for the primary site.
  • For message resubmission, remember least cost routing. Understanding this will help you determine the servers that will queue messages given specific site, network, and server failures.
  • Use the -Resubmit parameter of Retry-Queue in the event that you want queuing emails to be reevaluated for new routes if changes have been made that will change the least cost route to a target site or DAG.
  • Take a look at http://vanhybrid.com/2013/11/28/script-putting-exchange-server-2013-into-maintenance-mode/ for a script that will help you place a server into maintenance mode.
  • Exchange Server 2013 should have no requirement to modify receive connectors for anonymous submission. For authenticated client submission, the recommendation is to use port 587 and not 25. Authentication on port 587 will allow relay if the authentication is successful.
  • Set-TransportConfig is used to configure the shadow redundancy timeouts and settings such as whether to use a cross-site DAG or only to use the same site (regardless of DAG node location).
  • Ensure that any lagged database has a ReplayLagTime of less than or equal to the SafetyNetHoldTime.

Objective review

Answer the following questions to test your knowledge of the information in this objective. You can find the answers to these questions and explanations of why each answer choice is correct or incorrect in the “Answers” section at the end of this chapter.

  1. Contoso wants to have a 7 day lagged database copy and wants to ensure that their SafetyNet duration is set to the same value. What command would they use?

    1. Get-TransportService | Set-TransportService -SafetyNetHoldTime 7Days
    2. Get-TransportService | Set-TransportService -SafetyNetHoldTime 7:00.00
    3. Set-TransportConfig -SafetyNetHoldTime 7Days
    4. Set-TransportConfig -SafetyNetHoldTime 7:00:00
  2. Which of the following accepted domains can be included in an email address policy?

    1. Authoritative
    2. InternalRelay
    3. OpenRelay
    4. External Relay
  3. Contoso and Fabrikam are two divisions of the same company. Both were historically separate entities and remain so for email due to compliance reasons. Both organizations have an Exchange Server 2013 deployment in two different datacenters and they use rack space at the partner company’s datacenter to host passive DAG nodes. They would also like to use the Internet connection of the partner in the event of an outage with their own connection for inbound mail flow. What do they need to configure in addition to the records pointing to the primary datacenter?

    1. Create the following DNS records:

      contoso.com MX 5 mail.fabrikam.com

      fabrikam.com MX 5 mail.contoso.com

    2. Create the following DNS records:

      contoso.com MX 10 mail.fabrikam.com

      fabrikam.com MX 10 mail.contoso.com

    3. Create the following DNS records:

      contoso.com MX 20 mail.fabrikam.com

      fabrikam.com MX 20 mail.contoso.com

    4. For each organization, create an internal relay accepted domain and a send connector with the matching address space as the accepted domain.
    5. For each organization, create an external relay accepted domain and a send connector with the matching address space as the accepted domain.