Exam Ref 70-342 Advanced Solutions of Microsoft Exchange Server 2013 (MCSE): Design, Configure, and Manage Site Resiliency

  • 2/10/2015

Objective 2.4: Troubleshoot site-resiliency issues

This objective looks at troubleshooting options for diagnosing and finding faults in the site resiliency features that we have looked at throughout this chapter.

This objective covers how to:

  • Resolve quorum issues
  • Troubleshoot proxy and redirection issues
  • Troubleshoot client connectivity
  • Troubleshoot mail flow
  • Troubleshoot datacenter activation
  • Troubleshoot DAG replication

Resolving quorum issues

For your database availability group (DAG) to be online and able to mount databases, it must have quorum; that is, a majority of the servers in the cluster must be online. If only half of the servers are online, the cluster locks a file on the file share witness, which adds a vote and maintains quorum. This ensures that should the DAG drop to an equal number of nodes online and offline (or unreachable), the DAG will stay online in the primary site. In a multi-site DAG, the file share witness means that one datacenter should always be able to reach quorum while the other site fails to reach it. Datacenter Activation Coordination (DAC) mode is an additional check within a DAG: not only must a majority be present, but nodes must also contact another node with its DACP bit set to 1 before they can mount their databases. DAC mode is disabled by default but should be enabled on all DAGs with two or more servers.

When you have an outage that takes away a single DAG node, you are taken closer to not having quorum. When quorum is lost, all the databases in the cluster dismount. Therefore, it is important to understand quorum and ensure that network outages and maintenance events do not put you in a position of losing quorum. In Windows Server 2008 R2, quorum is calculated as a majority of the total number of nodes in the cluster, whether they are online or offline, and that total does not change unless you evict nodes from the cluster (as you do in a failover event with Restore-DatabaseAvailabilityGroup). In Windows Server 2012 R2, as individual servers go offline, the total node count for majority is recalculated and therefore majority can remain with fewer nodes online. Note that this feature, called dynamic quorum, is enabled by default in Windows Server 2012 R2 and is also available in Windows Server 2012.

Troubleshooting quorum therefore requires knowing the operating system in use under Exchange Server, the total or online number of nodes, and what half + 1 of this count is (rounding half down to a whole number), because half + 1 is the majority.
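The half + 1 arithmetic can be sketched in a few lines of Python. This is only an illustration of the static calculation described for Windows Server 2008 R2, where every node counts toward the total whether it is online or not; it does not model dynamic quorum. The function names are hypothetical, and the sketch assumes the file share witness, when it is in use, contributes exactly one vote.

```python
def majority(total_votes: int) -> int:
    """Smallest number of votes that is more than half of the total (half + 1)."""
    return total_votes // 2 + 1


def has_quorum(total_nodes: int, online_nodes: int, witness_vote: bool = False) -> bool:
    """Static quorum check: online votes must reach the majority of all votes.

    Assumption: the file share witness adds one vote when it is in use and
    reachable; dynamic quorum recalculation is deliberately not modelled.
    """
    extra = 1 if witness_vote else 0
    return online_nodes + extra >= majority(total_nodes + extra)


# Four-node DAG with two nodes down: half the nodes is not a majority on its own,
# but the witness vote tips the balance.
print(has_quorum(4, 2))                      # False
print(has_quorum(4, 2, witness_vote=True))   # True
```

Running the same check against the node counts you expect during planned maintenance shows how many further failures the DAG could absorb before databases dismount.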

To find out your cluster configuration in the DAG you can use PowerShell commands such as Get-Cluster, Get-ClusterNode, and Get-ClusterQuorum. Some of these are shown in Figure 2-27.

Figure 2-27 Cluster reports via Windows PowerShell

Troubleshooting proxy and redirection issues

All client connectivity to an Exchange 2013 mailbox happens through the Client Access Server (CAS) role. The CAS role is an intelligent proxy server for Exchange Server clients. It authenticates the client or determines the recipient of the mail message and forwards the network traffic to the server that is active for that mailbox (the server that’s hosting the active/mounted copy of the database with the user’s mailbox in it). Therefore, if client connectivity to a 2013 mailbox is failing, there are two places to consider. The first is the CAS role and the second is the mailbox server that is active for that user’s mailbox. If you have multiple CAS role servers, the first troubleshooting step is to try a different server. If you have a load balancer, make sure that it is configured to detect individual protocol health issues and redirect clients when Exchange managed availability updates the Healthcheck.htm file for each protocol.
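As a rough illustration of a per-protocol health probe, the Python sketch below builds the health check URL for each protocol and treats an HTTP 200 response as healthy. The virtual directory names, the lowercase healthcheck.htm path, the 200-means-healthy rule, and the server name cas1.contoso.com are assumptions made for this sketch rather than details taken from this chapter, so verify them against your own servers and load balancer documentation.

```python
import urllib.error
import urllib.request

# Assumed virtual directory names for each client protocol (illustrative only).
PROTOCOL_VDIRS = {
    "OWA": "owa",
    "ECP": "ecp",
    "EWS": "ews",
    "ActiveSync": "Microsoft-Server-ActiveSync",
    "OutlookAnywhere": "rpc",
}


def healthcheck_url(cas_fqdn: str, protocol: str) -> str:
    """Build the per-protocol health check URL for one CAS server."""
    return f"https://{cas_fqdn}/{PROTOCOL_VDIRS[protocol]}/healthcheck.htm"


def protocol_is_healthy(status_code: int) -> bool:
    """Assumption: only an HTTP 200 response means the protocol is healthy."""
    return status_code == 200


def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for the URL; connection failures propagate."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # for example, an unhealthy protocol answering with 503


print(healthcheck_url("cas1.contoso.com", "OWA"))
```

Probing each protocol separately matters because one protocol can be unhealthy on a server while the others continue to work.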

If the CAS servers are working and are able to proxy the connection to a mailbox server, but there is still no connectivity, troubleshoot issues with opening the actual mailbox. Also consider using various protocols to see if the issue is protocol specific. Exchange Server can quarantine mailboxes that introduce performance issues to the server, and a mailbox in quarantine would be unavailable while other mailboxes on the same database work fine.

When the mailbox is located on a legacy Exchange Server and the client connects to a CAS 2013 server, the CAS 2013 server will prefer to proxy the connection to the target Exchange version unless the connection is to OWA on an Exchange 2007 server, or to OWA or ECP on an Exchange 2010 server with the External URL set. In these two cases the CAS 2013 server will redirect to the legacy URL for OWA 2007, and the External URL for OWA/ECP 2010. Exchange 2013, apart from the two cases above, will proxy to the FQDN of the 2007 or 2010 Exchange Server. Therefore, for legacy Exchange connectivity each Exchange Server needs to be able to connect directly to the FQDN of each legacy server in the organization. It is worth pointing out that firewalls between Exchange Servers and other Exchange Servers and domain controllers are not supported.
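The proxy-or-redirect rule in the preceding paragraph can be written as a small decision function. This Python sketch encodes only the two redirection cases described above; the function name and parameters are hypothetical, and real deployments have further conditions that are not modelled here.

```python
def cas2013_action(mailbox_version: int, protocol: str, external_url_set: bool = False) -> str:
    """Return 'redirect' or 'proxy' for a client that connects to a CAS 2013 server.

    Encodes only the two redirection cases described in the text:
      * OWA for a mailbox on Exchange 2007 (redirect to the legacy URL)
      * OWA or ECP for a mailbox on Exchange 2010 with the External URL set
    Everything else is proxied to the FQDN of the target server.
    """
    protocol = protocol.upper()
    if mailbox_version == 2007 and protocol == "OWA":
        return "redirect"
    if mailbox_version == 2010 and protocol in ("OWA", "ECP") and external_url_set:
        return "redirect"
    return "proxy"


print(cas2013_action(2007, "OWA"))                         # redirect
print(cas2013_action(2010, "ECP", external_url_set=True))  # redirect
print(cas2013_action(2010, "EWS", external_url_set=True))  # proxy
```

When a legacy client fails, knowing which branch applies tells you whether to check the legacy namespace and External URL (redirect) or direct connectivity to the legacy server FQDN (proxy).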

Troubleshooting client connectivity

When you have client connectivity issues, the first piece of troubleshooting is to see the scope of the issue. Does it affect everyone, or one person, or somewhere in between? Once you have a scope to the issue you can look for something that might be common between all of the users. For example, are they all in the same database or site or something that would allow you to tie the connection issue to something you can go and investigate?

For client connectivity, another great tool is the ability to use multiple client types to connect to Exchange Server. For example, if Outlook is having an issue, do you get the same with OWA? If SMTP is having an issue, what about IMAP or POP3, if you are using a client that uses these protocols? If you can limit connectivity issues to a given client type, that will help. For issues where clients cannot connect, but you are able to open a web browser and log in to OWA, start by reviewing any recent changes and consider settings and configuration such as AutoDiscover as part of client troubleshooting, because OWA does not need AutoDiscover (unless you are using OWA for Devices on an iPhone, iPad, or Android phone, because these apps do use AutoDiscover).

Once you have the scope of the issue and a client in which the problem is exhibited, you are well on the way to finding an answer. Always examine the event viewer logs on a server to see if the issue is caused by something that is surfaced in the logs, and always be very careful about making changes to fix issues without first having a good understanding of the issue, as a change could compound the problem.

Troubleshooting mail flow

To understand and troubleshoot mail flow is to understand the SMTP protocol, which Exchange Server services do what, and what the connectors and other configuration items are used for.

Troubleshooting connectors

Receive connectors in Exchange Server are the SMTP server. To send email into and across an Exchange Server organization, email is accepted by a receive connector. When there is more than one receive connector on a target machine, the connectors need to be configured so that they either listen on a unique IP address or port (the connector's binding) or answer for specific ranges of IP addresses. Troubleshooting for specific bindings is easy. As long as you can make a connection to the binding IP and port, you know connectivity is working. But how can you tell if you have connected to the right receive connector when the connector is supposed to answer you based on your source IP address? The easiest way to do this is to configure the banner property of the receive connector. The banner is the message that starts with 220 that you receive when you connect to a receive connector. If you set each receive connector on a server to a unique value, you can clearly tell which receive connector you have reached when you connect to it on its listening IP address and port.

Once you have made a successful connection to the listening IP address and port, it is also useful to be able to enter the commonly expected SMTP verbs by way of a telnet session. To open a session using the telnet client to a remote Exchange Server, use telnet remote_IP_address port. If you were trying to connect to server 131.107.2.200, you would type telnet 131.107.2.200 25 and you would expect to see 220 and the configured banner, or if it’s the default banner, you would expect to see 220 and the server name and date/time. Once you have connected, use the EHLO domain.com command to say hello to the remote server and to tell it your domain name. It should respond with 250 OK or 250-SomeSMTPVerbs and then 250 VERB. The last line will read 250 space verb. All of the other lines will read 250 hyphen verb.
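The hyphen-versus-space rule for multi-line replies is easy to check mechanically. The following Python sketch parses one complete reply (a 220 banner or a 250 response to EHLO) and raises an error if the continuation markers are wrong. The function name is hypothetical, and the sample banner text and verb list are illustrative rather than captured from a real server.

```python
def parse_smtp_reply(lines):
    """Parse one complete SMTP reply into (code, [text, ...]).

    Every line except the last must read 'code-text' (code, hyphen, text);
    the last line must read 'code text' (code, space, text).
    """
    if not lines:
        raise ValueError("empty reply")
    code = lines[0][:3]
    if not code.isdigit():
        raise ValueError(f"reply does not start with a numeric code: {lines[0]!r}")
    texts = []
    for position, line in enumerate(lines):
        is_last = position == len(lines) - 1
        if line[:3] != code:
            raise ValueError(f"reply code changed mid-reply: {line!r}")
        if line[3:4] != (" " if is_last else "-"):
            raise ValueError(f"unexpected separator in line: {line!r}")
        texts.append(line[4:])
    return int(code), texts


# Illustrative banner set on a receive connector, then an illustrative EHLO response.
print(parse_smtp_reply(["220 Internal Relay Connector"]))
print(parse_smtp_reply(["250-mail.domain.com Hello", "250-SIZE", "250-PIPELINING", "250 CHUNKING"]))
```

The text after the 220 code is where a customized banner appears, so comparing it with the value configured on each receive connector tells you which connector answered.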

After the supported verbs are returned, try MAIL FROM: email.address@domain.com and then RCPT TO: valid.address@recipient.domain.com. When telnetting into Exchange Server on port 25, you cannot enter an external email address unless you are connecting to a receive connector that allows for relay.

The DATA command follows the successful entry of the MAIL FROM and RCPT TO commands. You can have one MAIL FROM and one or more RCPT TO commands. One DATA command ends the message envelope and moves on to the message body.

In the message body, enter To:, From:, and Subject: all with valid values after the colon and each on their own line. After Subject: have a blank line to end the headers, and then type the message body. Finish the message with a period on a line on its own followed by QUIT.
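Putting the steps together, the sketch below generates the lines you would type in a telnet session, in order: EHLO, the envelope commands, DATA, the headers, a blank line, the body, a lone period, and QUIT. It is a Python illustration with a hypothetical function name; the addresses are the placeholders used in the text, and the commands follow the form shown there. The doubling of a leading period in body lines is the standard SMTP transparency rule, added here so that a body line consisting of a single period cannot end the message early.

```python
def build_telnet_script(helo_domain, sender, recipients, subject, body):
    """Return the lines to type, in order, for a manual SMTP test message."""
    lines = [f"EHLO {helo_domain}", f"MAIL FROM: {sender}"]
    lines += [f"RCPT TO: {recipient}" for recipient in recipients]  # one per recipient
    lines.append("DATA")  # ends the envelope, starts the message
    lines += [
        f"To: {', '.join(recipients)}",
        f"From: {sender}",
        f"Subject: {subject}",
        "",  # blank line ends the headers
    ]
    # A body line that begins with a period gets an extra leading period.
    lines += ["." + text if text.startswith(".") else text for text in body.splitlines()]
    lines += [".", "QUIT"]  # a lone period ends the message
    return lines


for line in build_telnet_script(
    "domain.com",
    "email.address@domain.com",
    ["valid.address@recipient.domain.com"],
    "Telnet test",
    "This is a test message.",
):
    print(line)
```

Wait for the server's response after each envelope command before typing the next one, since a rejected MAIL FROM or RCPT TO is itself the troubleshooting result you are looking for.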

If you do not get the expected response to any of the verbs typed previously (Figure 2-28 shows an example of the expected responses), you have further troubleshooting to do. The most common reason for an anonymous connection to an Exchange Server to fail is that the server is out of resources such as disk space or memory and is in a state known as back pressure. A look in the event logs will give the reason. When this is resolved, mail flow will automatically resume.

Figure 2-28 Using telnet client to successfully connect to an Exchange Server receive connector

Outbound connectors, or send connectors on the transport service, will queue messages that cannot be delivered. All of the other send connectors on other services are stateless and do not queue messages. Therefore, if there is a problem with a send connector that uses a Frontend CAS to proxy through, it will queue in the transport service. If the destination is offline or otherwise unavailable, the message will queue in the transport server that holds the send connector to that destination. If the transport service is offline, the mailbox transport submission service, which delivers messages between the mailbox and the transport service, will not queue and the message will stay in the outbox in the client.

When messages are queued on the transport service you can use Get-Queue or Get-QueueDigest to review the queue on a given machine, or across all of the machines in the DAG or site. Get-Queue | Format-List LastError will return the last error on any given queue. Sometimes you will not get an error on a queue, but will get errors on the messages in the queue, and for this you need to use Get-Message | Format-List LastError instead.

Troubleshooting transport services

In Exchange Server 2013, there are a number of transport services. These are as follows (with the process name in brackets) and the role the service runs on listed as well:

  • Mailbox server role

    • Transport (EdgeTransport.exe)
    • Mailbox transport delivery service (MSExchangeDelivery.exe)
    • Mailbox transport submission service (MSExchangeSubmission.exe)
  • Client Access Server role

    • Frontend transport (MSExchangeFrontendTransport.exe)

The frontend transport and the two mailbox transport services are stateless; that is, they proxy messages and do not store them on disk. Frontend transport finds the correct mailbox server to proxy the message to: it delivers the message to any server in the same DAG as the active mailbox (or in the same site, if the server is not a DAG member), and to any mailbox server in the local site for messages going to legacy servers or distribution groups.

If frontend transport is not running, TCP port 25 will not be listening on the CAS server. The transport service on the mailbox server listens on TCP 25 if it is a mailbox-only server, but on TCP 2525 if it is co-located with the CAS role, so that only CAS listens on TCP 25. Two services cannot listen on the same port, yet in Exchange 2013 it is possible to build receive connectors on the transport service that listen on port 25 when CAS is also listening on that port. This can cause many issues, so on all co-located servers always bind receive connectors to the frontend transport service.
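The port assignments described above reduce to a simple lookup, sketched here in Python. The function name and the dictionary keys are hypothetical labels for this example; the port numbers are the ones given in the text.

```python
def expected_smtp_ports(has_cas_role: bool, has_mailbox_role: bool) -> dict:
    """Return the TCP port each SMTP-listening service is expected to use."""
    ports = {}
    if has_cas_role:
        ports["Frontend transport"] = 25
    if has_mailbox_role:
        # Transport moves to 2525 when co-located so that only CAS holds port 25.
        ports["Transport"] = 2525 if has_cas_role else 25
    return ports


print(expected_smtp_ports(has_cas_role=True, has_mailbox_role=True))
# {'Frontend transport': 25, 'Transport': 2525}
print(expected_smtp_ports(has_cas_role=False, has_mailbox_role=True))
# {'Transport': 25}
```

Comparing this expectation with what is actually listening on a server is a quick way to spot a receive connector that has been bound to the wrong service.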

The mailbox transport services send messages to databases (mailbox delivery) and receive from the mailbox databases (mailbox submission). If either of these services is offline, then sending to or receiving from the database will not occur.

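The Health Manager service and Managed Availability will restart services or attempt repairs to keep Exchange Server healthy. Therefore, look up maintenance mode for Exchange Server, because that is how you tell the Health Manager not to restart or attempt to fix components when you have taken the server offline, or partially offline, on purpose.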

Troubleshooting transport-related configuration

Change is usually the biggest cause of outages in IT systems: someone has changed something and now something is broken. For transport, the objects that you configure to ensure valid mail flow typically work until something changes in them, or until the targets they send to or receive from change, such as a smarthost or firewall rule change that leaves the smarthost unreachable.

Always have change control and keep a record of all configurations before and after they are changed. In Exchange Management Shell (and Windows PowerShell in general), this is easy to implement with Start-Transcript at the beginning and Stop-Transcript at the end. This records everything you do to a log file. Therefore, before you make changes, for example to an Accepted Domain, you would run Get-AcceptedDomain | fl to write the configuration you have in place at this time to the screen and also to the transcript log file. Then make your changes. If you need to roll these changes back, you have a record of what to roll back to.

If you use ECP to make changes, remember that the admin audit log can be queried to show you what you have changed, but it will not show you what it was before the change was made!

Troubleshooting datacenter activation

If you have a site failover and you need to activate passive database copies in your secondary datacenter, you need to use the Stop-DatabaseAvailabilityGroup and Restore-DatabaseAvailabilityGroup cmdlets. The Stop cmdlet (Stop-DatabaseAvailabilityGroup -ActiveDirectorySite PrimarySiteName) adds the servers that are part of the cluster in the failed site to the DAG stopped servers list, and the Restore cmdlet then evicts them from the cluster and reduces the node count so that majority can be obtained in the secondary/surviving datacenter.

Unless you have the file share witness in a third site that the secondary datacenter can access and the primary cannot, you must use manual processes like the one described here to perform a failover to the other site. With a file share witness in a third site that both sites can reach independently, you can have automatic failover as long as both sites hold the same number of cluster nodes. If the primary site then goes down but the third site with the file share witness does not, majority is maintained and automatic failover occurs.
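The vote counting behind automatic failover can be checked with a short Python sketch. It assumes a static majority calculation (no dynamic quorum), a DAG in which the file share witness contributes one vote, and a witness that survives only when it is not in the failed primary site. The function name and the witness_site values are hypothetical.

```python
def survives_primary_site_loss(primary_nodes: int, secondary_nodes: int, witness_site: str) -> bool:
    """Return True if the surviving votes still form a majority after the primary site fails.

    witness_site is 'primary', 'secondary', or 'third' (hypothetical labels).
    Assumptions: static majority over all votes; the witness has one vote.
    """
    total_votes = primary_nodes + secondary_nodes + 1           # nodes plus the witness
    surviving_votes = secondary_nodes + (0 if witness_site == "primary" else 1)
    return surviving_votes >= total_votes // 2 + 1


print(survives_primary_site_loss(2, 2, "third"))    # True: 3 of 5 votes remain
print(survives_primary_site_loss(2, 2, "primary"))  # False: 2 of 5 votes remain
print(survives_primary_site_loss(4, 2, "third"))    # False: 3 of 7 votes remain
```

The last case shows why the node counts in the two sites need to match: with more nodes in the primary site, the secondary site plus the witness cannot reach a majority on their own.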

If you do not have DAC mode enabled, which you should on a DAG with two or more nodes, then you need to use cluster commands to assist in the failover process. As well as preventing split-brain scenarios, DAC mode allows you to manage the DAG using only Exchange cmdlets instead of needing to know the additional cluster commands as well.

Troubleshooting DAG replication

Unless it has been changed, DAG replication occurs over TCP port 64327. Therefore, this port should be open for connectivity between nodes of the same DAG, though of course placing a firewall between Exchange servers is not supported and can generate unexpected results.
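A basic reachability test for the replication port can be sketched in Python using only the standard library. This checks that a TCP connection can be opened, nothing more; it does not verify that replication itself is healthy. The function name is hypothetical, and the host you pass in would be another member of the same DAG.

```python
import socket

DEFAULT_DAG_REPLICATION_PORT = 64327  # default port stated in the text


def port_is_reachable(host: str, port: int = DEFAULT_DAG_REPLICATION_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Run the check from each DAG member toward every other member, on the network that Exchange is expected to use for replication, because a port that is open in one direction may still be blocked in the other.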

For the Database Availability Group, replication happens on the replication network if one has been designated. Ports need to be open for this connectivity on the network that Exchange expects it to be on.

To see the state of the replication of your DAG, use Get-DatabaseAvailabilityGroup to find the DAG settings. Use Get-MailboxDatabaseCopyStatus to find the state of the replication and which servers are active for which databases, that is, which server each database is mounted on. The copy and replay queue lengths, described below, should ideally be low (<10) unless you are looking at a lagged database, which will have a high replay queue but should still have a low copy queue.

The Get-MailboxDatabaseCopyStatus cmdlet returns the status of the database copies on the local machine and shows the health of each one. One server in the DAG should have the database Mounted, and other servers should have the database Healthy. Disconnected and other states should be investigated. Figure 2-29 shows Get-MailboxDatabaseCopyStatus run against the local server (with no additional parameters) and against a remote server (Get-MailboxDatabaseCopyStatus -Server servername).

Figure 2-29 Get-MailboxDatabaseCopyStatus against two servers in the same DAG

Figure 2-29 shows that the working databases are mounted on the second server in the output and healthy on the first, with no copy queue or replay queue. One database is offline and has an unreachable source database, hence the large copy queue length, which actually indicates an error rather than a real count of outstanding logs.
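The health rules described in this section (a copy should be Mounted or Healthy, and queue lengths should stay below about 10 unless the copy is lagged) can be expressed as a small triage function. This Python sketch works on values you have already read from the cmdlet output; it does not call Exchange itself, and the function name, parameters, and message wording are hypothetical.

```python
def copy_status_findings(status, copy_queue_length, replay_queue_length, lagged=False, threshold=10):
    """Return a list of reasons why a database copy deserves investigation.

    An empty list means the copy looks fine under the rules of thumb in the text.
    """
    findings = []
    if status not in ("Mounted", "Healthy"):
        findings.append(f"status is {status}")
    if copy_queue_length >= threshold:
        findings.append(f"copy queue length is {copy_queue_length}")
    # A lagged copy is expected to carry a large replay queue, so it is exempt here.
    if not lagged and replay_queue_length >= threshold:
        findings.append(f"replay queue length is {replay_queue_length}")
    return findings


print(copy_status_findings("Healthy", 0, 0))                 # []
print(copy_status_findings("Healthy", 2, 5000, lagged=True)) # []
print(copy_status_findings("Disconnected", 12000, 0))
# ['status is Disconnected', 'copy queue length is 12000']
```

A very large copy queue length reported alongside a failed state is usually a sign of the underlying error rather than a true count of logs, so investigate the state first.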

Using Get-MailboxDatabaseCopyStatus, you can find out the count of transaction logs that are created on the server that holds the mounted, or active, copy of the database and that need to be shipped to all the other copies of the database. They are copied from the active copy to each passive copy. There is no passive-to-passive copying. It is important to note that if you have multiple passive copies of a database on the far side of a WAN link, the logs will be copied once per passive copy and will require double or more bandwidth.
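Because each passive copy receives its own stream of logs from the active copy, WAN traffic scales with the number of copies on the far side of the link. The Python sketch below estimates that traffic. It assumes a log file size of 1 MB and ignores any compression or encryption overhead on the replication traffic; both are assumptions for this illustration, and the function name is hypothetical.

```python
def wan_log_traffic_mb(logs_generated: int, passive_copies_across_wan: int, log_size_mb: float = 1.0) -> float:
    """Estimate the data shipped across the WAN for a given number of generated logs.

    Assumptions: each log is log_size_mb in size (1 MB by default) and every
    passive copy across the WAN receives its own copy from the active database.
    """
    return logs_generated * log_size_mb * passive_copies_across_wan


# 1,000 logs with one remote copy versus two remote copies.
print(wan_log_traffic_mb(1000, 1))  # 1000.0
print(wan_log_traffic_mb(1000, 2))  # 2000.0
```

Comparing the estimate with the available link capacity for the same period indicates whether a growing copy queue is simply a bandwidth shortfall.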

Use Get-MailboxDatabaseCopyStatus on each passive copy to see the log copy status. If a server is behind on its log copy, it will have a higher than expected value. You should troubleshoot the log replication process from the active to that passive server.

Once logs arrive on the passive server, they are inspected for integrity and copied again if they fail the inspection. They are then written into the passive database replica. If there are issues writing the log into the database, for example on a disk with poor write speeds, the replay log count will increase. On a lagged database copy, there will always be a replay queue length of the number of transaction logs generated in the time window that the database is lagged by. On a server with a generally consistent level of activity and mail flow, this number will generally be the same from one day to the next at the same time of day. It will fluctuate over the day and week because the active copy will change in terms of its activity levels. For a lagged database copy, consider the values of the replay queue length that you expect and ensure that these are not massively over your expectations.
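For a lagged copy, the expected replay queue length is roughly the log generation rate multiplied by the lag window. The Python sketch below makes that estimate and flags a queue that is well above it. The 50 percent tolerance is an arbitrary value chosen for this example, not a recommendation, and the function names are hypothetical.

```python
def expected_lagged_replay_queue(logs_per_hour: float, lag_hours: float) -> int:
    """Estimate the replay queue length of a lagged copy from the log generation rate."""
    return round(logs_per_hour * lag_hours)


def replay_queue_is_excessive(actual: int, expected: int, tolerance: float = 0.5) -> bool:
    """Flag a replay queue that exceeds the expectation by more than the tolerance.

    The default tolerance of 0.5 (50 percent) is an arbitrary illustration.
    """
    return actual > expected * (1 + tolerance)


expected = expected_lagged_replay_queue(logs_per_hour=200, lag_hours=24)
print(expected)                                    # 4800
print(replay_queue_is_excessive(5000, expected))   # False
print(replay_queue_is_excessive(10000, expected))  # True
```

Because activity varies over the day and week, base the logs-per-hour figure on measurements taken at the same time of day as the reading you are comparing against.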

If you have a large copy queue length or a large replay queue length on a passive server, you will need the disk space to store these logs. For a large copy queue length, it will mean the logs are not removed from the active copy even if the active copy is backed up. Always attempt to fix the copy or replay issue rather than manually deleting log files. If you delete a log file that is required by a passive copy of that database, you will most likely have to reseed the entire database.

Objective summary

  • A DAG needs to maintain quorum for databases to remain mounted in it. You looked at ensuring that maintenance, patching, etc. do not cause an entire DAG outage due to less than the majority of the DAG nodes remaining online.
  • Unexpected failures can happen. Do not actively shut down servers for maintenance when doing so will bring you close to losing majority.
  • If you evict servers from a cluster and therefore from the DAG, you will need to copy the entire database back to that server when you add it back into the DAG unless all of the log files containing all of the changes are still available on the active node. Do not remove servers from clusters unless there is a site failover or the end of the server’s role within the DAG.
  • For the most part, Exchange will proxy connections through the CAS role to the active mailbox server or to a legacy server in the same Active Directory site as the mailbox. There are only a few occasions where a redirection occurs and these are for OWA 2007, where the user is redirected to the legacy namespace (which is required), and OWA 2010 where the user is redirected to the ExternalURL if one is set.
  • In the case of redirecting from Exchange 2013 to Exchange 2010 with an ExternalURL set, you will need to authenticate again if you are running a lower cumulative update release.
  • Use the Health Manager service and Managed Availability to try to ensure that Exchange Server remains functional and healthy, and to fail over databases and restart services/servers to try to resolve issues.
  • Use the Get-MailboxDatabaseCopyStatus cmdlet to know the copy status of your databases and to help ensure your databases remain with their replicas up to date.

Objective review

Answer the following questions to test your knowledge of the information in this objective. You can find the answers to these questions and explanations of why each answer choice is correct or incorrect in the “Answers” section at the end of this chapter.

  1. Which of the following Windows PowerShell commands will return the list of servers and the state of the servers in a cluster?

    1. Get-ClusterNode
    2. Get-ClusterServer
    3. Get-Cluster <Name> | FL *node*
    4. Cluster.exe Node
  2. You notice that when using Get-MailboxDatabaseCopyStatus on a server that hosts only passive database copies, you have a large copy queue length of over 10,000 logs for one of these databases. Which of the following could be the potential impacts of this issue?

    1. Backups will not truncate log files.
    2. Disk space for logs might run out.
    3. The active database might dismount.
    4. The transaction logs on the lagged copy will auto play forward.
  3. What does the RCPT SMTP verb do?

    1. It tells the SMTP server to send a read receipt.
    2. It tells the SMTP client to send a read receipt.
    3. It tells the SMTP client that the email has been received.
    4. It tells the SMTP server the email address of the recipients of the email.