How to Install, Configure, and Manage the Mailbox Role in Microsoft Exchange Server 2013

  • 5/29/2015

Objective 1.4: Monitor and troubleshoot the mailbox role

Maintaining a healthy and highly available Exchange 2013 environment requires monitoring the environment for issues affecting database replication, database copy activation, and mailbox role performance. Managed availability monitors for known issues and remediates them when possible. However, it shouldn’t replace monitoring of the environment for misconfigurations or other environmental health issues, which can result in larger unplanned outages if not addressed early.

Troubleshooting database replication and replay

In normal operation, transaction logs are replicated to database copies, inspected for errors, and, if no errors are encountered, replayed into the database copy. In the case of a lagged database copy, the logs are inspected, but not replayed until the lag time requirement is met. Log truncation also occurs on the active copy of the database when the truncation criteria are met, and the process requires all of the copies to be healthy. All of the database copies must have replayed a log file before it can be truncated. In the case of lagged copies, the logs must have been inspected successfully. If one of the database copies doesn’t meet these criteria, log truncation can’t occur, even if circular logging is configured or the database backup has completed successfully.

If a database copy is offline or unreachable, it can cause a problem in log replication and truncation. This is because an active copy won’t truncate any logs until all of the copies are verified. When the logs aren’t truncated, all of the database copies, including the active copy, keep accumulating logs. This creates the potential of running out of disk space if the faulty database copy isn’t remediated or removed to allow the truncation process to resume.

Both planned maintenance that takes an extended amount of time and unplanned outages that make database copies unavailable affect the log truncation process in this way.

You can identify a copy with problems by running the Get-MailboxDatabaseCopyStatus cmdlet. Any copies with a copy queue length greater than zero, replay queue length greater than zero, or a failed or suspended state need to be investigated for cause and must be remediated.
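As a minimal sketch of that check in the Exchange Management Shell, the following filters the copy status output for anything that needs attention. The server name MBX1 is a placeholder, and the filter is only one way to express the criteria above. Keep in mind that a lagged copy normally reports a replay queue length greater than zero by design.

```powershell
# MBX1 is a hypothetical server name; substitute one of your DAG members.
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Where-Object {
        $state = [string]$_.Status
        $_.CopyQueueLength -gt 0 -or
        $_.ReplayQueueLength -gt 0 -or
        ($state -ne 'Healthy' -and $state -ne 'Mounted')
    } |
    Format-Table Name, Status, CopyQueueLength, ReplayQueueLength -AutoSize
```

An empty result suggests that every copy on the server is either mounted or healthy with empty queues.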

When a database copy has a copy queue length greater than zero, the replication service is unable to replicate the required log files from the active database copy to that replica. If the problem is on the source server, all of the passive copies of the database will have a copy queue length greater than zero. This usually occurs when a required log file is missing, which could be the result of misbehaving or misconfigured antivirus software, or even an accidental deletion by an administrator. In such instances, the missing file must be restored before replication can resume.

Once the missing log file is restored, run the Resume-MailboxDatabaseCopy cmdlet to resume the replication of log files to the passive database copies.
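A sketch of that step might look like the following. The database name DB1 and server name MBX2 are hypothetical; a database copy is identified as Database\Server. Repeat the command for each passive copy that needs to be resumed.

```powershell
# DB1 and MBX2 are placeholder names for the database and the server hosting the passive copy.
Resume-MailboxDatabaseCopy -Identity 'DB1\MBX2'

# Check that the copy queue starts to drain after replication resumes.
Get-MailboxDatabaseCopyStatus -Identity 'DB1\MBX2' |
    Format-List Name, Status, CopyQueueLength, ReplayQueueLength
```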

If the mailbox servers hosting passive database copies are configured with a different disk layout and capacity, or if the disk hosting a replica is shared with another application, the disk may run out of space before the expected log truncation can occur. In this case, the affected database copy will have a copy queue length greater than zero. To resume log file copying from the active database, address the disk space issue on the target server.

When a database copy has a replay queue length greater than zero, the replication service is unable to replay the received log files into the database copy.

In addition to disk space issues like those described earlier and file-level permission issues, this can also be caused by a failure to inspect the received log files successfully. A corrupted log file and file-level antivirus scanners are common culprits, but they aren’t the only ones.

When the database copy status is FailedAndSuspended, replication to the database copy is suspended, which impacts the log truncation process, as previously discussed. When a database copy is in this state, the detected failure requires manual intervention.

A common cause for this error is when the server is unable to mount the database for the replay of log files, or the database has diverged from the active mailbox database to the point where it must be updated manually using the Update-MailboxDatabaseCopy cmdlet. As discussed in the previous section, Managing mailbox database copies, you can specify which database copy should be used as a source if the target server is in a remote site and you need to avoid replication over WAN links.
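For illustration, a reseed that pulls from a nearby healthy copy rather than from the active copy could be sketched as follows. All names are hypothetical: DB1 is the database, MBX3 hosts the failed copy, and MBX2 hosts a healthy copy in the same site as MBX3.

```powershell
# Reseed the failed copy on MBX3, using the copy on MBX2 as the seeding source.
# -DeleteExistingFiles removes the existing database and log files on the target first.
Update-MailboxDatabaseCopy -Identity 'DB1\MBX3' -SourceServer 'MBX2' -DeleteExistingFiles
```

Seeding a large database can take a long time, so it is worth confirming the source and target names before running the command.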

The incremental resync feature included in Exchange 2013 is designed to automatically correct a divergence between database copies. When incremental resync detects divergence, it searches the log file stream to locate the point of divergence, locates the changed database pages, and then requests them from the active copy. The changes are applied to the diverged database copy to bring it back in sync with the active copy. Note that when a database copy has reached the FailedAndSuspended status, the divergence can’t be repaired by the incremental resync process and manual intervention becomes a necessity.

The database replication process also includes a content index catalog. The content index catalog is one of the components included in the health checks used by the best copy selection (BCS) process. When a content index is corrupt, the Get-MailboxDatabaseCopyStatus cmdlet shows the content index state as FailedAndSuspended. Similar to the failed and suspended state of a mailbox database copy, the content index can be fixed by running the Update-MailboxDatabaseCopy cmdlet with the CatalogOnly parameter.
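The content index check and repair described above can be sketched as follows, assuming a hypothetical server MBX1 that hosts a copy of a hypothetical database DB1.

```powershell
# List copies on MBX1 whose content index is not healthy.
Get-MailboxDatabaseCopyStatus -Server MBX1 |
    Where-Object { [string]$_.ContentIndexState -ne 'Healthy' } |
    Format-Table Name, Status, ContentIndexState -AutoSize

# Reseed only the content index catalog for the affected copy; the database itself is left alone.
Update-MailboxDatabaseCopy -Identity 'DB1\MBX1' -CatalogOnly
```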

Troubleshooting database copy activation

For a DAG to provide protection from failures and the ability to perform scheduled maintenance without affecting users, the passive copy of the database must be healthy and able to mount as the active copy when needed.

Activating a database copy is a complex operation involving many components, such as Active Manager, cluster service, and quorum and network components. Not only does a database need to be healthy, but the underlying components must also be healthy and functional.

When a database copy fails to mount, troubleshooting depends on symptoms and a combination of other factors. A methodical approach to troubleshooting yields the best results. Exchange 2013 also provides numerous events and tools that can be used to determine the status and possibly cause of the problem you’re trying to troubleshoot. The proactive use of such tools can help prevent an unexpected outage.

One such tool is the Test-ReplicationHealth cmdlet. This cmdlet is designed to check, on demand, the state of continuous replication, the availability of Active Manager, the health and status of the cluster service, and the quorum and network components. The cmdlet can be run locally on a mailbox server or remotely against a mailbox server that’s a member of a DAG. The following is a sample output of the Test-ReplicationHealth cmdlet.

Sample output from Test-ReplicationHealth cmdlet

[PS] C:\>Test-ReplicationHealth

Server         Check                     Result    Error
------         -----                     ------    -----
Server1        ReplayService             Passed
Server1        ActiveManager             Passed
Server1        TasksRpcListener          Passed
Server1        DatabaseRedundancy        *FAILED*  There were database...
Server1        DatabaseAvailability      *FAILED*  There were database...

Each check against the given server tests an individual component or criterion for success or failure. You might have noticed that Server1 in the previous example has passed three checks and failed two. The first three checks verify that the replication service is running, that Active Manager is running and has a valid Primary Active Manager or Secondary Active Manager role, and that the tasks listener is running and listening for remote requests.

The database redundancy and availability checks ensure that you have more than one copy of the database configured and that those copies are healthy.

When the first three checks fail, you need to ensure that the relevant services are running and, in case of Active Manager, the cluster service is functioning and Active Manager can communicate with other DAG members to achieve quorum.

If the database redundancy and availability checks fail, first make sure the database in error is configured to have more than one copy. For databases with multiple copies, determine the reason for the failure by reviewing the detailed status of each check provided by the cmdlet.
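Because the table view truncates the Error column, one way to read the full detail is to keep only the checks that reported an error and display them as a list. This is a sketch; MBX1 is a hypothetical server name, and the property names are taken from the column headings in the sample output above.

```powershell
# Show the complete error text for every check that did not pass on MBX1.
Test-ReplicationHealth -Identity MBX1 |
    Where-Object { $_.Error } |
    Format-List Server, Check, Result, Error
```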

The replication issues previously discussed can also be a contributing factor to the redundancy and availability check failures. Be sure to perform the necessary troubleshooting, as discussed earlier.

Besides replication and copy configuration issues, database copy activation is also affected by configuration, which might not necessarily be a misconfiguration.

For example, a mailbox server can be configured to block database activation on a given server. This is usually the case when an administrator wants to perform maintenance on the server and has configured the server to avoid the activation of databases during the maintenance window. It is also possible to configure the DatabaseCopyAutoActivationPolicy parameter of the Set-MailboxServer cmdlet to the value IntrasiteOnly. This configuration enables an administrator to restrict the activation of the databases to the same site as the server where the database is currently active. This prevents cross-site failover and activation. While this isn’t a misconfiguration, it can certainly block the activation of a database copy on a given server.
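A short sketch of reviewing and changing this setting follows. MBX1 is a hypothetical server name; Blocked, IntrasiteOnly, and Unrestricted are the values the parameter accepts.

```powershell
# Review the current activation policy on every Mailbox server.
Get-MailboxServer | Format-Table Name, DatabaseCopyAutoActivationPolicy -AutoSize

# Block automatic activation on MBX1 before a maintenance window.
Set-MailboxServer -Identity MBX1 -DatabaseCopyAutoActivationPolicy Blocked

# Restore unrestricted activation once maintenance is complete.
Set-MailboxServer -Identity MBX1 -DatabaseCopyAutoActivationPolicy Unrestricted
```

If a copy unexpectedly refuses to activate on a server, checking this value is a quick way to rule out a leftover maintenance setting.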

Other configuration parameters that can affect database activation on a server are MaximumActiveDatabases and MaximumPreferredActiveDatabases. These parameters are designed to provide a mechanism that can help address design requirements.

For example, if a mailbox server is designed to host 10 active databases with 5,000 users each, the server can still host more than 10 active database copies. This creates the potential for degraded server performance when more databases are activated on the server than it is designed to handle. The MaximumActiveDatabases and MaximumPreferredActiveDatabases parameters are designed to protect against such degradation by enabling administrators to limit the number of active databases. Limiting the maximum and preferred numbers of active databases helps optimize server performance by hosting only the number of databases the server is designed to handle. While it might seem that the two parameters could conflict, MaximumPreferredActiveDatabases is honored only during best copy and server selection, database and server switchovers, and DAG rebalancing. The preferred limit is therefore a soft limit that should be set to the lower, optimal number of active databases, whereas MaximumActiveDatabases should be higher than the preferred value and should match the maximum number of active databases the mailbox server is designed to host.

When a database fails to mount, check not only for errors in database copy, Active Manager, cluster, network, and server health, but also for configuration parameters that might block activation of a database copy on a given server.
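Continuing the example, the two limits could be set as follows. The server name MBX1 and the values 8 and 10 are illustrative assumptions only; choose numbers that match your own server design.

```powershell
# Soft limit of 8 preferred active databases, hard limit of 10 active databases on MBX1.
Set-MailboxServer -Identity MBX1 -MaximumPreferredActiveDatabases 8 -MaximumActiveDatabases 10

# Confirm the values that are now in effect.
Get-MailboxServer -Identity MBX1 |
    Format-List Name, MaximumActiveDatabases, MaximumPreferredActiveDatabases
```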

Troubleshooting mailbox role performance

When a server is unavailable, redundancy features for transport and high availability features for a mailbox role continue to provide service to end users. But what happens when a server is functional, but its performance is severely degraded?

Exchange 2013 has numerous workloads, each with its defined function. The replication service, for example, is responsible for the replication of log files to database copies, among other functions, and the transport component is responsible for the routing of messages. Each workload consumes system resources, such as CPU, memory, and network resources.

Each user connecting to the Exchange 2013 servers also consumes resources. The client application or mobile devices they use can have a direct impact on how many resources are consumed by a user. Actions taken by a user, such as changing a view in Outlook or performing a long-running search query against an archive mailbox, can also have an impact on the mailbox server resources. Third-party applications connecting to Exchange using one of many protocols also have an impact on resource consumption on mailbox servers.

Exchange 2010 provided user-throttling functionality, which allowed controlling how resources are consumed by individual users. This capability is available and has been expanded in Exchange 2013.

At release, Exchange 2013 also offered system workload management, which applies to system components and their impact on resource usage. The cmdlets that enabled you to manage system workloads have since been deprecated. These include the *-ResourcePolicy, *-WorkloadManagementPolicy, and *-WorkloadPolicy cmdlets.

New features in Exchange 2013 enable users to increase resource consumption for short periods without experiencing throttling or complete lockout. While lockout can still occur if users consume large amounts of resources, the lockout is temporary and the user is unblocked automatically as soon as usage budgets are recharged. You can set the rate at which users’ resource budgets are recharged. Exchange 2013 also uses burst allowances to let users consume a higher amount of resources for short periods of time without any throttling, while implementing traffic shaping to introduce small delays, before user activity causes a significant impact on the server. Introducing small delays reduces the request rate from the user, but it’s mostly unnoticeable by the user. This mechanism also helps prevent or reduce user lockouts.

Throttling policies in Exchange 2013 are managed by scopes. The built-in throttling policy has Global scope. This policy applies to all users in your organization, but it shouldn’t be confused with a policy that has Organization scope. The purpose of an organization policy is to customize throttling parameters whose values need to differ from the defaults defined in the global policy. If you need to customize any of the built-in throttling parameter values, you shouldn’t modify the global policy, because it might be overwritten by future updates. Instead, create an organization policy and include only the parameters that differ from the global policy. This policy also applies to all users.

You can also create a policy with the Regular throttling scope. Unlike the Global and Organization policies described above, these policies are applied to individual users. Regular scope policies are useful when you need to change throttling behavior for only a small subset of users or applications.

To manage throttling configuration, use the *-ThrottlingPolicy cmdlets. For example, you can use the New-ThrottlingPolicy cmdlet to create a new throttling policy with the Regular throttling policy scope. After customizing the required parameters, you can assign this policy to individual users as needed by using the Set-ThrottlingPolicyAssociation cmdlet. Alternatively, you can configure the throttling policy assigned to a user with the Set-Mailbox cmdlet. A policy can contain many throttling parameters. You can refer to the individual parameters in this TechNet article at http://technet.microsoft.com/en-us/library/dd351045(v=exchg.150).aspx.
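The workflow above can be sketched as follows. The policy name, the account name svc-archive, and the EwsMaxConcurrency value are all hypothetical and chosen only for illustration; EwsMaxConcurrency is one of many parameters the policy supports.

```powershell
# Create a Regular scope policy that overrides a single parameter.
New-ThrottlingPolicy -Name 'ArchiveAppPolicy' -ThrottlingPolicyScope Regular -EwsMaxConcurrency 40

# Assign the policy to one user or service account.
Set-ThrottlingPolicyAssociation -Identity 'svc-archive' -ThrottlingPolicy 'ArchiveAppPolicy'

# Alternative for mailbox-enabled accounts: assign the same policy with Set-Mailbox.
Set-Mailbox -Identity 'svc-archive' -ThrottlingPolicy 'ArchiveAppPolicy'
```

Only one of the two assignment commands is needed; both are shown to mirror the options described in the text.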

Monitoring database replication

Exchange 2013 provides built-in mechanisms to monitor database replication and database failovers.

Mailbox database copy status provides vital information about given database copies. Although you read about this earlier, let’s look at some of the status information the mailbox database copy returns and what it means.

  • Failed: When a database copy is in a failed state, the copy is unable to copy and replay log files, and it isn’t suspended by administrative action. Because the copy isn’t suspended, the system retries the failed operation periodically. If the system succeeds (for example, when the transient issue is resolved), the copy is marked as healthy.
  • Suspended: The database copy state changes to suspended when administrative action, such as running the Suspend-MailboxDatabaseCopy cmdlet, suspends the database copy. This isn’t an error state because it’s the direct result of an administrative action.
  • Healthy: The database copy is copying and replaying log files successfully.
  • ServiceDown: When the Microsoft Exchange Replication Service isn’t reachable or isn’t running on the server that hosts the database copy, this state is reported. Manual intervention to remediate the faulty service is required.
  • Resynchronizing: The mailbox database copy is suspected to have diverged from the active database. The system compares a diverged database copy with an active copy and tries to detect and resolve a divergence. The database copy returns to a healthy state if the divergence is resolved. If the error can’t be addressed, the copy is transitioned to a FailedAndSuspended state.
  • DisconnectedAndHealthy: This state is an indication that the database copy was in a healthy state before the loss of connectivity between an active database copy and the database copy reporting this state. Investigate network communication to remediate.
  • FailedAndSuspended: When a database copy is in this state, it requires manual intervention to remediate the underlying issue that caused the copy to fail. Unlike the Failed state, the system won’t retry the failed operation periodically.

Because of the verbosity and variety of status reported, the Get-MailboxDatabaseCopyStatus cmdlet can serve as a great monitoring and troubleshooting tool for database copies.

The Test-ReplicationHealth cmdlet is another such tool that can provide great insight into the replication health of database copies, as previously discussed.

Another great source of information regarding high-availability system state and mailbox database failures is the crimson channel event logs. In addition to the well-known Application, Security, and System event logs provided by Windows, a new category of event channels, known as crimson channels, was introduced in Windows Server. Crimson channel event logs store events from a single application or component, making it easier for the administrator to find relevant events.

Exchange 2013 logs events to the HighAvailability and MailboxDatabaseFailureItems crimson channels for DAGs and database copies. The HighAvailability channel contains events related to startup and shutdown of the replication service. The HighAvailability channel is also used by Active Manager to log events related to Active Manager role monitoring and database action events, such as database mount operations and log truncation.

The MailboxDatabaseFailureItems channel is used to log events associated with any database copy failures.
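Both channels can also be read from PowerShell with Get-WinEvent. The sketch below assumes the operational log names shown, which is how these channels are commonly exposed; you can confirm the exact names on your servers with Get-WinEvent -ListLog 'Microsoft-Exchange-*'. MBX1 is a hypothetical server name.

```powershell
# Assumed channel names; verify them with Get-WinEvent -ListLog before relying on them.
$channels = 'Microsoft-Exchange-HighAvailability/Operational',
            'Microsoft-Exchange-MailboxDatabaseFailureItems/Operational'

# Pull recent events from both channels on MBX1 and show the newest first.
Get-WinEvent -ComputerName MBX1 -LogName $channels -MaxEvents 50 |
    Sort-Object -Property TimeCreated -Descending |
    Format-Table TimeCreated, Id, LevelDisplayName, Message -AutoSize -Wrap
```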

When database copies fail over unexpectedly, it might be important to find out what caused the failover, whether an administrative action was involved, and why a particular passive copy was selected for activation. While this information is logged in the crimson event channels mentioned earlier, correlating multiple related events may be time-consuming. Exchange 2013 includes a script called CollectOverMetrics.ps1, which reads DAG member event logs and gathers information about database operations over a specified time period. The result of running this script can provide insight into information such as the time at which a switchover or failover operation started and ended, the server on which the database was mounted, the reason for the operation (such as an administrative action or a failure), and whether the operation completed successfully or failed. The output is written to a CSV file, and an HTML summary report can also be generated.

CollectReplicationMetrics.ps1 is another such script that collects metrics in real time. The script collects data from performance counters related to database replication. The script can collect performance counter data from multiple mailbox servers, write the data to a CSV file, and report various statistics.
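As a sketch, the following collects switchover and failover data for the last 24 hours and opens the HTML summary. The DAG name DAG1 is hypothetical, and the time window is an arbitrary choice. The script is run from the Exchange scripts folder, which the Exchange Management Shell exposes through the $exscripts variable.

```powershell
# Run from the Exchange Management Shell on a server with the Exchange scripts installed.
Set-Location $exscripts

# Report on the last 24 hours of database operations in DAG1.
$end = Get-Date
$start = $end.AddDays(-1)
.\CollectOverMetrics.ps1 -DatabaseAvailabilityGroup DAG1 -StartTime $start -EndTime $end -GenerateHtmlReport -ShowHtmlReport
```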
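A comparable sketch for this script follows. The DAG name DAG1, the one-hour duration, the one-minute sampling frequency, and the report folder C:\Reports are all assumptions made for illustration.

```powershell
# Run from the Exchange Management Shell on a server with the Exchange scripts installed.
Set-Location $exscripts

# Sample replication performance counters every minute for one hour and write the report files.
.\CollectReplicationMetrics.ps1 -DagName DAG1 -Duration '01:00:00' -Frequency '00:01:00' -ReportPath 'C:\Reports'
```

Because the script gathers data for the whole duration, the command doesn’t return until the collection period ends.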

Objective summary

  • Mailbox role performance is actively managed, and internal processes are automatically throttled when the system is under stress and required resources could be scarce. Exchange 2013 allowed for the configuration of system workload policies on release, but because an improper configuration might cause adverse effects, the *-WorkloadPolicy cmdlets have since been deprecated.
  • User actions could have an adverse impact on server performance. Exchange 2013 includes a default Global throttling policy to prevent a user or a third-party application from monopolizing resources on the server. If a change to the built-in throttling parameters is required, the best practice is to create a new Organization throttling policy and include parameters that differ from the built-in policy. A throttling policy with Regular scope can also be created if changes only need to apply to a single user or a subset of users.
  • Database replication, replay, and copy activation depend on many environmental health and configuration factors. Anything from disk space issues to network connectivity can affect the availability of a database copy or prevent data from replicating from the active copy to the other copies. Built-in Exchange cmdlets and event logs provide important insight into potential causes, and understanding the status codes can reduce the time to resolution by supporting a methodical approach to troubleshooting and remediation.

Objective review

Answer the following questions to test your knowledge of the information in this objective. You can find the answers to these questions and explanations of why each answer choice is correct or incorrect in the “Answers” section at the end of this chapter.

  1. When a mailbox database copy is activated on a different mailbox server, you’re asked to determine whether the failover was the result of an error on the active copy or of an administrative action. Which of the following tools would you use? Choose all that apply.

    1. CollectOverMetrics.ps1.
    2. Crimson event logs.
    3. Search-AdminAuditLog.
    4. Get-DatabaseAvailabilityGroup.
  2. When troubleshooting replication errors for a database copy, you notice all the copies of the database have a copy queue length greater than zero. You verified that all servers hosting passive database copies are able to communicate with the server hosting the active copy. Which of the following has the potential to cause this issue?

    1. Low disk space on servers hosting replica database copies.
    2. The required log file is missing on the server hosting the primary copy.
    3. A network issue resulting in the transmission failure of required log files.
    4. TCP chimney offload configuration is incorrect on the network adapter.
  3. When troubleshooting a DAG, you noticed that performance on a Mailbox server is degraded. You noticed that it has more active mailbox databases than the server is designed to host. Which action can help ensure that only a defined number of mailbox databases can be active at a time?

    1. Run the Set-MailboxServer cmdlet.
    2. Run the Update-MailboxDatabaseCopy cmdlet.
    3. Run the Set-DatabaseAvailabilityGroup cmdlet.
    4. Run the Add-ServerMonitoringOverride cmdlet.