Designing High Availability with Microsoft Exchange Server 2010

  • 7/15/2010

Risk Mitigation

Achieving high availability requires that risks are identified and addressed. Many organizations employ risk management practices to capture and address potential disruptions to business processes. These practices usually consist of the following phases:

  • Identification: This phase documents the areas of risk within the business. These range from the loss of a large customer and the associated revenue all the way to a disaster that destroys a company datacenter.

  • Assessment: This phase analyzes the identified risks to determine the probability and the impact of each.

  • Mitigation: This phase creates a plan for mitigating each potential risk. The mitigation plan for each risk falls into one of the following three categories:

    • Acceptance: The risk is accepted, usually because the probability of occurrence is so low that it doesn’t require mitigation or because the cost of mitigation outweighs the consequences of the risk. An example is the risk of datacenters that are 20 miles apart being affected by the same tornado. Although this is possible, the likelihood is so small that it is acceptable.

    • Transference: The risk is mitigated by obtaining insurance or by outsourcing the risk to others to manage. An example is outsourcing inbound anti-spam and antivirus filtering to Microsoft Exchange Hosted Services.

    • Reduction: The risk is managed to a point where it is less probable or can be recovered from quickly. An example is deploying a cross-site DAG in two datacenters to reduce the likelihood that a single site failure causes a messaging system outage.

  • Implementation: This phase puts the risk mitigation into practice.

  • Review: This phase evaluates the risk mitigation plan to verify that it has addressed the identified risks and to determine whether any new risks have been introduced.

Not only should risk management be practiced at the business level, but it must also be performed for IT solutions, such as the Exchange messaging environment. As you perform risk identification for your messaging environment, you may list disk failure, server motherboard failure, loss of Internet connectivity, security breaches, site failures, and employee mistakes as risks. The assessment and mitigation process may create a list similar to the one in Table 11-7.

TABLE 11-7 Exchange Risk Mitigation

| RISK | MITIGATION |
|---|---|
| Mailbox Server Disk Failure | Reduction: Use a RAID configuration or rely on DAG replication. |
| Server Motherboard Failure | Reduction: Use a DAG for Mailbox servers and deploy multiple Transport and Client Access servers. |
| DNS Server Failure | Reduction: Deploy multiple DNS servers and configure servers to use them. |
| Domain Controller Failure | Reduction: Deploy multiple domain controllers in each site. |
| Network Device Failure | Reduction: Deploy redundant network devices. |
| Loss of Internet Connectivity | Reduction: Add additional Internet providers. Transference: Host servers in a colocation facility. |
| Security Breaches | Reduction: Good update management; implement intrusion detection and prevention systems. Transference: Outsource security to an experienced third-party provider. |
| Site Failures | Reduction: Deploy a failover site. |
| Employee Mistakes | Reduction: Provide training for employees and automate many common tasks. |
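
The assessment and mitigation phases can be captured in a simple risk register. The following is a minimal sketch in Python, not a prescribed method: the `Risk` and `Strategy` types, the 1-to-5 rating scales, and the probability-times-impact score are all assumptions made for illustration, and the ratings shown are invented placeholders rather than figures from any real assessment.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    """The three mitigation categories described in the text."""
    ACCEPTANCE = "Acceptance"
    TRANSFERENCE = "Transference"
    REDUCTION = "Reduction"


@dataclass(frozen=True)
class Risk:
    name: str
    probability: int  # hypothetical scale: 1 (rare) to 5 (frequent)
    impact: int       # hypothetical scale: 1 (minor) to 5 (severe)
    strategy: Strategy
    plan: str

    @property
    def score(self) -> int:
        # Probability x impact is one common scoring convention,
        # assumed here; other weightings are equally valid.
        return self.probability * self.impact


def prioritize(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Illustrative ratings only; real values come from your own assessment.
register = [
    Risk("Mailbox server disk failure", 4, 3, Strategy.REDUCTION,
         "Use a RAID configuration or rely on DAG replication"),
    Risk("Site failure", 2, 5, Strategy.REDUCTION,
         "Deploy a failover site"),
    Risk("Loss of Internet connectivity", 2, 4, Strategy.REDUCTION,
         "Add additional Internet providers"),
    Risk("Same tornado affects both datacenters", 1, 5, Strategy.ACCEPTANCE,
         "No mitigation; likelihood judged acceptably low"),
]

for risk in prioritize(register):
    print(f"{risk.score:>2}  {risk.strategy.value:<12}  {risk.name}: {risk.plan}")
```

Sorting by score gives a rough order in which to plan mitigation work. Revisiting the ratings during the review phase keeps the register aligned with any new risks that a mitigation introduces.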

One of the best ways to mitigate risk is to periodically test any disaster avoidance or recovery practices that have been put into place. This allows these measures to be tested and refined in a controlled environment, and in the end reduces risk. Small details that are overlooked in a plan often cause delays in the recovery. For example, in some organizations the primary datacenter is located in the same facility as the office space. If the primary facility is no longer viable and the IT systems are operational in the secondary datacenter, the users will still need another location in which to work. The processes and procedures for accessing the new location and notifying customers must also be worked out.

These fire drills also provide the opportunity to teach employees the importance the business places on recovery, and they reinforce the mind-set of working toward that goal during all of their day-to-day responsibilities.