Disaster Recovery and Data Integrity

 

Introduction.  A disaster recovery plan attempts to anticipate the disasters that could hit an organization and sets out how to respond to them.  The organization needs to develop approaches that will diminish the impact of potential disasters and prepare for quick restoration of services.  Planning starts with questions such as the following.
  • What disasters could affect your site?
  • What are the chances that each of these disasters might occur?
  • What would be the costs for each of these disasters if they do occur?
  • How quickly do various aspects of the organization need to recover?
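
The second and third questions can be combined into a rough ranking.  The following is a minimal sketch that multiplies the yearly chance of each disaster by its cost if it occurs.  All names and figures are hypothetical placeholders, not data from any real site, and a real risk analysis would weigh many more factors.

```python
# Hypothetical figures for illustration only; real values come from a risk analysis.
risks = {
    "flood": {"annual_probability": 0.02, "cost_if_occurs": 500_000},
    "power outage": {"annual_probability": 0.5, "cost_if_occurs": 30_000},
    "earthquake": {"annual_probability": 0.001, "cost_if_occurs": 2_000_000},
}


def expected_annual_loss(risk):
    """Chance of the event in a given year times the cost if it happens."""
    return risk["annual_probability"] * risk["cost_if_occurs"]


def rank_risks(risks):
    """Return (name, expected loss) pairs, largest expected loss first."""
    scored = [(name, expected_annual_loss(r)) for name, r in risks.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for name, loss in rank_risks(risks):
    print(f"{name}: about {loss:,.0f} per year")
```

With these made-up numbers the frequent but cheap power outage ranks above the rare but expensive earthquake, which is the kind of result that helps decide where preparation money goes first.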

A disaster with respect to computing services is a catastrophic event that causes a massive outage or reduction in performance affecting an important aspect of the computing system.  The following list outlines some potential disasters.

  • Natural disasters
    • earthquake
    • hurricane
    • tornado
    • plague
    • lightning strike
    • fire
    • flood
  • Unnatural disasters
    • bomb
    • massive power outage
    • system intrusions

Risk Analysis.  The first step in developing a viable plan is to do some risk analysis.  This is often best performed by outside specialists.  Some risks are much more likely to occur than others.  Consider firms that have their buildings on or near fault lines.  These firms are much more at risk from earthquakes than firms that are far from known earthquake zones.  Other firms may be quite close to a coastline that is prone to hurricanes.  In each instance it is also important to assess what sorts of damage are most likely to occur.

More typical sorts of disasters have to do with power spikes or outages.  It is usually important to have surge protection so that spikes cannot directly affect your systems.  It is also typical to have some level of backup power, such as batteries or even standalone generators.  How much of this is necessary depends on the situation, so it is important to evaluate these eventualities.

Some firms may also have legal obligations to other firms.  For example, automatic teller machines for a fairly large number of banks went down when flooding occurred in New Jersey.  All of these banks were outsourcing their ATM services to the same vendor.  When unexpected flooding took out the provider, many banks were without ATM service for almost a week.

Internet infrastructure providers can have similar obligations.  One stock exchange had contracted with a variety of firms to ensure there was a continuous flow of information between San Francisco and New York City.  Their thinking was that by having alternative connection sources they would be much less vulnerable to outages.  It turned out that one day a backhoe operator in California took out a single major line and this alone took down the data flow.  Little did the exchange know that all of their vendors were contracting their cabling connection from the same source.
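
The stock exchange's problem can be checked for mechanically once the vendors' own suppliers are known.  The sketch below is only an illustration: the vendor and carrier names are hypothetical, and in practice the mapping has to be gathered by asking each vendor whose lines they actually use.

```python
# Hypothetical vendor-to-upstream mapping; none of these names come from
# the incident described above.
vendor_upstreams = {
    "Vendor A": {"Carrier X"},
    "Vendor B": {"Carrier X"},
    "Vendor C": {"Carrier X"},
}


def single_points_of_failure(vendor_upstreams):
    """Return the upstreams whose loss alone would cut off every vendor.

    A vendor survives the loss of an upstream only if it has at least one
    other upstream to fall back on.
    """
    all_upstreams = set().union(*vendor_upstreams.values())
    return {
        upstream
        for upstream in all_upstreams
        if all(ups <= {upstream} for ups in vendor_upstreams.values())
    }


print(single_points_of_failure(vendor_upstreams))
```

Three contracts look like redundancy, but here every vendor depends on the same carrier, so the check reports that carrier as a single point of failure.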

Any such legal obligations need to be worked out thoroughly in conjunction with the legal department.  These obligations translate into requirements in the disaster recovery plan.

Preparation.  It is almost always important to develop strategies to reduce the impact and costs of disasters.  Sometimes this can be done at little additional cost, though that is not always possible.  For example, the stock exchange could have limited its vulnerability to data flow breakdowns with just a bit more in-depth investigation.  The impact on costs would have been fairly minimal.

In the case of potential flooding it may well be enough to make certain that particular buildings are not in flood plains and that critical equipment is kept well out of reach of high water.  The impact of minor earthquakes can also be mitigated with little additional effort beyond the precautions that every site should take anyway.

In spite of designing to diminish the impact of disasters, you are still going to need to be able to recover from the unforeseen.  It is important to be able to restore essential systems into working condition in a timely manner. 

Restoring essential systems can actually mean rebuilding data and services on new equipment if the old equipment is not operational.  If this might prove to be the case then you need to prearrange sources for replacement hardware.  This may also mean negotiating agreements that give you priority over others in the region in having your needs fulfilled.  It also implies that you have good, well-protected backups stored at another location.

Things like power, telephone and network connectivity are often some of the most basic services you need to have functioning in order to get other services up and running again.

Power Backup.  If the services are important enough, it is essential to develop a data center equipped with backup power from alternative sources such as generators and batteries.  Centralizing essential resources in this way has many other advantages as well.

Without such elaborate facilities, the organization should at least make sure that important equipment has its own alternative power source and plenty of surge protection.

UPSs (uninterruptible power supplies) provide a limited amount of battery-based backup power.  They are designed to ensure a continuous flow of electricity, but they usually cannot sustain it for very long.
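
How long is "limited" can be estimated roughly from the battery capacity and the load.  The following sketch assumes a simple model in which usable energy is the battery's rated watt-hours times an inverter efficiency.  The function name, the default efficiency and the example figures are assumptions for illustration, and real runtimes are usually shorter because batteries age and deliver less energy at high discharge rates.

```python
def ups_runtime_minutes(battery_watt_hours, load_watts, efficiency=0.9):
    """Rough UPS runtime estimate: usable energy divided by load.

    efficiency is an assumed inverter efficiency, not a measured value.
    """
    if load_watts <= 0:
        raise ValueError("load_watts must be positive")
    hours = battery_watt_hours * efficiency / load_watts
    return hours * 60


# Hypothetical example: a 900 Wh battery carrying a 450 W load.
minutes = ups_runtime_minutes(900, 450)
print(f"Estimated runtime: {minutes:.0f} minutes")
```

An estimate like this mainly answers whether the UPS can bridge the gap until a generator starts or until systems can be shut down cleanly.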

Generators actually create electricity.  They are likely to be powered by gasoline, diesel, kerosene or some other fuel.  Using generators is essential if you are trying to ensure you have electricity even during more sustained power outages.

Data Integrity.  While we keep mentioning this indirectly, you also need to make sure that data is not altered by external sources.  If it is, then you also need to make sure you can restore it to what it should be.

  • Backups
  • Virus checking
  • Firewalls and security
  • Cleverness even in the face of the obvious

Loss of data integrity can arise from all kinds of causes, including computer viruses and competitive espionage.  The ability to reproduce data can also be essential to proving intellectual property rights in a court of law.

Most of these fall under the security design of the overall system and should already be well taken care of.  It is still important to consider them when designing recovery plans.
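
One way to notice that data has been altered is to record a checksum for every file and compare later.  The sketch below uses SHA-256 from Python's standard hashlib module.  The function names and the idea of keeping the manifest as an in-memory dictionary are assumptions made for illustration; a real manifest would need to be stored somewhere an intruder cannot modify it.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def changed_files(root, manifest):
    """Compare the tree under root with an earlier manifest.

    Returns three sorted lists: altered, missing and newly added files.
    """
    current = build_manifest(root)
    altered = [n for n in manifest if n in current and current[n] != manifest[n]]
    missing = [n for n in manifest if n not in current]
    added = [n for n in current if n not in manifest]
    return sorted(altered), sorted(missing), sorted(added)
```

Calling build_manifest on a known-good copy and changed_files some time later gives a list of files to restore from backup.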

Data Backup.  Storage devices can and do crash!  It is important to keep reasonably recent copies of important data somewhere else.  There are many questions that need to be asked and answered about data backup.

  • What should you back up?
    • back up everything?
      • installations
      • data
      • operating systems
    • important data
      • what's important?
    • policies need to be explicit
  • When should you back up?
    • there needs to be a schedule
    • usually not done during important business hours
    • automate if reasonable
    • extent of backup
      • full - everything selected for backup; likely to be done only intermittently
      • differential - only back up changes since the last full backup
      • incremental - only back up changes since the last backup of any kind
  • How should you back up?
    • what medium will be used?
      • tapes
      • hard disks
      • CDs
      • SAN - Storage Area Network
    • what software will be used?
      • Windows Backup
      • NetWare SBackup
      • UNIX tar
      • third-party products, usually fuller featured
        • ARCserve - Cheyenne
        • Backup Exec - Seagate
        • Norton Backup - Symantec
  • How automated should the backup be?
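
The difference between full, differential and incremental backups comes down to which cutoff time is used when selecting files.  The sketch below illustrates this with a hypothetical function and made-up timestamps; real backup software typically tracks this with archive flags or its own catalog rather than bare modification times.

```python
def select_for_backup(files, kind, last_full, last_backup):
    """Choose which files a backup run should copy.

    files       - dict mapping file name to modification time
    kind        - "full", "differential" or "incremental"
    last_full   - time of the last full backup
    last_backup - time of the most recent backup of any kind
    """
    if kind == "full":
        return sorted(files)
    if kind == "differential":
        cutoff = last_full
    elif kind == "incremental":
        cutoff = last_backup
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return sorted(name for name, mtime in files.items() if mtime > cutoff)


# Hypothetical timestamps: full backup at time 10, an incremental at time 15.
files = {"a.doc": 5, "b.xls": 12, "c.db": 18}
for kind in ("full", "differential", "incremental"):
    print(kind, select_for_backup(files, kind, last_full=10, last_backup=15))
```

The tradeoff shows up at restore time: a differential scheme needs only the last full backup plus the latest differential, while an incremental scheme needs the last full backup plus every incremental made since.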

Disk Fault Tolerance.  Disk fault tolerance involves combining multiple hard drives on a computer so that they can compensate for one another's failures.  It is commonly implemented as RAID - Redundant Array of Independent Disks.  Some of the most common RAID configurations are outlined below.

  • RAID Level 1
    • Disk Mirroring
      • generally requires two physical hard disks usually of same size
      • one disk is mirrored in the second
      • updates can occur automatically
      • updates can be invoked manually
    • Disk Duplexing
      • same as disk mirroring except the disks have different disk controllers
      • improves fault tolerance
  • RAID Level 3
    • Disk Striping with a Parity Drive
      • write data in stripes across multiple drives
      • write parity data to another drive reserved for this purpose
      • if a drive fails, its data can be regenerated from the remaining data and the parity information
      • requires a minimum of three hard drives
      • data striped in bytes
  • RAID Level 5
    • Disk Striping with Parity Stripes
      • write data in stripes across multiple drives
      • write parity data to stripes across multiple drives
      • if a drive fails, its data can be regenerated from the remaining data and the parity information
      • requires a minimum of three disks
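
Regeneration from parity works because parity is typically computed as the exclusive OR (XOR) of the data on the other drives.  The following is a minimal sketch of that idea with two data drives and one parity drive; the four-byte "drives" and the function name are made up for illustration, and a real controller works on much larger stripes.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Two hypothetical data drives and the parity computed from them.
drive_a = b"DISK"
drive_b = b"DATA"
parity = xor_blocks([drive_a, drive_b])

# Simulate losing drive_b: XOR the survivors to rebuild its contents.
rebuilt = xor_blocks([drive_a, parity])
print(rebuilt == drive_b)
```

The same function serves for both writing parity and rebuilding, since XORing any all-but-one subset of the blocks yields the missing one.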

Some less common approaches follow.

  • RAID Level 2
    • similar to RAID 3 except data is striped in bits and Hamming-code error correction is used instead of a single parity drive
  • RAID Level 4
    • same as RAID 3 except data is striped in blocks
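
The striped RAID levels differ mainly in the unit that is dealt out to each drive: bits, bytes or blocks.  The sketch below illustrates byte-level versus block-level striping across the data drives only.  The function is hypothetical, parity is left out, and bit-level striping is not shown.

```python
def stripe(data, n_drives, unit_size):
    """Deal data out to n_drives in round-robin units of unit_size bytes."""
    drives = [bytearray() for _ in range(n_drives)]
    for offset in range(0, len(data), unit_size):
        drive_index = (offset // unit_size) % n_drives
        drives[drive_index] += data[offset:offset + unit_size]
    return [bytes(d) for d in drives]


data = b"ABCDEFGH"
print(stripe(data, 2, 1))  # byte-level striping, as in RAID 3
print(stripe(data, 2, 4))  # block-level striping with 4-byte blocks
```

With byte-level striping every read touches every drive, while with block-level striping a small read can be served by a single drive.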

RAID can be implemented as either a hardware or software solution.  Hardware approaches have generally been regarded as faster and more reliable.  Many operating systems have built-in support for software RAID.

Clustering Technologies.  One of the major fundamentals involved in improving reliability of networks is redundancy.  This redundancy can be at the level of hard drives within a server.  It can be at the level of meshing interconnections between routers at the Internet level.  It can also occur at the server level.

Clustering involves grouping servers so that they compensate for each other.  If one server fails or loses capacity, the others should be configured to take over its work.  This can be done in many different ways.  Windows 2000 Advanced Server allows for this.  UNIX has had these sorts of features for quite some time.
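
One common way for cluster members to decide who should be serving is a heartbeat: each server reports in periodically, and a server that stays silent too long is treated as failed.  The sketch below shows only that selection logic.  The function name, server names and timeout are hypothetical, and it is not how any particular clustering product is implemented.

```python
def pick_active(servers, last_heartbeat, now, timeout=5.0):
    """Return the first server, in priority order, with a recent heartbeat.

    servers        - server names, highest priority first
    last_heartbeat - dict mapping server name to time of last heartbeat
    now            - current time, in the same units as the heartbeats
    timeout        - assumed maximum silence before a server counts as down
    """
    for name in servers:
        seen = last_heartbeat.get(name)
        if seen is not None and now - seen <= timeout:
            return name
    return None


# Hypothetical state: the primary has been silent for 10 seconds.
servers = ["primary", "standby"]
last_heartbeat = {"primary": 100.0, "standby": 109.0}
print(pick_active(servers, last_heartbeat, now=110.0))
```

A return value of None means no server has been heard from recently, which is the point at which the recovery plan, rather than the cluster, has to take over.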

The Icing.  In some instances it may be necessary to have redundant sites.  These sites may share the load during normal operations.  While it may result in some slowdowns, the remaining sites should be configured to take on the demands of a lost site and its services.  Planning for this is much easier if it is part of the initial design and is revisited as the design evolves.
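
Whether the surviving sites can really take on a lost site's demands is a matter of spare capacity.  The sketch below checks this under a deliberately simple assumption that load can be moved freely between sites; the site names and figures (think of them as requests per second) are hypothetical.

```python
def can_absorb(sites, lost):
    """Check whether the remaining sites have room for the lost site's load.

    sites - dict mapping site name to (capacity, normal_load)
    lost  - name of the site assumed to be out of service
    """
    lost_load = sites[lost][1]
    spare = sum(
        capacity - load
        for name, (capacity, load) in sites.items()
        if name != lost
    )
    return spare >= lost_load


# Hypothetical two-site setup.
sites = {"east": (120, 50), "west": (100, 60)}
for name in sites:
    print(name, "lost ->", "covered" if can_absorb(sites, name) else "overloaded")
```

In this made-up case the east site can cover for west, but not the other way around, which is the sort of gap worth finding before a disaster rather than during one.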

It is also important to be prepared for dealing with the media.  The media will want to know:

  • What happened?
  • What effect is it having on the organization?
  • When will services and capabilities be restored?

You need to be able to give answers to these questions.

It is usually important to have some sort of link to public relations personnel, either internal or external or both.  It is also important to plan ahead for how you will deal with the media.  Developing strategies and putting together press releases at the last minute can be nearly impossible during a disaster.  The plans need to include decision makers and those at the highest levels of the chain of command.