5 Levels of DR Planning
Level 1
Threat of disaster without evidence
Essentially, this level
encompasses everything that doesn't do damage to your data
systems or offer any proof of attack, but which could be a
publicity or regulatory nightmare. Common examples are posted
boasts about incursions into your network on blogs and Web
forums or claims that proprietary data was compromised even
though no evidence is offered.
The major issue
with these kinds of disasters is that you can't prove or
disprove them in many cases. Even if you have advanced security
measures in place, employee collusion can easily overcome
those measures without showing any weakness in the digital
security itself. Because this level of threat has no evidence
associated with it, you cannot readily refute the claim, and the
resulting bad publicity can be just as devastating to your
organization as actual data loss.
Level 2
Actual attack without data loss
Once an attacker has breached
your security digitally and there's evidence of his or her
attack, your IT staff will need to be able to show what happened
and how. In these cases, there is clear proof of the attack but
not of the extent of the attack. How far did they get into your
network, what did they see, what did they take? Just because
they didn't destroy anything doesn't mean you can call this
anything but a disaster.
Virus attacks, intruders, and
other types of Level 2 disasters are extremely difficult to deal
with. Generally, you can prepare for them only by implementing
proper security measures and by using penetration-testing
tools, but when these disasters strike, they do so, by their very
nature, via the method you least expect. For virus attacks,
immediate quarantine is necessary both for the infected files
and for the infected server systems. Failure to move quickly to
stop the spread of the infection can lead to more and more
damage as the minutes tick by. This may mean suspending e-mail
service, locking out file servers, or other actions that
interrupt production for your users, but in the end it will
save the remainder of your data from the same fate as the data
that is already under attack.
For network intrusions, not only
do you have to quarantine the affected systems, but you also
have to find the security hole that the intruder used. This must
be done quickly, and a patch must be found and applied immediately to make
sure others don't come in the same way. Since the attack was
against your systems specifically, you may also want to attempt
to find out who the intruder is, if you have the time and proper
equipment to do so.
After you have dealt with the
original attack, your next steps are to salvage as much data as
you can and take preventive measures to make sure the same
attack doesn’t occur again. This could mean anything from
running antivirus tools to performing extensive analyses to see
what data was viewed by an intruder. Document everything
methodically and completely, as insurance carriers and your
company's management will be looking for this information in the
aftermath. Testing with variations of the same attack, changing
virus protection schemes, and other strategies can help to make
sure you don’t fall prey to a simple change in the same method
someone used to attack you once already.
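One way to keep the documentation methodical is to record each action with a timestamp as it happens. The following is a minimal sketch only; the `IncidentLog` class, its fields, and the report layout are hypothetical illustrations, not a format required by any insurer or standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class IncidentLog:
    """Hypothetical timestamped record of who did what during an incident."""
    incident: str
    entries: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def record(self, actor: str, action: str, when: Optional[datetime] = None) -> None:
        # Default to the current UTC time so entries from different
        # responders can be compared without time-zone confusion.
        if when is None:
            when = datetime.now(timezone.utc)
        self.entries.append((when, actor, action))

    def report(self) -> str:
        # Present entries in chronological order, even if logged late.
        lines = [f"Incident: {self.incident}"]
        for when, actor, action in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{when.isoformat()} | {actor} | {action}")
        return "\n".join(lines)


log = IncidentLog("Suspected intrusion on file server")
log.record("on-call admin", "Isolated the affected server from the network")
log.record("security lead", "Started review of access logs")
print(log.report())
```

Sorting at report time means an entry written down after the fact, with its real time supplied through `when`, still appears in the right place in the timeline.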
Level 2 disasters often don’t cause downtime all on their own. However,
the aftermath of dealing with them can cut off vital systems to
save the rest of your organization. The decisions on how you
will react will seriously affect your end users and therefore
must be part of your disaster recovery planning well before the
attack actually strikes your enterprise.
Level 3
Minor data/system loss
When data systems and data are
lost to natural causes, attacks, or system failures, you enter
the level that most people consider a disaster. Level 3 deals
mostly with smaller-scale issues: the loss of noncritical
systems or a single critical system that can be restored
quickly. The key difference between this level and those that
follow is that here we see disasters that have a high priority
but not a high urgency. Your Recovery Time Objective is probably
at least one business day, giving you time to react and correct.
End users
can continue to do their jobs without this data and/or without
these systems, but your staff must still get them back up and
running or find out what was lost. First, you'll need to figure
out what went wrong and ensure the damage is contained. This may
require verification of backup systems for other data systems,
test restorations of controlled and previously backed-up data,
and the determination of what caused the system failures. Your
goal is to make sure that you won't lose data or suffer the
long-term loss of a critical system. Once you've contained the
problem, you can begin to address it. This may mean rebuilding
the affected systems as quickly as possible and restoring all
known-good data, running antivirus and/or other security
measures to clean the systems and data, and performing other
measures to bring your systems back.
Level 4
Major data/system loss
Larger-scale disasters fall under
Level 4. This is where multiple critical systems fail at the
same time, possibly due to power loss or fire/flood in the data
center. Although you can correct for these issues, it will
require an immediate response from your staff, moving quickly to
get business-critical systems back up and running. Systems that
have a Recovery Time Objective of less than one business day
fall into this category when they fail.
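The Recovery Time Objective rule that separates Level 3 from Level 4 can be sketched as a small function. This is an illustration under assumptions: the function name is hypothetical, and treating one business day as 8 working hours is a choice made here, not something the levels themselves define.

```python
from datetime import timedelta

# Assumption: one business day is treated as 8 working hours.
BUSINESS_DAY = timedelta(hours=8)


def level_for_failed_system(rto: timedelta,
                            business_day: timedelta = BUSINESS_DAY) -> int:
    """Return 4 if the failed system's RTO is under one business day, else 3.

    Only distinguishes Level 3 from Level 4; Levels 1, 2, and 5 depend on
    factors other than the Recovery Time Objective.
    """
    if rto < timedelta(0):
        raise ValueError("RTO cannot be negative")
    return 4 if rto < business_day else 3


print(level_for_failed_system(timedelta(hours=2)))  # urgent system
print(level_for_failed_system(timedelta(days=2)))   # can wait a day or more
```

Adjust `business_day` to match whatever your organization actually agrees a business day to be.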
With Level 4 disasters, you don’t
have time to move methodically, but you must proceed with
extreme care whenever possible. Failure to do so could result in
a recurrence of whatever caused the disaster in the first place,
leading to more downtime. You will be forced to immediately
restore all data that you can verify is not corrupt,
and, if you have some form of high-availability solution, you must
allow your critical data systems to fail over and resume
operation. Initially, you will be acting fast to restore as much
of your data and services as quickly as you can so that end
users can resume working with those systems while you find out
what went wrong. In Level 4 disasters, you don't carry out a
complete investigation until after the restoration of service.
That being said, you must be as careful as possible while
restoring services. Moving too fast could easily result in a
recurrence of the disaster due to your staff missing some
critical fault and could actually compound the problem. If you
rush, misconfigurations or accidents could occur that cause
additional damage. Move quickly, but stay in control of the
situation at all times, no matter how loudly the executives are
screaming to get everything back up immediately. If you have
failover systems, perform a quick check to ensure that you have
a stable platform at your DR site and then restore operations.
If the platform isn't stable, you can make the changes necessary
to begin the data-restoration process, which must precede a return
to service. Either way, this emergency
calls for an acute awareness of your systems' health as you move
forward.
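The quick stability check described above can be sketched as a simple decision function. Everything here is an assumption for illustration: the `Action` names, the `choose_action` helper, and the example check names are hypothetical, and real checks would come from your own monitoring.

```python
from enum import Enum
from typing import List, Mapping, Tuple


class Action(Enum):
    FAIL_OVER = "resume operations at the DR site"
    RESTORE_FIRST = "fix the platform and restore data before returning to service"


def choose_action(has_failover: bool,
                  checks: Mapping[str, bool]) -> Tuple[Action, List[str]]:
    """Pick the next step and list any stability checks that failed.

    Failover is chosen only when a failover system exists, at least one
    check was actually run, and every check passed.
    """
    failed = sorted(name for name, ok in checks.items() if not ok)
    if has_failover and checks and not failed:
        return Action.FAIL_OVER, failed
    return Action.RESTORE_FIRST, failed


# Hypothetical check results gathered during the quick check.
action, failed = choose_action(
    has_failover=True,
    checks={"dr_power_ok": True, "replica_data_current": False},
)
print(action.value, failed)
```

Returning the failed checks alongside the decision gives staff a short list of what to fix first, which helps them stay in control while moving quickly.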
Level 5
Total data/system loss
The highest level in the system,
a Level 5 classification is invoked only in cases where a
disaster causes massive disruption in services. Hurricanes,
large-scale floods and fires, and building loss are usually
found here, with a twin disaster of loss of data systems and the
physical plant to recover to. Due to considerations such as loss
of space, loss of life, and psychological impact, recovery is an
exceptionally difficult—though necessary—task.
Although the largest organizations will be preparing for these
disasters with availability solutions to allow them to fail over
quickly to another data center outside the scope of the disaster
area, most companies will find that a response to a Level 5
disaster is truly a recovery effort instead of a failover
exercise. The vast majority of organizations won't be able to
afford or manage DR data centers that lie far enough away from
the primary facility to be helpful in this type of disaster, so
their DR systems will be affected by the same event that
disrupted service at the primary site. Even if you can't afford
to keep full-fledged systems up and running at another location,
you can contract to keep
backup tapes and other copies of your data in far-flung
locations. Many companies specialize in just such recovery
services, allowing you to find one that fits both your needs and
your budget. This will enable you to deal with the immediate
impact of the event and then recover your data to new systems
from the copies warehoused off-site after they're returned to
you by the contractor.
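Whether a DR site or an off-site storage location lies outside the scope of a disaster can be roughed out with a great-circle distance check. This is a sketch under assumptions: the function names and the idea of modeling the disaster area as a circle around the primary site are illustrative simplifications, and real events such as hurricanes or floods do not follow neat circles.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def distance_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, using the haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def outside_disaster_scope(primary, other_site, disaster_radius_km: float) -> bool:
    """True if other_site is farther from primary than the assumed radius.

    primary and other_site are (latitude, longitude) pairs in degrees.
    """
    return distance_km(*primary, *other_site) > disaster_radius_km


# Hypothetical coordinates: two sites one degree of longitude apart
# on the equator, checked against an assumed 200 km disaster radius.
print(outside_disaster_scope((0.0, 0.0), (0.0, 1.0), disaster_radius_km=200))
```

Running the same check against each candidate storage location, with a radius that reflects the largest event you plan for, gives a first-pass filter before you talk to a recovery-services contractor.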
At this level of disaster, you'll
also have to deal with nontechnical issues well before your
technology plant can come back online. Level 5 disasters almost
always include loss of physical space and—unfortunately—loss of
life as well. When your employees are no longer available to
enact a DR plan, you will need to act as quickly as possible,
given the situation, to find new staffers, train them, and get
things up and running again. Also keep in mind the immense
psychological impact of these kinds of disasters. Employees have
probably just lost their homes and possibly family members and
friends as well. Attempting to coerce such employees to
report back to work immediately is unfair and in many cases
unethical, so their absence could leave some large gaps in your DR efforts.
Temporary staff may be available in some cases for you to use in
the short term, but for the majority of cases you will simply
have to redefine your DR plan to take the extra recovery time
into consideration.
The best
planning you can do for a Level 5 emergency is to prepare
everyone for what they can expect and hold firm if executives
try to make you commit to anything unreasonable. Set up phone
chains and other alerting structures ahead of time, get your
data out of the scope of potential disasters that may affect
your production environment, and be ready to deal with the harsh
consequences of a massive disaster. The best you can do is to
prepare: Level 5 disasters will find every hole your DR plan has
to offer.
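A phone chain of the kind mentioned above can be modeled as a call tree in which each person notifies a few others. The sketch below is illustrative only: the names, the `run_phone_chain` function, and the rule that a caller takes over the contacts of anyone who cannot be reached are assumptions, chosen because staff may be unavailable in a Level 5 event.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Tuple


def run_phone_chain(chain: Dict[str, List[str]],
                    root: str,
                    unreachable: FrozenSet[str] = frozenset()) -> List[Tuple[str, str]]:
    """Return the (caller, callee) call attempts in the order they are made.

    chain maps each person to the people they are responsible for calling.
    If a callee is unreachable, the caller also calls that person's contacts.
    The root is assumed to be reachable.
    """
    attempts: List[Tuple[str, str]] = []
    seen = {root}
    callers = deque([root])
    while callers:
        caller = callers.popleft()
        pending = deque(chain.get(caller, []))
        while pending:
            callee = pending.popleft()
            if callee in seen:
                continue  # never call the same person twice
            seen.add(callee)
            attempts.append((caller, callee))
            if callee in unreachable:
                # The caller covers the missing person's part of the chain.
                pending.extend(chain.get(callee, []))
            else:
                callers.append(callee)
    return attempts


# Hypothetical chain: the DR lead calls two team leads, who call their staff.
chain = {"dr_lead": ["ops_lead", "db_lead"],
         "ops_lead": ["ops_1", "ops_2"],
         "db_lead": ["dba_1"]}
for caller, callee in run_phone_chain(chain, "dr_lead", frozenset({"ops_lead"})):
    print(f"{caller} -> {callee}")
```

Rehearsing the chain with a few people marked unreachable is a cheap way to find out ahead of time where the alerting structure depends on a single person.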