-
We've spent a long time or a couple of
-
videos now discussing our DR testing and
-
processes and our business continuity
-
testing and processes. Let's jump in and
-
take a look at some of the testing
-
methodologies available to us across both
-
our disaster recovery and our business
-
continuity. So in order for us to
-
test and plan, let's take a look at what
-
some of those are. I'm going to separate
-
this out into a couple of areas here,
-
and then we'll just sort of work through
-
this because there's a couple that I
-
want to, sort of, outline. Now the first
-
one is walkthroughs. Now we can outline a
-
couple of walkthroughs as we, just
-
finish writing that out.
-
A walkthrough is basically running our
-
tabletop exercises or scenarios, right? So
-
these are all, sort of, theory based and
-
they're, sort of, tabletop scenarios, and
-
you may have a bunch of people that, sort
-
of, come around.
-
That's horrible, isn't it? Tabletop
-
scenarios, I'll just write 'scen' for short. And
-
these are a bunch of people that we can
-
maybe get together in a boardroom so
-
we've got a table here, and we can all
-
just come around here, you might have a
-
few people. And basically what we can do
-
is we basically give scenarios on how
-
we're going to handle, and these are
-
obviously, you know, your-
-
Okay, I can't draw so I'm just going to
-
remove that all together and just write
-
boardroom. We're in a boardroom.
-
Okay, that's good. Boardroom, great! So
-
basically when we're in the boardroom, we
-
bring everyone involved around us, for
-
example the DR team, and we
-
sit around a tabletop and then, you know,
-
the leader of that, you know, who's
-
driving that, who's got the initiation, or
-
the actual delivery focus of that, picks a
-
scenario and we basically walk through
-
that scenario. So they'll say okay, well
-
we're gonna
-
walk through X.
-
Walk me through
-
how you're going to handle this
-
scenario. And then you've got, you know,
-
your DR team, which are your IT people
-
that are responsible for that. You may have
-
your network team,
-
you know, your network engineering team.
-
You have your systems guys, girls. You
-
may have your, you know, your change
-
management team there, you may have, you
-
know, your engineering. Maybe you've got
-
your dev teams or dev
-
engineering team, you know, and so on,
-
right? So you've got the
-
IT people that are going to be
-
responsible for that process of
-
restoration should we go down that path. And
-
we'll run through, you know,
-
site goes offline and, you know,
-
we've got three sites.
-
Give an example that we've got three sites.
-
Now I'm just sort of spitballing here so
-
bear with me if I don't get anything
-
right, but I'm just going to walk through
-
this. So we've got three sites, alrighty.
-
Site x goes down, so this is location,
-
I don't know, Brisbane. Now Brisbane
-
branch has gone offline. Well, what do we
-
do?
-
Okay this is Sydney.
-
This is Melbourne. And obviously all
-
these are all connected and such. And
-
then we've got our backbones back here,
-
obviously they're connecting things as
-
well. So obviously these are all our
-
backbone infrastructure. Well Brisbane
-
location goes offline for whatever reason.
-
Someone, you know, walks into the data
-
center, into the comms room and they've
-
tripped over the cable and now our data
-
center is offline. Well what do we do? And
-
then we walk through that scenario on
-
how someone's going to recover from that
-
situation. Could be minor, it could be
-
significant depending on the scenario. So
-
basically the walkthrough is us walking
-
through. It carries the least risk of our
-
DR and business
-
continuity testing because we're talking,
-
but we're not actually doing anything. So
-
again, we're going to involve the
-
relevant people and
-
parties, and then we're going to walk
-
through that scenario based on those
-
SMEs and their knowledge
-
of how they're going to recover and
-
bring the systems back up to normal in
-
a timely manner. So
-
that's walkthroughs.
-
On the other side we've got
-
simulation. So we could run actual
-
simulations. Now a simulation could be a
-
physical walkthrough or basically what
-
we call something like a mock event,
-
and we give it a scenario, we give a very
-
specific scenario, and we walk through
-
what we're actually going to do. So we
-
simulate what we're going to do. If we're
-
using backups or the restoration of our
-
backups process, well we would log into
-
the backup app server. So if we're using
-
a specific vendor, we'll say okay, well
-
we're going to log into this server,
-
we're going to click our restoration, you
-
know, and then our process is, you know,
-
restore hard drive X from, you know,
-
server Y, and then that's going to take
-
maybe eight hours to do a full recovery,
-
and then I'm going to take that hard
-
drive, and then that's going to be our
-
[inaudible] from how we're going to recover or
-
whatever that process looks like. So
-
you'll simulate to the point of not
-
actually clicking
-
or doing anything, it's to the point of
-
action, right? So you're gonna, yes, I'm
-
gonna log into the server, I'm going to
-
look around, here's our hypervisors,
-
here's our infrastructure, and here's how
-
we're going to restore that process from
-
there. We're going to log into this
-
vendor's portal page, we're going to get a
-
copy of our off-site backups, whatever
-
that process looks like, right? So you run
-
through that mock simulation.
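-
As a rough sketch of that "up to the point of action" idea, here is a small dry-run runbook in Python. The `Step` class, the `destructive` flag and the runbook entries are all hypothetical names made up for illustration, not part of any vendor's tooling; the safe steps run, and anything marked destructive is only logged.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], None]
    destructive: bool = False  # True for the "click restore" kind of step


def run_simulation(steps: List[Step]) -> List[str]:
    """Walk every step: run the safe ones, stop short of destructive ones."""
    log = []
    for step in steps:
        if step.destructive:
            # Simulated only: we reach the point of action but do not execute.
            log.append(f"SKIPPED (point of action): {step.description}")
        else:
            step.action()
            log.append(f"DONE: {step.description}")
    return log


# Hypothetical runbook; the actions are placeholders that do nothing.
runbook = [
    Step("Log in to the backup server", lambda: None),
    Step("Locate the backup of hard drive X for server Y", lambda: None),
    Step("Start the full restore", lambda: None, destructive=True),
]

for line in run_simulation(runbook):
    print(line)
```

The log that comes out of a run like this can go straight into the documentation for the exercise.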
-
You touch the equipment, you trial it out,
-
up to the point of doing it but not
-
actively executing it. So you're not
-
going to go away and actually execute
-
your recovery, you're just going to
-
basically simulate it up to the point
-
of doing it. From here on, then we've
-
got something called a parallel
-
test. And parallel testing is
-
basically where we have two
-
environments, and you might have
-
something like a prod
-
and test environment
-
that is a part of this test. And then with the
-
parallel test we would recover our
-
production environment in that test
-
environment. So we would go through all
-
the restore process, but not take
-
production offline, so I'm going to say
-
not offline.
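-
One way to check a restore in a parallel test, as a minimal sketch: compare checksums of the restored copy in test against the source, reading only and never writing to production. The function names and directory layout are assumptions for illustration, and in practice you would likely compare against the snapshot the backup was taken from, since live production data keeps changing.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def checksums(root: Path) -> Dict[str, str]:
    """Map each file's relative path to its SHA-256 digest (read-only)."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def compare_restore(prod_root: Path, test_root: Path) -> List[str]:
    """Return relative paths that are missing or different in the restored copy."""
    prod = checksums(prod_root)
    test = checksums(test_root)
    return [name for name, digest in prod.items() if test.get(name) != digest]
```

An empty list back from `compare_restore` suggests the restored files match; anything in the list is a fault to write down and follow up.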
-
This would basically just be doing, you know,
-
we're just going to go away, we're going
-
to test and ensure the backups are
-
working correctly. If there are any faults
-
or lessons to learn or issues that we
-
need to define, then we know what they
-
are, we're aware of those, and everyone
-
knows what to do. So we're not taking
-
production offline, production remains
-
online. We're just going to
-
recover our production
-
environment: we're going to take our
-
prod environment, and then we're going
-
to replicate that into our test
-
environment. So we've got a test bed and
-
we're going to see how that process kind
-
of looks, but we're not going to tinker
-
with or touch our production, and
-
production will remain online during
-
testing. Now the other part of that is
-
our cutover, and the cutover is quite
-
similar in nature.
-
Again that, sort of, prod test
-
scenario, so I'm going to use that. So
-
let's just go prod
-
and then test.
-
And similar to that, we would go,
-
well, we're going to restore our prod
-
servers and then take prod offline and
-
bring the restored servers online. So
-
it's a full test, there's interruption
-
involved,
-
you know, obviously interrupting
-
production as well. So we're going to
-
do the switchover and there's
-
obviously interruption of some sort,
-
right? Now even if it's a minor
-
interruption of, you know, a second or two,
-
that's still an
-
interruption, right? So there will be some
-
sort of interruption, but the cutover
-
test is the full kit and caboodle, right?
-
It's the full test, [inaudible], it's the
-
highest risk because if something does
-
go wrong during that cutover,
-
obviously then it's going to be an
-
outage. So you have to be very mindful,
-
if you're going to do a cutover in any
-
state of testing, that you've either done
-
the parallel test or you've done some sort
-
of mock simulation, you've sort of
-
rehearsed it, you've understood it, and
-
you don't just go and do a cutover straight away. Now
-
if you're a smaller environment and you
-
don't really have much to impact,
-
I'd still caution against it because
-
a lot of things can go wrong. We want to
-
avoid any disruption or keep that as
-
minimal as possible. Again, I
-
probably wouldn't advise that we turn
-
off the infrastructure or turn off the
-
servers per se. I'd probably keep them
-
online or maybe disconnect them from
-
their network ports. That way the servers
-
still remain online, and if anything does go
-
wrong, we can obviously plug them back in and
-
obviously, you know, get things back up
-
and running depending on, you know, the
-
complexity and depending on how things
-
are situated and what's dependent on
-
what. So we want to make sure that we're
-
reducing risk and keeping our downtime as
-
minimal as possible. So again, the cutover
-
is running through that actual
-
simulation and actually doing everything,
-
and then restoring it into your
-
prod environment. So you will go away,
-
you'll restore to test, and then you'll
-
restore it back into prod. Again, you
-
would go through the full cutover, so you
-
will turn off the appliances if you do
-
want to, otherwise you can just
-
disconnect them from the network
-
connection, you know, depending on how you
-
actually want to run the cutover, but
-
the cutover essentially is running that
-
full test. From there,
-
once we've done everything, then we want
-
to go over and document. And this is a
-
vital part, equally as
-
important as the rest of them, because we
-
are going to want to document and keep
-
things updated, right? So RPO,
-
RTO. So
-
our recovery point objective and our
-
recovery time objective. What is our
-
recovery point? What's our recovery
-
time objective? What do they look like? So
-
did we meet those objectives? So I'm
-
going to say meet
-
objectives because obviously
-
everything's going to have some sort of
-
metrics associated with it. So did we
-
meet this? Did this occur in the right
-
manner and in the right time? Do we need to
-
work on it? Did something go wrong? Is
-
there room for improvement? So room for
-
improvement,
-
right, that's an 'I'. Room for
-
improvement because chances are, there's
-
something that's going to need improvement, right?
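-
The "did we meet our objectives" question can be sketched as a small check, assuming you recorded three timestamps during the test: when the last good backup was taken, when the outage started, and when service was restored. The function name and the example numbers are hypothetical. RPO here is the most data loss, measured in time, that we can accept; RTO is the longest we can accept being down.

```python
from datetime import datetime, timedelta


def met_objectives(last_backup, outage_start, service_restored, rpo, rto):
    """Compare measured data-loss window and downtime against RPO and RTO."""
    data_loss_window = outage_start - last_backup
    downtime = service_restored - outage_start
    return {
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
        "data_loss_window": data_loss_window,
        "downtime": downtime,
    }


# Hypothetical test results: backup at 02:00, outage at 09:00,
# service back at 17:00 after an eight-hour restore.
result = met_objectives(
    last_backup=datetime(2024, 1, 1, 2, 0),
    outage_start=datetime(2024, 1, 1, 9, 0),
    service_restored=datetime(2024, 1, 1, 17, 0),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=12),
)
print(result)
```

With those made-up numbers the downtime fits inside the RTO but the seven-hour gap since the last backup misses a four-hour RPO, which is exactly the kind of room for improvement you want written down.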
-
Did we do something wrong? Were we not
-
aware of something? Did somebody need
-
some training to do something else? You
-
know, it's a multitude of
-
different issues that,
-
you know, we can improve on. So
-
that's one, and then the third point here
-
that I want to sort of mention is
-
lessons learned. So lessons learned is what
-
are our key takeaways? Did we identify
-
something that needs updating because
-
something was missed? Did we maybe
-
change a backup solution and did we
-
not know how to, you know, do we now need
-
to account for those plans and document
-
them? Plus, you know, lots of other things,
-
right? So we don't know what the solution
-
is, you know, if we've maybe gone
-
through that solution [inaudible] and we've
-
maybe implemented a changed solution, you
-
know,
-
do we now need to account for
-
that, right? So if we've got that solution
-
there, maybe we haven't accounted for it. So
-
that could be something that aligns
-
with our documentation or maybe a role or
-
responsibility: who is now
-
responsible for that? Maybe that was
-
missed.
-
There's obviously a lot of things
-
that come out of the lessons learned.
-
Basically what you're saying is
-
lessons learned is what have we defined
-
and what did we learn during that
-
exercise? And then this could be through
-
a procurement, this could be
-
technological, this could be leadership,
-
this could be documentation, this could
-
be a report, this could be, you know, a bunch
-
of different areas that could improve
-
through that continuous cycle of
-
improvement around our disaster recovery
-
and business continuity. So, you know,
-
those are the three, sort of, areas around
-
testing our disaster recovery and
-
business continuity. So going through
-
your walkthroughs and there's obviously,
-
depending on the appetite of the
-
organization, there's no right solution
-
here for anyone. It's just what works,
-
and each customer or each
-
business is at a different phase, right?
-
You've got maybe six customers that are
-
doing, you know, cutover testing because
-
they're highly mature, they've done
-
simulations, they've done parallel
-
testing,
-
and now they're just at their cutover
-
phase where they're doing actual
-
simulation of events. But you've got
-
customers that are starting things out
-
and, you know, are quite sensitive to these
-
things, so you're going to run some
-
tabletops, walk through scenarios, and you,
-
sort of, gradually ease yourself into it.
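-
That easing-in, together with the earlier caution about rehearsing before a cutover, might be sketched like this. The enum and function names are hypothetical. Walkthrough as the lowest risk and cutover as the highest come from this video; putting simulation before parallel in between is an assumption based on the order they were presented.

```python
from enum import IntEnum
from typing import Set


class DrTest(IntEnum):
    """Test types, ordered from least to most risk (middle order assumed)."""
    WALKTHROUGH = 1
    SIMULATION = 2
    PARALLEL = 3
    CUTOVER = 4


def ready_for_cutover(completed: Set[DrTest]) -> bool:
    """A cutover should follow at least a mock simulation or a parallel test."""
    return DrTest.SIMULATION in completed or DrTest.PARALLEL in completed


def next_test(completed: Set[DrTest]) -> DrTest:
    """Suggest the lowest-risk test type that has not been run yet."""
    for test in DrTest:
        if test not in completed:
            return test
    return DrTest.CUTOVER  # everything has been rehearsed; repeat the full test


print(next_test({DrTest.WALKTHROUGH}).name)
print(ready_for_cutover({DrTest.WALKTHROUGH}))
```

This is only a suggestion engine for the conversation; the organization's appetite still decides how far along that list you actually go.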
-
So
-
each of these has its own, sort of,
-
very specific areas. There is no right
-
solution; you know, there is no silver
-
bullet, essentially. So I hope you've
-
enjoyed this overview/introduction into
-
testing of our disaster recovery and
-
business continuity. I hope you've
-
enjoyed this video. See you all in the
-
next video and thank you all for viewing.
-
Bye for now.