Types of Disaster Recovery and Business Continuity Testing: A Comprehensive Overview

  • 0:03 - 0:05
    We've spent a long time or a couple of
  • 0:05 - 0:08
    videos now discussing our DR testing and
  • 0:08 - 0:10
    processes and our business continuity
  • 0:10 - 0:13
    testing and processes. Let's jump in and
  • 0:13 - 0:14
    take a look at some of the testing
  • 0:14 - 0:17
    methodologies across both our
  • 0:17 - 0:20
    disaster recovery
  • 0:20 - 0:22
    and business continuity, and what's
  • 0:22 - 0:24
    available to us in order for us to
  • 0:24 - 0:26
    test and plan. Let's take a look at what
  • 0:26 - 0:28
    some of those are. I'm going to separate
  • 0:28 - 0:30
    this out into a couple of areas here,
  • 0:30 - 0:32
    and then we'll just sort of work through
  • 0:32 - 0:33
    this because there's a couple that I
  • 0:33 - 0:35
    want to, sort of, outline. Now the first
  • 0:35 - 0:38
    one is walkthroughs. Now we can outline a
  • 0:38 - 0:40
    couple of walkthroughs as we, just
  • 0:40 - 0:42
    finish writing that out.
  • 0:42 - 0:45
    Walkthroughs is basically running our
  • 0:45 - 0:47
    tabletop exercises or scenarios, right? So
  • 0:47 - 0:50
    these are all, sort of, theory based and
  • 0:50 - 0:53
    they're, sort of, tabletop scenarios, and
  • 0:53 - 0:54
    you may have a bunch of people that, sort
  • 0:54 - 0:56
    of, come around.
  • 0:56 - 0:59
    That's horrible, isn't it? Tabletop
  • 0:59 - 1:02
    scenarios, I'll just write 'scen' for short. And
  • 1:02 - 1:03
    these are a bunch of people that we can
  • 1:03 - 1:05
    maybe get together in a boardroom so
  • 1:05 - 1:07
    we've got a table here, and we can all
  • 1:07 - 1:09
    just come around here, you might have a
  • 1:09 - 1:11
    few people. And basically what we can do
  • 1:11 - 1:13
    is we basically give scenarios on how
  • 1:13 - 1:14
    we're going to handle, and these are
  • 1:14 - 1:17
    obviously, you know, your-
  • 1:17 - 1:19
    Okay, I can't draw so I'm just going to
  • 1:19 - 1:21
    remove that all together and just write
  • 1:21 - 1:24
    boardroom. We're in a boardroom.
  • 1:24 - 1:27
    Okay, that's good. Boardroom, great! So
  • 1:27 - 1:29
    basically when we're in the boardroom, we
  • 1:29 - 1:31
    bring everyone around us and do, and
  • 1:31 - 1:33
    perform either from the DR team, and we
  • 1:33 - 1:35
    sit around a tabletop and then, you know,
  • 1:35 - 1:37
    the leader of that, you know, who's
  • 1:37 - 1:39
    driving that, who's got the initiation, or
  • 1:39 - 1:41
    the actual delivery focus of that, picks a
  • 1:41 - 1:43
    scenario and we basically walk through
  • 1:43 - 1:45
    that scenario. So they'll say okay, well
  • 1:45 - 1:46
    we're gonna
  • 1:46 - 1:49
    walk through X.
  • 1:49 - 1:51
    Walk me through this scenario on
  • 1:51 - 1:53
    how you're going to handle this
  • 1:53 - 1:55
    scenario. And then you've got, you know,
  • 1:55 - 1:57
    your DR team, which are your IT people
  • 1:57 - 1:58
    that are responsible for that. You may have
  • 1:58 - 2:00
    your network team,
  • 2:00 - 2:01
    you know, your network engineering team.
  • 2:01 - 2:05
    You have your systems guys, girls. You
  • 2:05 - 2:08
    may have your, you know, your change
  • 2:08 - 2:10
    management team there, you may have, you
  • 2:10 - 2:12
    know, your engineering. Maybe you've got
  • 2:12 - 2:15
    your dev teams or dev
  • 2:15 - 2:17
    engineering team, you know, and so on,
  • 2:17 - 2:19
    right? So you've got your responsible
  • 2:19 - 2:20
    IT people that are going to be
  • 2:20 - 2:21
    responsible for that process of
  • 2:21 - 2:23
    restoration should we go down that. And
  • 2:23 - 2:26
    we'll run through, you know,
  • 2:26 - 2:28
    site goes offline and, you know,
  • 2:28 - 2:30
    we've got three sites.
  • 2:30 - 2:33
    Give an example that we've got three sites.
  • 2:33 - 2:35
    Now I'm just sort of spitballing here so
  • 2:35 - 2:36
    bear with me if I don't get anything
  • 2:36 - 2:38
    right, but I'm just going to walk through
  • 2:38 - 2:42
    this. So we've got three sites, alrighty.
  • 2:42 - 2:47
    Site x goes down, so this is location,
  • 2:48 - 2:50
    I don't know, Brisbane. Now Brisbane
  • 2:50 - 2:53
    branch has gone offline, well, what do we
  • 2:53 - 2:54
    do?
  • 2:54 - 2:56
    Okay this is Sydney.
  • 2:56 - 2:58
    This is Melbourne. And obviously all
  • 2:58 - 2:59
    these are all connected and such. And
  • 2:59 - 3:01
    then we've got our backbones back here,
  • 3:01 - 3:03
    obviously they're connecting things as
  • 3:03 - 3:04
    well. So obviously these are all our
  • 3:04 - 3:07
    backbone infrastructure. Well Brisbane
  • 3:07 - 3:09
    location goes offline for whatever reason.
  • 3:09 - 3:12
    Someone, you know, walks into the data
  • 3:12 - 3:13
    center, into the comms room, and they've
  • 3:13 - 3:16
    tripped over the cable and now our data
  • 3:16 - 3:18
    center is offline. Well what do we do? And
  • 3:18 - 3:19
    then we walk through that scenario on
  • 3:19 - 3:21
    how someone's going to recover from that
  • 3:21 - 3:23
    situation. Could be minor, it could be
  • 3:23 - 3:24
    significant depending on the scenario. So
  • 3:24 - 3:26
    basically the walkthrough is us walking
  • 3:26 - 3:29
    through. It's the least amount of risk
  • 3:29 - 3:31
    with our DR and obviously business
  • 3:31 - 3:33
    continuity testing because we're talking,
  • 3:33 - 3:35
    but we're not actually doing anything. So
  • 3:35 - 3:37
    again, we're going to involve the,
  • 3:37 - 3:39
    you know, the relevant people in the
  • 3:39 - 3:40
    parties, and then we're going to walk
  • 3:40 - 3:41
    through that scenario based on those
  • 3:41 - 3:45
    SMEs and obviously, their knowledge
  • 3:45 - 3:46
    of how they're going to recover and
  • 3:46 - 3:48
    bring the systems back up to normal in
  • 3:48 - 3:51
    obviously, a time-sensitive approach. So
  • 3:51 - 3:53
    that's walkthroughs.
  • 3:53 - 3:54
    To the other side we've got
  • 3:54 - 3:57
    simulation. So we could run actual
  • 3:57 - 3:59
    simulations. Now simulation could be a
  • 3:59 - 4:01
    physical walkthrough or basically what
  • 4:01 - 4:04
    we call something like a mock event,
  • 4:04 - 4:06
    and we give it a scenario, we give a very
  • 4:06 - 4:09
    specific scenario, and we walk through
  • 4:09 - 4:11
    what we're actually going to do. So we
  • 4:11 - 4:13
    simulate what we're going to do. If we're
  • 4:13 - 4:15
    using backups or the restoration of our
  • 4:15 - 4:17
    backups process, well we would log into
  • 4:17 - 4:19
    the backup app server. So if we're using
  • 4:19 - 4:21
    a specific vendor, we'll say okay, well
  • 4:21 - 4:22
    we're going to log into this server,
  • 4:22 - 4:25
    we're going to click our restoration, you
  • 4:25 - 4:26
    know, and then our process is, you know,
  • 4:26 - 4:29
    restore hard drive X from, you know,
  • 4:29 - 4:31
    server Y, and then that's going to take
  • 4:31 - 4:34
    maybe eight hours to do a full recovery,
  • 4:34 - 4:36
    and then I'm going to take that hard
  • 4:36 - 4:37
    drive, and then that's going to be our
  • 4:37 - 4:39
    [inaudible] from how we're going to recover or
  • 4:39 - 4:40
    whatever that process looks like. So
  • 4:40 - 4:43
    you'll simulate to the point of not
  • 4:43 - 4:45
    actually clicking
  • 4:45 - 4:48
    or doing anything, it's to the point of
  • 4:48 - 4:50
    action, right? So you're gonna, yes, I'm
  • 4:50 - 4:52
    gonna log into the server, I'm going to
  • 4:52 - 4:54
    look around, here's our hypervisors,
  • 4:54 - 4:56
    here's our infrastructure, and here's how
  • 4:56 - 4:57
    we're going to restore that process from
  • 4:57 - 4:58
    there. We're going to log into this
  • 4:58 - 5:00
    vendor's portal page, we're going to get a
  • 5:00 - 5:03
    copy of our off-site backups, whatever
  • 5:03 - 5:04
    that process looks like, right? So you run
  • 5:04 - 5:07
    through that mock simulation.
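The "up to the point of action" idea can be sketched as a dry run of a restore runbook. This is only a minimal illustration: the `Step` class, the `run_runbook` function and the three step descriptions are hypothetical names made up for this example, not any backup vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step: what it is, and the action that would perform it."""
    description: str
    action: Callable[[], None]


def run_runbook(steps: List[Step], dry_run: bool = True) -> List[str]:
    """Walk every step in order; only call the action when dry_run is False."""
    log = []
    for number, step in enumerate(steps, start=1):
        if dry_run:
            # Simulation: announce the step but stop short of executing it.
            log.append(f"[SIMULATED] step {number}: {step.description}")
        else:
            step.action()
            log.append(f"[EXECUTED] step {number}: {step.description}")
    return log


# Hypothetical steps; each action only records that it was called.
executed = []
steps = [
    Step("log in to the backup server", lambda: executed.append("login")),
    Step("select restore of drive X from server Y", lambda: executed.append("select")),
    Step("start the full recovery", lambda: executed.append("restore")),
]

for line in run_runbook(steps, dry_run=True):
    print(line)
```

With `dry_run=True` nothing is appended to `executed`, which mirrors a simulation where the team touches the tooling but never clicks restore.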
  • 5:07 - 5:09
    You touch the equipment, you trial it out,
  • 5:09 - 5:11
    but to the point of doing it but not
  • 5:11 - 5:13
    actively executing it. So you're not
  • 5:13 - 5:15
    going to go away and actually execute
  • 5:15 - 5:17
    your recovery, you're just going to
  • 5:17 - 5:19
    basically, simulate it up to the point of
  • 5:19 - 5:21
    doing it. From here on, then we've
  • 5:21 - 5:25
    got something to do with a parallel
  • 5:26 - 5:29
    test. And parallel testing is something
  • 5:29 - 5:31
    like, basically if we have two
  • 5:31 - 5:32
    environments, and you might have
  • 5:32 - 5:35
    something like a prod
  • 5:35 - 5:38
    and test environment
  • 5:38 - 5:40
    that is a part of this test. And then with the
  • 5:40 - 5:42
    parallel test we would recover our
  • 5:42 - 5:44
    production environment in that test
  • 5:44 - 5:46
    environment. So we would go through all
  • 5:46 - 5:48
    the restore process, but not take
  • 5:48 - 5:50
    production offline, so I'm going to say
  • 5:50 - 5:53
    not offline.
  • 5:54 - 5:56
    This would basically just be doing, you know,
  • 5:56 - 5:57
    we're just going to go away, we're going
  • 5:57 - 5:58
    to test and ensure the backups are
  • 5:58 - 6:01
    working correctly. If there are any faults
  • 6:01 - 6:02
    or lessons to learn or issues that we
  • 6:02 - 6:04
    need to define, then we know what they
  • 6:04 - 6:05
    are, we're aware of those, and everyone
  • 6:05 - 6:07
    knows what to do. So we're not taking
  • 6:07 - 6:09
    production offline, production remains
  • 6:09 - 6:10
    online. We're just going to take our
  • 6:10 - 6:12
    obviously,
  • 6:12 - 6:15
    take our, recover our production
  • 6:15 - 6:16
    environments, we're going to take our
  • 6:16 - 6:17
    prod environment, and then we're going
  • 6:17 - 6:18
    to replicate that into our test
  • 6:18 - 6:20
    environment. So we've got a test bed and
  • 6:20 - 6:22
    we're going to see how that process kind
  • 6:22 - 6:23
    of looks, but we're not going to tinker
  • 6:23 - 6:25
    with or touch our production, and
  • 6:25 - 6:27
    production will remain online during
  • 6:27 - 6:29
    testing. Now the other part of that is
  • 6:29 - 6:30
    our cutover and the cutover is quite
  • 6:30 - 6:33
    similar in that nature.
  • 6:33 - 6:35
    Again that, sort of, prod test
  • 6:35 - 6:37
    scenario, so I'm going to use that. So
  • 6:37 - 6:40
    let's just go prod
  • 6:40 - 6:42
    and then test.
  • 6:42 - 6:44
    And similar to that where we would go,
  • 6:44 - 6:45
    well, we're going to restore our prod
  • 6:45 - 6:48
    servers and then take prod offline and
  • 6:48 - 6:50
    bring the restored servers online. So
  • 6:50 - 6:52
    it's a full test, there's interruption
  • 6:52 - 6:54
    involved,
  • 6:54 - 6:55
    you know, obviously, interrupting
  • 6:55 - 6:57
    production as well. So we're going to
  • 6:57 - 6:59
    obviously, do the switch over and
  • 6:59 - 7:01
    obviously, interruption of some sort,
  • 7:01 - 7:03
    right? Now even if it's a minor
  • 7:03 - 7:05
    interruption of, you know, a second or two,
  • 7:05 - 7:07
    that's still an
  • 7:07 - 7:09
    interruption, right? So there will be some
  • 7:09 - 7:11
    sort of interruption, but the cutover
  • 7:11 - 7:13
    test is the full kit and caboodle, right?
  • 7:13 - 7:15
    It's the full test, [inaudible], it's the
  • 7:15 - 7:17
    highest risk because if something does
  • 7:17 - 7:20
    go wrong during that cutover,
  • 7:20 - 7:21
    obviously then it's going to be an
  • 7:21 - 7:23
    outage. So you have to be very mindful of
  • 7:23 - 7:25
    if you're going to do a cutover in any
  • 7:25 - 7:27
    state of testing, that you've either done
  • 7:27 - 7:29
    the parallel test or you've done some sort
  • 7:29 - 7:30
    of mock simulation, you've sort of
  • 7:30 - 7:32
    rehearsed it, you understood it, not just
  • 7:32 - 7:34
    go and do a cutover straight away. Now
  • 7:34 - 7:35
    if you're a smaller environment and you
  • 7:35 - 7:37
    don't have really much to impact,
  • 7:37 - 7:39
    I'm still cautioning against it because
  • 7:39 - 7:42
    a lot of things can go wrong. We want to
  • 7:42 - 7:44
    avoid any disruption or keep that as
  • 7:44 - 7:49
    minimal as possible. Again, I
  • 7:49 - 7:51
    probably wouldn't advise that we turn
  • 7:51 - 7:52
    off the infrastructure or turn off the
  • 7:52 - 7:54
    service per se, I'll probably keep them
  • 7:54 - 7:55
    online or maybe disconnect them from
  • 7:55 - 7:57
    their network ports, that way the servers
  • 7:57 - 7:58
    still remain online if anything does go
  • 7:58 - 8:01
    wrong, we can obviously plug them in and
  • 8:01 - 8:02
    obviously, you know, get things back up
  • 8:02 - 8:04
    and running depending on, you know, the
  • 8:04 - 8:05
    complexity and depending on how things
  • 8:05 - 8:07
    are situated and what's dependent on
  • 8:07 - 8:09
    what. So we want to make sure that we're
  • 8:09 - 8:12
    reducing risk and keeping our downtime
  • 8:12 - 8:15
    as minimal as possible. So again, the
  • 8:15 - 8:16
    cutover is running through that actual
  • 8:16 - 8:18
    simulation and actually doing everything,
  • 8:18 - 8:20
    and then restoring it into your
  • 8:20 - 8:22
    prod environment. So you will go away,
  • 8:22 - 8:23
    you'll restore your test, and then you'll
  • 8:23 - 8:25
    restore it back into prod. Again, you
  • 8:25 - 8:27
    would go through the full cutover, so you
  • 8:27 - 8:29
    will turn off the appliances if you do
  • 8:29 - 8:30
    want to, otherwise you can just
  • 8:30 - 8:31
    disconnect them from the network
  • 8:31 - 8:33
    connection, you know, depending on how you
  • 8:33 - 8:34
    actually want to run the cutover, but
  • 8:34 - 8:35
    the cutover essentially is running that
  • 8:35 - 8:38
    full test. From there,
  • 8:38 - 8:40
    once we've done everything, then we want
  • 8:40 - 8:42
    to go over and document. And this is
  • 8:42 - 8:45
    a vital part, and it's equally
  • 8:45 - 8:47
    important as the rest of them because we
  • 8:47 - 8:49
    are going to want to document and keep
  • 8:49 - 8:53
    things updated, right? So RPO,
  • 8:53 - 8:55
    RTO. So
  • 8:55 - 8:59
    our recovery point objectives, so
  • 8:59 - 9:01
    what is our recovery point? What's our
  • 9:01 - 9:04
    recovery time objective? What do they look like? So
  • 9:04 - 9:06
    did we meet those objectives? So I'm
  • 9:06 - 9:08
    going to say meet
  • 9:08 - 9:10
    objectives because obviously
  • 9:10 - 9:11
    everything's going to have some sort of
  • 9:11 - 9:13
    metrics associated with it. So did we
  • 9:13 - 9:15
    meet this? Did this occur in the right
  • 9:15 - 9:17
    manner, in the right time? Do we need to
  • 9:17 - 9:18
    work on it? Did something go wrong? Is
  • 9:18 - 9:21
    there room for improvement? So room for
  • 9:21 - 9:23
    improvement,
  • 9:23 - 9:24
    right? That's an 'I'. Room for
  • 9:24 - 9:26
    improvement because chances are, there's
  • 9:26 - 9:27
    something that's going to need improvement, right?
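The "did we meet objectives" question can be written down as a small check. This is a rough sketch under assumptions: the function name `check_objectives`, the timestamps and the 4-hour RTO and 12-hour RPO targets are all hypothetical values for illustration; the eight-hour restore echoes the earlier backup example.

```python
from datetime import datetime, timedelta


def check_objectives(outage_start, service_restored, last_good_backup, rto, rpo):
    """Compare measured downtime and data-loss window against RTO and RPO."""
    downtime = service_restored - outage_start
    # Data written after the last good backup is what the restore cannot bring back.
    data_loss_window = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Hypothetical test results: an eight-hour restore against a four-hour RTO.
result = check_objectives(
    outage_start=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 17, 0),
    last_good_backup=datetime(2024, 1, 10, 3, 0),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=12),
)

for name, value in result.items():
    print(f"{name}: {value}")
```

A missed target like the RTO here is exactly the kind of item that would go under room for improvement in the test documentation.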
  • 9:27 - 9:29
    Did we do something wrong? Were we not
  • 9:29 - 9:30
    aware of something? Did someone need
  • 9:30 - 9:32
    some training to do something else? You
  • 9:32 - 9:34
    know, it's a multitude of
  • 9:34 - 9:36
    different issues that,
  • 9:36 - 9:39
    you know, we can improve on. So
  • 9:39 - 9:42
    that's one, and then the third point here
  • 9:42 - 9:43
    that I want to, sort of, mention is
  • 9:43 - 9:46
    lessons learned. So lessons learned is: what
  • 9:46 - 9:47
    are our key takeaways? Did we identify
  • 9:47 - 9:49
    something that needs updating because
  • 9:49 - 9:52
    something was missed? Did we maybe
  • 9:52 - 9:55
    change a backup solution, and did we
  • 9:55 - 9:57
    not know how to, you know, do we now need
  • 9:57 - 9:59
    to account for those plans and document
  • 9:59 - 10:01
    them? Plus, you know, lots of other things,
  • 10:01 - 10:02
    right? So we don't know what the solution
  • 10:02 - 10:04
    is, you know, if we've maybe gone
  • 10:04 - 10:06
    through that solution around and we've
  • 10:06 - 10:08
    maybe implemented a changed solution, you
  • 10:08 - 10:09
    know,
  • 10:09 - 10:11
    do we now need to account for
  • 10:11 - 10:13
    that, right? So if we've got that solution
  • 10:13 - 10:14
    there, maybe we have to account for it. So
  • 10:14 - 10:16
    that could be something that aligns
  • 10:16 - 10:18
    with documentation, or maybe a role or
  • 10:18 - 10:19
    responsibility, with who is now
  • 10:19 - 10:20
    responsible for that. Maybe that was
  • 10:20 - 10:21
    missed.
  • 10:21 - 10:23
    There's obviously a lot of things
  • 10:23 - 10:24
    that come out of the lessons learned.
  • 10:24 - 10:25
    Basically, what you're saying with
  • 10:25 - 10:27
    lessons learned is: what have we defined
  • 10:27 - 10:28
    and what did we learn during that
  • 10:28 - 10:30
    exercise? And then this could be through
  • 10:30 - 10:31
    procurement, this could be
  • 10:31 - 10:33
    technological, this could be leadership,
  • 10:33 - 10:35
    this could be documentation, this could
  • 10:35 - 10:38
    be reporting, this could be, you know, a bunch
  • 10:38 - 10:40
    of different areas that could improve
  • 10:40 - 10:42
    across that continuous cycle of
  • 10:42 - 10:44
    improvement around our disaster recovery
  • 10:44 - 10:47
    and business continuity. So, you know,
  • 10:47 - 10:49
    that's the three, sort of, areas around
  • 10:49 - 10:52
    testing our disaster recovery and
  • 10:52 - 10:54
    business continuity. So going through
  • 10:54 - 10:56
    your walkthroughs, and there's obviously,
  • 10:56 - 10:57
    depending on the appetite of the
  • 10:57 - 10:59
    organization, there's no right solution
  • 10:59 - 11:01
    here for anyone. It's just what works,
  • 11:01 - 11:02
    and each customer or each
  • 11:02 - 11:04
    business is at a different phase, right?
  • 11:04 - 11:07
    You've got maybe six customers that are
  • 11:07 - 11:08
    doing, you know, cutover testing because
  • 11:08 - 11:09
    they're highly mature, they've done
  • 11:09 - 11:11
    simulations, they've done parallel
  • 11:11 - 11:12
    testing,
  • 11:12 - 11:13
    and yeah, they're just up to the cutover
  • 11:13 - 11:14
    phase where they're doing actual
  • 11:14 - 11:16
    simulation of events. But you've got
  • 11:16 - 11:18
    customers that are starting things out
  • 11:18 - 11:19
    and, you know, quite sensitive to these
  • 11:19 - 11:21
    things, so you're going to run some
  • 11:21 - 11:23
    tabletops, walkthrough scenarios, and you,
  • 11:23 - 11:25
    sort of, gradually ease yourself into it.
  • 11:25 - 11:26
    So
  • 11:26 - 11:28
    each of these have their own, sort of,
  • 11:28 - 11:30
    very specific areas. There is no right
  • 11:30 - 11:32
    solution, you know, there is no silver
  • 11:32 - 11:34
    bullet, essentially. So I hope you've
  • 11:34 - 11:36
    enjoyed this overview and introduction into
  • 11:36 - 11:39
    testing of our disaster recovery and
  • 11:39 - 11:40
    business continuity. I hope you've
  • 11:40 - 11:42
    enjoyed this video. See you all in the
  • 11:42 - 11:44
    next video, and thank you all for viewing.
  • 11:44 - 11:47
    Bye for now.
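The four test types covered in the video, ordered by risk, can be summarised in a short sketch. The names `DrTest`, `interrupts_production` and `ready_for_cutover` are hypothetical and made up for this illustration; the rule encoded is only the caution from the video that a cutover should follow a rehearsal (a simulation or a parallel test), not a formal standard.

```python
from enum import IntEnum


class DrTest(IntEnum):
    """Test types ordered from lowest to highest risk, as described in the video."""
    WALKTHROUGH = 1  # tabletop discussion only, nothing is touched
    SIMULATION = 2   # rehearse up to the point of action, nothing executed
    PARALLEL = 3     # restore into a test environment, production stays online
    CUTOVER = 4      # full switch to restored systems, production is interrupted


def interrupts_production(test: DrTest) -> bool:
    """Only the cutover test takes production offline."""
    return test is DrTest.CUTOVER


def ready_for_cutover(completed: set) -> bool:
    """Cutover is only advisable after a simulation or a parallel test."""
    return DrTest.SIMULATION in completed or DrTest.PARALLEL in completed


history = {DrTest.WALKTHROUGH}
print(ready_for_cutover(history))

history.add(DrTest.PARALLEL)
print(ready_for_cutover(history))
```

An organization that is just starting out would sit at the walkthrough end of this scale and move up as its appetite and maturity allow.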
Video Language:
English
Duration:
11:48
