WEBVTT

00:00:01.550 --> 00:00:03.260
When you're troubleshooting complex

00:00:03.260 --> 00:00:05.330
network problems, you may find that the

00:00:05.330 --> 00:00:07.460
resolution is not as obvious as you

00:00:07.460 --> 00:00:09.440
might hope. In this video, we're going to

00:00:09.440 --> 00:00:11.750
step through a methodology that should

00:00:11.750 --> 00:00:13.610
help you troubleshoot any problem you

00:00:13.610 --> 00:00:16.609
run into. This is the flowchart of that

00:00:16.609 --> 00:00:18.650
network troubleshooting methodology, and

00:00:18.650 --> 00:00:20.660
we're going to step through each section

00:00:20.660 --> 00:00:23.120
of this flow and describe how it can

00:00:23.120 --> 00:00:24.980
help you solve those really difficult

00:00:24.980 --> 00:00:27.619
problems. The first thing you want to do

00:00:27.619 --> 00:00:29.930
is identify the problem. This may not be

00:00:29.930 --> 00:00:32.270
as straightforward as you might think.

00:00:32.270 --> 00:00:33.890
We first need to collect as much

00:00:33.890 --> 00:00:36.170
information as possible about the issue

00:00:36.170 --> 00:00:38.270
that's occurring. In the best possible

00:00:38.270 --> 00:00:39.920
scenario, you'll be able to duplicate

00:00:39.920 --> 00:00:42.260
this problem on demand. This will help

00:00:42.260 --> 00:00:43.760
later as we go through a number of

00:00:43.760 --> 00:00:45.950
testing phases to make sure that we are

00:00:45.950 --> 00:00:48.290
able to resolve this issue. When a

00:00:48.290 --> 00:00:50.149
problem happens on the network, it

00:00:50.149 --> 00:00:52.610
usually affects more than one device, and

00:00:52.610 --> 00:00:55.130
sometimes it affects those devices in

00:00:55.130 --> 00:00:57.020
different ways. You want to be sure to

00:00:57.020 --> 00:00:59.149
document all of the symptoms that may be

00:00:59.149 --> 00:01:01.219
occurring. Even if they are very

00:01:01.219 --> 00:01:03.440
different between different devices, you

00:01:03.440 --> 00:01:05.180
may find that a single problem is

00:01:05.180 --> 00:01:07.009
causing all of these different symptoms

00:01:07.009 --> 00:01:09.799
across these different devices. Many

00:01:09.799 --> 00:01:11.719
times, these issues will be identified by

00:01:11.719 --> 00:01:13.609
the end users, so they may be able to

00:01:13.609 --> 00:01:15.979
provide you with a lot more detail about

00:01:15.979 --> 00:01:17.539
what's really happening. You should

00:01:17.539 --> 00:01:19.310
question your users to find out what

00:01:19.310 --> 00:01:21.049
they're seeing and if any error messages

00:01:21.049 --> 00:01:23.240
are appearing. In this course, we've

00:01:23.240 --> 00:01:24.979
already discussed the importance of the

00:01:24.979 --> 00:01:27.020
change control process and knowing

00:01:27.020 --> 00:01:28.670
exactly what is changing in your

00:01:28.670 --> 00:01:31.279
environment. Without some type of formal

00:01:31.279 --> 00:01:33.259
change control process, someone may be

00:01:33.259 --> 00:01:35.450
able to make an unscheduled change that

00:01:35.450 --> 00:01:37.310
would affect many different people. So,

00:01:37.310 --> 00:01:39.289
when an error or network problem occurs,

00:01:39.289 --> 00:01:41.539
you may want to find out what the

00:01:41.539 --> 00:01:43.609
last thing was that changed on this network

00:01:43.609 --> 00:01:45.229
that could have affected all of these

00:01:45.229 --> 00:01:47.659
users. There are also going to be times

00:01:47.659 --> 00:01:49.729
when you're examining a number of

00:01:49.729 --> 00:01:52.039
different problems that may not actually

00:01:52.039 --> 00:01:54.439
be related to each other. It's always

00:01:54.439 --> 00:01:56.149
best to separate all of these different

00:01:56.149 --> 00:01:58.399
issues out so that you can approach and

00:01:58.399 --> 00:02:01.179
try to resolve each issue individually.

00:02:01.179 --> 00:02:03.409
Now that you've collected as much

00:02:03.409 --> 00:02:04.939
information as possible,

00:02:04.939 --> 00:02:07.310
you can examine all of these details to

00:02:07.310 --> 00:02:09.680
begin establishing a theory of what you

00:02:09.680 --> 00:02:11.810
think might be going wrong. Since the

00:02:11.810 --> 00:02:14.030
simplest explanation is often the most

00:02:14.030 --> 00:02:15.110
likely reason

00:02:15.110 --> 00:02:17.540
for the issue, that may be a good place

00:02:17.540 --> 00:02:19.580
to start. But, of course, you'll want to

00:02:19.580 --> 00:02:21.890
consider every possible thing that might

00:02:21.890 --> 00:02:24.080
be causing this issue. Maybe start with

00:02:24.080 --> 00:02:26.000
things that aren't completely obvious.

00:02:26.000 --> 00:02:28.400
You could start from the top of the OSI

00:02:28.400 --> 00:02:30.350
model with the way the application is

00:02:30.350 --> 00:02:32.390
working and work your way to the bottom.

00:02:32.390 --> 00:02:34.280
Or, you may want to start with the bottom

00:02:34.280 --> 00:02:36.260
with the cabling and wiring in your

00:02:36.260 --> 00:02:38.330
infrastructure and work your way up from

00:02:38.330 --> 00:02:40.600
there. You'll want to list out every

00:02:40.600 --> 00:02:43.490
possible cause for this problem. Your

00:02:43.490 --> 00:02:45.170
list might start with the easy theories

00:02:45.170 --> 00:02:47.240
at the top, but of course, include all of

00:02:47.240 --> 00:02:49.160
the more complex theories in this list

00:02:49.160 --> 00:02:51.800
as well. Now that we have a list of

00:02:51.800 --> 00:02:54.110
theories on how to resolve this issue, we

00:02:54.110 --> 00:02:56.209
can now test those theories. We may want

00:02:56.209 --> 00:02:58.489
to go into a lab. And if we are able to

00:02:58.489 --> 00:03:00.920
recreate this problem in the lab, then we

00:03:00.920 --> 00:03:03.800
can apply each theory until we find the

00:03:03.800 --> 00:03:05.840
one that happens to resolve the issue. If

00:03:05.840 --> 00:03:08.150
the first theory doesn't work, you may want

00:03:08.150 --> 00:03:10.070
to reset everything and try the second

00:03:10.070 --> 00:03:12.050
theory or the third. And if you run out

00:03:12.050 --> 00:03:13.790
of theories, you may want to go back and

00:03:13.790 --> 00:03:15.500
think of other things that might be

00:03:15.500 --> 00:03:17.540
causing this problem. This might be a

00:03:17.540 --> 00:03:19.340
good time to bring in an expert who

00:03:19.340 --> 00:03:20.840
knows about the application or the

00:03:20.840 --> 00:03:22.700
infrastructure, and they can give some

00:03:22.700 --> 00:03:24.890
theories and possible resolutions to

00:03:24.890 --> 00:03:27.590
test in the lab. Once you've tested a

00:03:27.590 --> 00:03:29.299
theory and found that the theory is

00:03:29.299 --> 00:03:31.489
going to resolve this issue, you can then

00:03:31.489 --> 00:03:33.380
begin putting together a plan of action.

00:03:33.380 --> 00:03:35.600
This is how you would implement this fix

00:03:35.600 --> 00:03:38.060
into a production network. You want to be

00:03:38.060 --> 00:03:39.920
sure that you're able to do this with a

00:03:39.920 --> 00:03:41.780
minimum amount of impact to the

00:03:41.780 --> 00:03:43.880
production network. And sometimes, you

00:03:43.880 --> 00:03:45.860
have to do this after hours when nobody

00:03:45.860 --> 00:03:48.260
else is working on the network. You want

00:03:48.260 --> 00:03:49.640
to be able to implement this with a

00:03:49.640 --> 00:03:52.250
minimum amount of impact to production

00:03:52.250 --> 00:03:54.380
traffic. So often, you'll have to do this

00:03:54.380 --> 00:03:57.410
after hours. A best practice is to

00:03:57.410 --> 00:03:59.510
document the exact steps that will be

00:03:59.510 --> 00:04:00.830
required to solve this particular

00:04:00.830 --> 00:04:03.560
problem. If it's replacing a cable, then

00:04:03.560 --> 00:04:04.970
the process will be relatively

00:04:04.970 --> 00:04:06.680
straightforward. But if you're upgrading

00:04:06.680 --> 00:04:08.989
software in a switch, a router, or a

00:04:08.989 --> 00:04:11.750
firewall, there may be additional tasks

00:04:11.750 --> 00:04:13.579
involved in performing this plan of

00:04:13.579 --> 00:04:15.500
action. You'll also want some

00:04:15.500 --> 00:04:17.540
alternatives if your plan doesn't go as

00:04:17.540 --> 00:04:19.700
designed. For example, you may run into

00:04:19.700 --> 00:04:21.590
problems when upgrading the software in

00:04:21.590 --> 00:04:23.599
a firewall. So, you may need an additional

00:04:23.599 --> 00:04:26.150
firewall or a way to roll back to the

00:04:26.150 --> 00:04:27.250
previous version.

00:04:27.250 --> 00:04:28.580
Now that you've

00:04:28.580 --> 00:04:30.229
documented your plan of action, you can

00:04:30.229 --> 00:04:31.729
take that to your change control team,

00:04:31.729 --> 00:04:33.770
and they can give you a window when you

00:04:33.770 --> 00:04:35.900
can implement that change. The actual

00:04:35.900 --> 00:04:38.240
fixing of the issue is probably going to

00:04:38.240 --> 00:04:40.769
be during off hours, during non-production

00:04:40.769 --> 00:04:41.990
times, and you may need to

00:04:41.990 --> 00:04:43.729
bring in other people to assist,

00:04:43.729 --> 00:04:46.539
especially if your window is very small.

00:04:46.539 --> 00:04:48.860
Once you have executed on your plan of

00:04:48.860 --> 00:04:50.719
action, your job isn't done yet.

00:04:50.719 --> 00:04:52.520
We need to make sure that all of these

00:04:52.520 --> 00:04:55.460
changes actually resolve the problem. So,

00:04:55.460 --> 00:04:56.750
now that the changes have been

00:04:56.750 --> 00:04:59.000
implemented, we now need to perform some

00:04:59.000 --> 00:05:01.190
tests. We may want to bring in the end

00:05:01.190 --> 00:05:03.139
users who first experienced this problem

00:05:03.139 --> 00:05:05.719
so that they can run through exactly the

00:05:05.719 --> 00:05:07.789
same scenario to tell you if the problem

00:05:07.789 --> 00:05:10.530
is resolved or if the problem still exists.

00:05:10.530 --> 00:05:12.379
This might also be a good time to

00:05:12.379 --> 00:05:14.240
implement some preventive measures. That

00:05:14.240 --> 00:05:16.460
way, we can either be informed that the

00:05:16.460 --> 00:05:18.430
problem is occurring, or we can provide

00:05:18.430 --> 00:05:20.569
alternatives that we can implement if

00:05:20.569 --> 00:05:23.719
that problem happens again. After the

00:05:23.719 --> 00:05:25.430
problem has been resolved, this is a

00:05:25.430 --> 00:05:27.439
perfect time to document the entire

00:05:27.439 --> 00:05:29.750
process from the very beginning to the

00:05:29.750 --> 00:05:31.639
very end. You'll, of course, want to

00:05:31.639 --> 00:05:33.379
provide as much information as possible.

00:05:33.379 --> 00:05:35.779
So, if somebody runs into this issue

00:05:35.779 --> 00:05:37.940
again, they can simply search your

00:05:37.940 --> 00:05:40.159
knowledge base, find that particular error

00:05:40.159 --> 00:05:42.050
that popped up, and know exactly the

00:05:42.050 --> 00:05:44.529
process you used to solve this last time.

00:05:44.529 --> 00:05:47.539
Many organizations have a help desk with

00:05:47.539 --> 00:05:49.250
case notes that they can reference, or

00:05:49.250 --> 00:05:50.930
you might have a separate knowledge base

00:05:50.930 --> 00:05:53.000
or wiki that you create where you're

00:05:53.000 --> 00:05:54.440
storing all of this important

00:05:54.440 --> 00:05:57.349
information for the future. A document

00:05:57.349 --> 00:05:58.969
that was created a number of years ago

00:05:58.969 --> 00:06:00.919
but still shows the importance of

00:06:00.919 --> 00:06:02.900
keeping this documentation over time is

00:06:02.900 --> 00:06:04.430
from Google Research, where they

00:06:04.430 --> 00:06:06.830
documented the failure trends in a large

00:06:06.830 --> 00:06:09.229
disk drive population. And because they

00:06:09.229 --> 00:06:11.449
were keeping extensive data over a long

00:06:11.449 --> 00:06:13.789
period of time, they were able to tell

00:06:13.789 --> 00:06:16.250
when a drive was starting to fail based

00:06:16.250 --> 00:06:17.990
on the types of errors that they were

00:06:17.990 --> 00:06:20.360
receiving. Being able to store all of

00:06:20.360 --> 00:06:22.039
this important information, being

00:06:22.039 --> 00:06:23.990
able to go back in time to see what

00:06:23.990 --> 00:06:26.270
happened, becomes a very important part

00:06:26.270 --> 00:06:28.629
of maintaining a network for the future.

00:06:28.629 --> 00:06:31.159
Let's summarize this troubleshooting

00:06:31.159 --> 00:06:33.050
methodology. We start with gathering as

00:06:33.050 --> 00:06:35.449
much information as possible, asking

00:06:35.449 --> 00:06:37.190
users about what they're seeing, and

00:06:37.190 --> 00:06:39.500
documenting any specific error messages.

00:06:39.500 --> 00:06:41.330
Then, we want to be able to create a

00:06:41.330 --> 00:06:42.050
number of

00:06:42.050 --> 00:06:43.789
theories that might solve this particular

00:06:43.789 --> 00:06:46.759
problem. And once we have this list, we

00:06:46.759 --> 00:06:48.379
want to be able to put it in the lab and

00:06:48.379 --> 00:06:50.389
try testing each one of these theories

00:06:50.389 --> 00:06:52.340
until we find the one that actually

00:06:52.340 --> 00:06:55.009
resolves the issue. From there, we can

00:06:55.009 --> 00:06:57.169
create a plan of action and document any

00:06:57.169 --> 00:06:59.599
possible problems that might occur. We

00:06:59.599 --> 00:07:01.280
can then get a time to implement the

00:07:01.280 --> 00:07:03.530
fix and put it into our production

00:07:03.530 --> 00:07:05.659
environment. And then we can verify and

00:07:05.659 --> 00:07:07.789
test and make sure that the entire

00:07:07.789 --> 00:07:10.729
system is now working as expected. And, of

00:07:10.729 --> 00:07:12.199
course, finally, we want to document

00:07:12.199 --> 00:07:14.780
everything that we did from the very

00:07:14.780 --> 00:07:16.729
beginning of our troubleshooting process

00:07:16.729 --> 00:07:19.387
all the way through to the end.