-
When you're troubleshooting complex
-
network problems, you may find that the
-
resolution is not as obvious as you
-
might hope. In this video, we're going to
-
step through a methodology that should
-
help you troubleshoot any problem you
-
run into. This is the flowchart of that
-
network troubleshooting methodology, and
-
we're going to step through each section
-
of this flow and describe how it can
-
help you solve those really difficult
-
problems. The first thing you want to do
-
is identify the problem. This may not be
-
as straightforward as you might think.
-
We first need to collect as much
-
information as possible about the issue
-
that's occurring. In the best possible
-
scenario, you'll be able to duplicate
-
this problem on demand. This will help
-
later as we go through a number of
-
testing phases to make sure that we are
-
able to resolve this issue. When a
-
problem happens on the network, it
-
usually affects more than one device, and
-
sometimes it affects those devices in
-
different ways. You want to be sure to
-
document all of the symptoms that may be
-
occurring. Even if they are very
-
different between different devices, you
-
may find that a single problem is
-
causing all of these different symptoms
-
across these different devices. Many
-
times, these issues will be identified by
-
the end users, so they may be able to
-
provide you with a lot more detail about
-
what's really happening. You should
-
question your users to find out what
-
they're seeing and if any error messages
-
are appearing. In this course, we've
-
already discussed the importance of the
-
change control process and knowing
-
exactly what is changing in your
-
environment. Without some type of formal
-
change control process, someone may be
-
able to make an unscheduled change that
-
would affect many different people. So,
-
when an error or network problem occurs,
-
you may want to find out what the
-
last thing was that changed on this network
-
that could have affected all of these
-
users. There are also going to be times
-
when you're examining a number of
-
different problems that may not actually
-
be related to each other. It's always
-
best to separate all of these different
-
issues out so that you can approach and
-
try to resolve each issue individually.
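
As a rough illustration of documenting symptoms and separating unrelated issues, here is a minimal Python sketch. The `Symptom` class, the `issue` label, and the device names are all hypothetical and made up for this example; they are not part of any particular tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Symptom:
    """One observation gathered while identifying the problem (hypothetical structure)."""
    device: str
    description: str
    error_message: str = ""    # exact text the user saw, if any
    issue: str = "unassigned"  # our own label for which problem this seems to belong to


def group_by_issue(symptoms):
    """Separate the collected symptoms so each suspected issue can be worked individually."""
    groups = defaultdict(list)
    for symptom in symptoms:
        groups[symptom.issue].append(symptom)
    return dict(groups)


# Made-up reports: two devices showing different symptoms of one issue,
# plus a report that turns out to be unrelated.
reports = [
    Symptom("laptop-12", "cannot reach the file server", "Network path not found", issue="A"),
    Symptom("voip-phone-3", "calls drop after 30 seconds", issue="A"),
    Symptom("printer-2", "paper jam warning", issue="B"),
]

grouped = group_by_issue(reports)
for issue, items in grouped.items():
    print(issue, [s.device for s in items])
```

Keeping the exact error message next to each device makes it easier to spot when very different symptoms trace back to a single cause.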
-
Now that you've collected as much
-
information as possible,
-
you can examine all of these details to
-
begin establishing a theory of what you
-
think might be going wrong. Since the
-
simpler explanation is often the most
-
likely reason
-
for the issue, that may be a good place
-
to start. But, of course, you'll want to
-
consider every possible thing that might
-
be causing this issue, including
-
things that aren't completely obvious.
-
You could start from the top of the OSI
-
model with the way the application is
-
working and work your way to the bottom.
-
Or, you may want to start with the bottom
-
with the cabling and wiring in your
-
infrastructure and work your way up from
-
there. You'll want to list out every
-
possible cause for this problem. Your
-
list might start with the easy theories
-
at the top, but of course, include all of
-
the more complex theories in this list
-
as well. Now that we have a list of
-
theories about what might be causing this issue, we
-
can test those theories. We may want
-
to go into a lab. And if we are able to
-
recreate this problem in the lab, then we
-
can apply each theory until we find the
-
one that happens to resolve the issue. If
-
the first theory doesn't work, you may want
-
to reset everything and try the second
-
theory or the third. And if you run out
-
of theories, you may want to go back and
-
think of other things that might be
-
causing this problem. This might be a
-
good time to bring in an expert who
-
knows about the application or the
-
infrastructure, and they can give some
-
theories and possible resolutions to
-
test in the lab. Once you've tested a
-
theory and found that the theory is
-
going to resolve this issue, you can then
-
begin putting together a plan of action.
-
This is how you would implement this fix
-
into a production network. You want to be
-
sure that you're able to do this with a
-
minimum amount of impact to the
-
production network. And sometimes, you
-
have to do this after hours when nobody
-
else is working on the network.
-
A best practice is to
-
document the exact steps that will be
-
required to solve this particular
-
problem. If it's replacing a cable, then
-
the process will be relatively
-
straightforward. But if you're upgrading
-
software in a switch, a router, or a
-
firewall, there may be additional tasks
-
involved in performing this plan of
-
action. You'll also want some
-
alternatives if your plan doesn't go as
-
designed. For example, you may run into
-
problems when upgrading the software in
-
a firewall. So, you may need an additional
-
firewall or a way to roll back to the
-
previous version.
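
One way to picture a documented plan of action with a fallback is the short Python sketch below. It is only an illustration under assumptions: `run_plan` and the step functions are hypothetical placeholders, and a real firewall upgrade would call whatever tooling your vendor provides.

```python
def run_plan(steps, rollback):
    """Run the documented (name, action) steps in order.

    If any step raises an exception, call the rollback and stop.
    """
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            rollback()
            return {"ok": False, "failed_step": name,
                    "completed": completed, "error": str(exc)}
        completed.append(name)
    return {"ok": True, "failed_step": None, "completed": completed, "error": ""}


# Placeholder actions standing in for the real work (all hypothetical).
log = []


def backup_config():
    log.append("backup saved")


def upgrade_firmware():
    # Simulate the upgrade going wrong so the rollback path is exercised.
    raise RuntimeError("upgrade image failed verification")


def restore_previous_version():
    log.append("rolled back")


result = run_plan(
    [("backup config", backup_config), ("upgrade firmware", upgrade_firmware)],
    restore_previous_version,
)
print(result["ok"], result["failed_step"], log)
```

Writing the steps down as an ordered list, with the rollback defined before the change window starts, is the point of the sketch; the specific actions would differ for every change.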
-
Now that you've
-
documented your plan of action, you can
-
take that to your change control team,
-
and they can give you a window when you
-
can implement that change. The actual
-
fixing of the issue is probably going to
-
be during off hours, during non-production
-
times, and you may need to
-
bring in other people to assist,
-
especially if your window is very small.
-
Once you have executed on your plan of
-
action, your job isn't done yet.
-
We need to make sure that all of these
-
changes actually resolve the problem. So,
-
now that the changes have been
-
implemented, we need to perform some
-
tests. We may want to bring in the end
-
users who first experienced this problem
-
so that they can run through exactly the
-
same scenario to tell you if the problem
-
is resolved or if the problem still exists.
-
This might also be a good time to
-
implement some preventive measures. That
-
way, we can either be informed that the
-
problem is occurring, or we can provide
-
alternatives that we can implement if
-
that problem happens again. After the
-
problem has been resolved, this is a
-
perfect time to document the entire
-
process from the very beginning to the
-
very end. You'll, of course, want to
-
provide as much information as possible.
-
So, if somebody runs into this issue
-
again, they can simply search your
-
knowledge base, find that particular error
-
that popped up, and know exactly the
-
process you used to solve this last time.
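
To make the knowledge base idea concrete, here is a minimal Python sketch of searching past cases by error message. The entry fields and the sample entries are invented for illustration; a real help desk system or wiki would have its own search.

```python
def search_knowledge_base(entries, error_text):
    """Return the entries whose recorded error contains error_text (case-insensitive)."""
    needle = error_text.lower()
    return [entry for entry in entries if needle in entry["error"].lower()]


# Hypothetical entries written up after earlier troubleshooting sessions.
knowledge_base = [
    {
        "error": "Duplicate IP address detected",
        "cause": "static address configured inside the DHCP pool",
        "fix": "moved the device to a reserved address outside the pool",
    },
    {
        "error": "Network path not found",
        "cause": "access port assigned to the wrong VLAN",
        "fix": "corrected the VLAN assignment on the switch port",
    },
]

matches = search_knowledge_base(knowledge_base, "network path")
for entry in matches:
    print(entry["error"], "->", entry["fix"])
```

The search only works if the exact error text was written down at the time, which is why the documentation step matters.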
-
Many organizations have a help desk with
-
case notes that they can reference, or
-
you might have a separate knowledge base
-
or wiki that you create where you're
-
storing all of this important
-
information for the future. A document
-
that was created a number of years ago
-
but still shows the importance of
-
keeping this documentation over time is
-
from Google Research, where they
-
documented the failure trends in a large
-
disk drive population. And because they
-
were keeping extensive data over a long
-
period of time, they were able to tell
-
when a drive was starting to fail based
-
on the types of errors that they were
-
receiving. Being able to store all of
-
this important information, being
-
able to go back in time to see what
-
happened, becomes a very important part
-
of maintaining a network for the future.
-
Let's summarize this troubleshooting
-
methodology. We start with gathering as
-
much information as possible, asking
-
users about what they're seeing, and
-
documenting any specific error messages.
-
Then, we want to be able to create a
-
number of
-
theories that might explain this particular
-
problem. And once we have this list, we
-
want to be able to put it in the lab and
-
try testing each one of these theories
-
until we find the one that actually
-
resolves the issue. From there, we can
-
create a plan of action and document any
-
possible problems that might occur. We
-
can then get a time to implement the
-
fix and put it into our production
-
environment. And then we can verify and
-
test and make sure that the entire
-
system is now working as expected. And, of
-
course, finally, we want to document
-
everything that we did from the very
-
beginning of our troubleshooting process
-
all the way through to the end.
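
The "test each theory in the lab" step from this summary can be sketched as a simple loop in Python. This is a sketch under assumptions, not a definitive procedure: `first_working_theory`, the callback names, and the sample theories are all hypothetical, and the lab here is just a dictionary standing in for real equipment.

```python
def first_working_theory(theories, apply_fix, problem_resolved, reset_lab):
    """Try each theory in order, resetting the lab after every failed attempt.

    Returns the first theory that resolves the problem (or None) and the
    list of theories that were tried.
    """
    tried = []
    for theory in theories:
        apply_fix(theory)
        tried.append(theory)
        if problem_resolved():
            return theory, tried
        reset_lab()  # put everything back before testing the next theory
    return None, tried


# A stand-in for the lab environment (hypothetical).
lab = {"applied": None}

# Easy theories first, more complex theories later in the list.
theories = ["reseat patch cable", "correct VLAN assignment", "upgrade switch firmware"]


def apply_fix(theory):
    lab["applied"] = theory


def problem_resolved():
    # In this made-up scenario, only the VLAN correction fixes the problem.
    return lab["applied"] == "correct VLAN assignment"


def reset_lab():
    lab["applied"] = None


winner, tried = first_working_theory(theories, apply_fix, problem_resolved, reset_lab)
print(winner, tried)
```

If the function returns `None`, that corresponds to running out of theories, which is the point where we would go back, gather more information, or bring in an expert.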