When you're troubleshooting complex
network problems, you may find that the
resolution is not as obvious as you
might hope. In this video, we're going to
step through a methodology that should
help you troubleshoot any problem you
run into. This is the flowchart of that
network troubleshooting methodology, and
we're going to step through each section
of this flow and describe how it can
help you solve those really difficult
problems. The first thing you want to do
is identify the problem. This may not be
as straightforward as you might think.
We first need to collect as much
information as possible about the issue
that's occurring. In the best possible
scenario, you'll be able to duplicate
this problem on demand. This will help
later as we go through a number of
testing phases to make sure that we are
able to resolve this issue. When a
problem happens on the network, it
usually affects more than one device, and
sometimes it affects those devices in
different ways. You want to be sure to
document all of the symptoms that may be
occurring. Even if they are very
different between different devices, you
may find that a single problem is
causing all of these different symptoms
across these different devices. Many
times, these issues will be identified by
the end users, so they may be able to
provide you with a lot more detail about
what's really happening. You should
question your users to find out what
they're seeing and if any error messages
are appearing. In this course, we've
already discussed the importance of the
change control process and knowing
exactly what is changing in your
environment. Without some type of formal
change control process, someone may be
able to make an unscheduled change that
would affect many different people. So,
when an error or network problem occurs, you'll want to find out the last thing that changed on this network that could have affected all of these
users. There's also going to be times
when you're examining a number of
different problems that may not actually
be related to each other. It's always
best to separate all of these different
issues out so that you can approach and
try to resolve each issue individually.
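If the problem involves reaching a particular device or service, it can also help to capture that evidence in a consistent way each time it happens. Here's a rough sketch of what that might look like, with a hypothetical hostname and log file, and ping and traceroute options that assume a Linux or macOS workstation:

```python
#!/usr/bin/env python3
"""Capture timestamped diagnostic output while a problem is occurring.

Hypothetical sketch: the target host and log file name are placeholders.
Assumes the Linux/macOS versions of ping and traceroute are on the PATH.
"""
import subprocess
from datetime import datetime

TARGET = "app-server.example.com"   # hypothetical host reported by users
LOGFILE = "symptom-log.txt"

COMMANDS = [
    ["ping", "-c", "4", TARGET],    # basic reachability and latency
    ["traceroute", TARGET],         # path the traffic is taking
]

with open(LOGFILE, "a") as log:
    for cmd in COMMANDS:
        stamp = datetime.now().isoformat(timespec="seconds")
        result = subprocess.run(cmd, capture_output=True, text=True)
        log.write(f"\n=== {stamp} | {' '.join(cmd)} ===\n")
        log.write(result.stdout)
        if result.stderr:
            log.write(result.stderr)
```

Running the same capture each time users report the issue makes it much easier to compare the symptoms you're seeing across different devices.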
Now that you've collected as much
information as possible,
you can examine all of these details to
begin establishing a theory of what you
think might be going wrong. Since the
simpler explanation is often the most
likely reason
for the issue, that may be a good place
to start. But, of course, you'll want to consider every possible cause of this issue, even the things that aren't completely obvious.
You could start from the top of the OSI
model with the way the application is
working and work your way to the bottom.
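As a rough illustration of that top-down pass, this sketch checks the application, then name resolution, then the transport connection, then basic reachability; the host, port, and URL are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Hypothetical top-down check: application first, then down the stack."""
import socket
import subprocess
import urllib.request

HOST = "app-server.example.com"   # hypothetical server
URL = f"https://{HOST}/"          # application-layer test
PORT = 443                        # transport-layer test

def check_application():
    with urllib.request.urlopen(URL, timeout=5) as resp:
        return resp.status == 200

def check_name_resolution():
    return bool(socket.gethostbyname(HOST))

def check_transport():
    with socket.create_connection((HOST, PORT), timeout=5):
        return True

def check_network():
    # -c 4 assumes a Linux/macOS ping
    return subprocess.run(["ping", "-c", "4", HOST],
                          capture_output=True).returncode == 0

for name, check in [("application", check_application),
                    ("name resolution", check_name_resolution),
                    ("transport", check_transport),
                    ("network", check_network)]:
    try:
        print(f"{name}: {'OK' if check() else 'FAIL'}")
    except OSError as err:
        print(f"{name}: FAIL ({err})")
```

The first layer that fails gives you a much shorter list of theories to work through.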
Or, you may want to start at the bottom, with the cabling and wiring in your infrastructure, and work your way up from
there. You'll want to list out every
possible cause for this problem. Your
list might start with the easy theories
at the top, but of course, include all of
the more complex theories in this list
as well. Now that we have a list of theories about what might be causing this issue, we can test those theories. We may want
to go into a lab. And if we are able to
recreate this problem in the lab, then we
can apply each theory until we find the
one that happens to resolve the issue. If the first theory doesn't work, you may want to reset everything and try the second theory or the third. And if you run out
of theories, you may want to go back and
think of other things that might be
causing this problem. This might be a
good time to bring in an expert who
knows about the application or the
infrastructure, and they can give some
theories and possible resolutions to
test in the lab. Once you've tested a
theory and found that the theory is
going to resolve this issue, you can then
begin putting together a plan of action.
This is how you would implement this fix
into a production network. You want to be sure that you're able to implement this fix with a minimum amount of impact to production traffic, so often you'll have to do this after hours, when nobody else is working on the network. A best practice is to
document the exact steps that will be
required to solve this particular
problem. If it's replacing a cable, then
the process will be relatively
straightforward. But if you're upgrading
software in a switch, a router, or a
firewall, there may be additional tasks
involved in performing this plan of
action. You'll also want some
alternatives if your plan doesn't go as
designed. For example, you may run into
problems when upgrading the software in
a firewall. So, you may need an additional
firewall or a way to roll back to the
previous version.
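One simple way to build that rollback option into your plan is to save a copy of the current configuration before you change anything, and restore that copy if the change doesn't work out. This is only a sketch with hypothetical file paths; a real firewall or switch would usually need a vendor-specific backup step instead:

```python
#!/usr/bin/env python3
"""Sketch of a backup-before-change step with a simple rollback path.

The paths are hypothetical; a real device would likely need a
vendor-specific export of its running configuration rather than a
local file copy.
"""
import shutil
from datetime import datetime
from pathlib import Path

CONFIG = Path("/etc/firewall/firewall.conf")      # hypothetical config file
BACKUP_DIR = Path("/var/backups/change-windows")

def backup_config() -> Path:
    """Copy the current config to a timestamped backup before the change."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = BACKUP_DIR / f"{CONFIG.name}.{stamp}"
    shutil.copy2(CONFIG, backup)                  # keep timestamps/permissions
    return backup

def rollback(backup: Path) -> None:
    """Restore the saved configuration if the change doesn't verify."""
    shutil.copy2(backup, CONFIG)

if __name__ == "__main__":
    saved = backup_config()
    print(f"Configuration saved to {saved}; roll back with rollback({saved!r})")
```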
Now that you've
documented your plan of action, you can
take that to your change control team,
and they can give you a window when you
can implement that change. The actual
fixing of the issue is probably going to
be during off hours, during non-production
times, and you may need to
bring in other people to assist,
especially if your window is very small.
Once you have executed on your plan of
action, your job isn't done yet.
We need to make sure that all of these
changes actually resolve the problem. So,
now that the changes have been
implemented, we now need to perform some
tests. We may want to bring in the end
users who first experienced this problem
so that they can run through exactly the same scenario and tell us whether the problem is resolved or still exists.
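If you captured the original symptoms as simple checks, you can re-run those same checks after the change as a quick pass or fail verification alongside the users' own testing. Here's a minimal sketch, again with a hypothetical host and port:

```python
#!/usr/bin/env python3
"""Minimal post-change verification: re-run the checks that failed before."""
import socket

HOST = "app-server.example.com"   # hypothetical host from the original ticket

def dns_resolves():
    return bool(socket.gethostbyname(HOST))

def port_reachable(port=443):
    with socket.create_connection((HOST, port), timeout=5):
        return True

failures = 0
for name, check in [("DNS resolves", dns_resolves),
                    ("TCP 443 reachable", port_reachable)]:
    try:
        ok = check()
    except OSError:
        ok = False
    if not ok:
        failures += 1
    print(f"{name}: {'PASS' if ok else 'FAIL'}")

print("All checks passed" if failures == 0
      else f"{failures} check(s) still failing")
```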
This might also be a good time to implement some preventive measures. That way, we can either be alerted if the problem starts occurring again, or have alternatives ready to implement if it does. After the
problem has been resolved, this is a
perfect time to document the entire
process from the very beginning to the
very end. You'll, of course, want to
provide as much information as possible.
So, if somebody runs into this issue
again, they can simply search your
knowledge base, find that particular error
that popped up, and know exactly the process you used to solve it last time.
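To make those notes easy to search later, each resolved problem can be stored as a small structured record. This sketch appends an entry to a hypothetical JSON knowledge base; the field names and file location are just one possible layout:

```python
#!/usr/bin/env python3
"""Append a structured case note to a simple JSON knowledge base (sketch)."""
import json
from datetime import date
from pathlib import Path

KB_FILE = Path("knowledge_base.json")   # hypothetical location

# Placeholder fields that mirror the troubleshooting steps
entry = {
    "date": date.today().isoformat(),
    "symptoms": "What the users reported, across all affected devices",
    "error_messages": ["Exact text of any errors that appeared"],
    "root_cause": "The theory that testing confirmed",
    "resolution": "The plan of action that was implemented",
    "preventive_measures": "Monitoring or alternatives put in place",
}

entries = json.loads(KB_FILE.read_text()) if KB_FILE.exists() else []
entries.append(entry)
KB_FILE.write_text(json.dumps(entries, indent=2))
print(f"{len(entries)} case note(s) stored in {KB_FILE}")
```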
Many organizations have a help desk with
case notes that they can reference, or
you might have a separate knowledge base
or wiki that you create where you're
storing all of this important
information for the future. A document
that was created a number of years ago
but still shows the importance of
keeping this documentation over time is
from Google Research, where they
documented the failure trends in a large
disk drive population. And because they
were keeping extensive data over a long
period of time, they were able to tell
when a drive was starting to fail based
on the types of errors that they were
receiving. Being able to store all of
this important information, being
able to go back in time to see what
happened, becomes a very important part
of maintaining a network for the future.
Let's summarize this troubleshooting
methodology. We start with gathering as
much information as possible, asking
users about what they're seeing, and
documenting any specific error messages.
Then, we want to be able to create a number of theories about what might be causing this particular problem. And once we have this list, we
want to be able to put it in the lab and
try testing each one of these theories
until we find the one that actually
resolves the issue. From there, we can
create a plan of action and document any
possible problems that might occur. We
can then get a window to implement the fix and put it into our production
environment. And then we can verify and
test and make sure that the entire
system is now working as expected. And, of
course, finally, we want to document
everything that we did from the very
beginning of our troubleshooting process
all the way through to the end.