1
00:00:01,550 --> 00:00:03,260
When you're troubleshooting complex

2
00:00:03,260 --> 00:00:05,330
network problems, you may find that the

3
00:00:05,330 --> 00:00:07,460
resolution is not as obvious as you

4
00:00:07,460 --> 00:00:09,440
might hope. In this video, we're going to

5
00:00:09,440 --> 00:00:11,750
step through a methodology that should

6
00:00:11,750 --> 00:00:13,610
help you troubleshoot any problem you

7
00:00:13,610 --> 00:00:16,609
run into. This is the flowchart of that

8
00:00:16,609 --> 00:00:18,650
network troubleshooting methodology, and

9
00:00:18,650 --> 00:00:20,660
we're going to step through each section

10
00:00:20,660 --> 00:00:23,120
of this flow and describe how it can

11
00:00:23,120 --> 00:00:24,980
help you solve those really difficult

12
00:00:24,980 --> 00:00:27,619
problems. The first thing you want to do

13
00:00:27,619 --> 00:00:29,930
is identify the problem. This may not be

14
00:00:29,930 --> 00:00:32,270
as straightforward as you might think.

15
00:00:32,270 --> 00:00:33,890
We first need to collect as much

16
00:00:33,890 --> 00:00:36,170
information as possible about the issue

17
00:00:36,170 --> 00:00:38,270
that's occurring. In the best possible

18
00:00:38,270 --> 00:00:39,920
scenario, you'll be able to duplicate

19
00:00:39,920 --> 00:00:42,260
this problem on demand. This will help

20
00:00:42,260 --> 00:00:43,760
later as we go through a number of

21
00:00:43,760 --> 00:00:45,950
testing phases to make sure that we are

22
00:00:45,950 --> 00:00:48,290
able to resolve this issue. When a

23
00:00:48,290 --> 00:00:50,149
problem happens on the network, it

24
00:00:50,149 --> 00:00:52,610
usually affects more than one device, and

25
00:00:52,610 --> 00:00:55,130
sometimes it affects those devices in

26
00:00:55,130 --> 00:00:57,020
different ways. You want to be sure to

27
00:00:57,020 --> 00:00:59,149
document all of the symptoms that may be

28
00:00:59,149 --> 00:01:01,219
occurring. Even if they are very

29
00:01:01,219 --> 00:01:03,440
different between different devices, you

30
00:01:03,440 --> 00:01:05,180
may find that a single problem is

31
00:01:05,180 --> 00:01:07,009
causing all of these different symptoms

32
00:01:07,009 --> 00:01:09,799
across these different devices. Many

33
00:01:09,799 --> 00:01:11,719
times, these issues will be identified by

34
00:01:11,719 --> 00:01:13,609
the end users, so they may be able to

35
00:01:13,609 --> 00:01:15,979
provide you with a lot more detail about

36
00:01:15,979 --> 00:01:17,539
what's really happening. You should

37
00:01:17,539 --> 00:01:19,310
question your users to find out what

38
00:01:19,310 --> 00:01:21,049
they're seeing and if any error messages

39
00:01:21,049 --> 00:01:23,240
are appearing. In this course, we've

40
00:01:23,240 --> 00:01:24,979
already discussed the importance of the

41
00:01:24,979 --> 00:01:27,020
change control process and knowing

42
00:01:27,020 --> 00:01:28,670
exactly what is changing in your

43
00:01:28,670 --> 00:01:31,279
environment. Without some type of formal

44
00:01:31,279 --> 00:01:33,259
change control process, someone may be

45
00:01:33,259 --> 00:01:35,450
able to make an unscheduled change that

46
00:01:35,450 --> 00:01:37,310
would affect many different people. So,

47
00:01:37,310 --> 00:01:39,289
when an error or network problem occurs,

48
00:01:39,289 --> 00:01:41,539
you may want to find out what the

49
00:01:41,539 --> 00:01:43,609
last thing was that changed on this network

50
00:01:43,609 --> 00:01:45,229
that could have affected all of these

51
00:01:45,229 --> 00:01:47,659
users. There are also going to be times

52
00:01:47,659 --> 00:01:49,729
when you're examining a number of

53
00:01:49,729 --> 00:01:52,039
different problems that may not actually

54
00:01:52,039 --> 00:01:54,439
be related to each other. It's always

55
00:01:54,439 --> 00:01:56,149
best to separate all of these different

56
00:01:56,149 --> 00:01:58,399
issues out so that you can approach and

57
00:01:58,399 --> 00:02:01,179
try to resolve each issue individually.

58
00:02:01,179 --> 00:02:03,409
Now that you've collected as much

59
00:02:03,409 --> 00:02:04,939
information as possible,

60
00:02:04,939 --> 00:02:07,310
you can examine all of these details to

61
00:02:07,310 --> 00:02:09,680
begin establishing a theory of what you

62
00:02:09,680 --> 00:02:11,810
think might be going wrong. Since the

63
00:02:11,810 --> 00:02:14,030
simpler explanation is often the most

64
00:02:14,030 --> 00:02:15,110
likely reason

65
00:02:15,110 --> 00:02:17,540
for the issue, that may be a good place

66
00:02:17,540 --> 00:02:19,580
to start. But, of course, you'll want to

67
00:02:19,580 --> 00:02:21,890
consider every possible thing that might

68
00:02:21,890 --> 00:02:24,080
be causing this issue. Maybe start with

69
00:02:24,080 --> 00:02:26,000
things that aren't completely obvious.

70
00:02:26,000 --> 00:02:28,400
You could start from the top of the OSI

71
00:02:28,400 --> 00:02:30,350
model with the way the application is

72
00:02:30,350 --> 00:02:32,390
working and work your way to the bottom.

73
00:02:32,390 --> 00:02:34,280
Or, you may want to start at the bottom

74
00:02:34,280 --> 00:02:36,260
with the cabling and wiring in your

75
00:02:36,260 --> 00:02:38,330
infrastructure and work your way up from

76
00:02:38,330 --> 00:02:40,600
there. You'll want to list out every

77
00:02:40,600 --> 00:02:43,490
possible cause for this problem. Your

78
00:02:43,490 --> 00:02:45,170
list might start with the easy theories

79
00:02:45,170 --> 00:02:47,240
at the top, but of course, include all of

80
00:02:47,240 --> 00:02:49,160
the more complex theories in this list

81
00:02:49,160 --> 00:02:51,800
as well. Now that we have a list of

82
00:02:51,800 --> 00:02:54,110
theories on how to resolve this issue, we

83
00:02:54,110 --> 00:02:56,209
can now test those theories. We may want

84
00:02:56,209 --> 00:02:58,489
to go into a lab. And if we are able to

85
00:02:58,489 --> 00:03:00,920
recreate this problem in the lab, then we

86
00:03:00,920 --> 00:03:03,800
can apply each theory until we find the

87
00:03:03,800 --> 00:03:05,840
one that happens to resolve the issue. If

88
00:03:05,840 --> 00:03:08,150
you tried the first theory, you may want

89
00:03:08,150 --> 00:03:10,070
to reset everything and try the second

90
00:03:10,070 --> 00:03:12,050
theory or the third. And if you run out

91
00:03:12,050 --> 00:03:13,790
of theories, you may want to go back and

92
00:03:13,790 --> 00:03:15,500
think of other things that might be

93
00:03:15,500 --> 00:03:17,540
causing this problem. This might be a

94
00:03:17,540 --> 00:03:19,340
good time to bring in an expert who

95
00:03:19,340 --> 00:03:20,840
knows about the application or the

96
00:03:20,840 --> 00:03:22,700
infrastructure, and they can give some

97
00:03:22,700 --> 00:03:24,890
theories and possible resolutions to

98
00:03:24,890 --> 00:03:27,590
test in the lab. Once you've tested a

99
00:03:27,590 --> 00:03:29,299
theory and found that the theory is

100
00:03:29,299 --> 00:03:31,489
going to resolve this issue, you can then

101
00:03:31,489 --> 00:03:33,380
begin putting together a plan of action.

102
00:03:33,380 --> 00:03:35,600
This is how you would implement this fix

103
00:03:35,600 --> 00:03:38,060
into a production network. You want to be

104
00:03:38,060 --> 00:03:39,920
sure that you're able to do this with a

105
00:03:39,920 --> 00:03:41,780
minimum amount of impact to the

106
00:03:41,780 --> 00:03:43,880
production network. And sometimes, you

107
00:03:43,880 --> 00:03:45,860
have to do this after hours when nobody

108
00:03:45,860 --> 00:03:48,260
else is working on the network. You want

109
00:03:48,260 --> 00:03:49,640
to be able to implement this with a

110
00:03:49,640 --> 00:03:52,250
minimum amount of impact to production

111
00:03:52,250 --> 00:03:54,380
traffic. So often, you'll have to do this

112
00:03:54,380 --> 00:03:57,410
after hours. A best practice is to

113
00:03:57,410 --> 00:03:59,510
document the exact steps that will be

114
00:03:59,510 --> 00:04:00,830
required to solve this particular

115
00:04:00,830 --> 00:04:03,560
problem. If it's replacing a cable, then

116
00:04:03,560 --> 00:04:04,970
the process will be relatively

117
00:04:04,970 --> 00:04:06,680
straightforward. But if you're upgrading

118
00:04:06,680 --> 00:04:08,989
software in a switch, a router, or a

119
00:04:08,989 --> 00:04:11,750
firewall, there may be additional tasks

120
00:04:11,750 --> 00:04:13,579
involved in performing this plan of

121
00:04:13,579 --> 00:04:15,500
action. You'll also want some

122
00:04:15,500 --> 00:04:17,540
alternatives if your plan doesn't go as

123
00:04:17,540 --> 00:04:19,700
designed. For example, you may run into

124
00:04:19,700 --> 00:04:21,590
problems when upgrading the software in

125
00:04:21,590 --> 00:04:23,599
a firewall. So, you may need an additional

126
00:04:23,599 --> 00:04:26,150
firewall or a way to roll back to the

127
00:04:26,150 --> 00:04:27,250
previous version.

128
00:04:27,250 --> 00:04:28,580
Now that you've

129
00:04:28,580 --> 00:04:30,229
documented your plan of action, you can

130
00:04:30,229 --> 00:04:31,729
take that to your change control team,

131
00:04:31,729 --> 00:04:33,770
and they can give you a window when you

132
00:04:33,770 --> 00:04:35,900
can implement that change. The actual

133
00:04:35,900 --> 00:04:38,240
fixing of the issue is probably going to

134
00:04:38,240 --> 00:04:40,769
be during off hours, during non-production

135
00:04:40,769 --> 00:04:41,990
times, and you may need to

136
00:04:41,990 --> 00:04:43,729
bring in other people to assist,

137
00:04:43,729 --> 00:04:46,539
especially if your window is very small.

138
00:04:46,539 --> 00:04:48,860
Once you have executed on your plan of

139
00:04:48,860 --> 00:04:50,719
action, your job isn't done yet.

140
00:04:50,719 --> 00:04:52,520
We need to make sure that all of these

141
00:04:52,520 --> 00:04:55,460
changes actually resolve the problem. So,

142
00:04:55,460 --> 00:04:56,750
now that the changes have been

143
00:04:56,750 --> 00:04:59,000
implemented, we now need to perform some

144
00:04:59,000 --> 00:05:01,190
tests. We may want to bring in the end

145
00:05:01,190 --> 00:05:03,139
users who first experienced this problem

146
00:05:03,139 --> 00:05:05,719
so that they can run through exactly the

147
00:05:05,719 --> 00:05:07,789
same scenario to tell you if the problem

148
00:05:07,789 --> 00:05:10,530
is resolved or if the problem still exists.

149
00:05:10,530 --> 00:05:12,379
This might also be a good time to

150
00:05:12,379 --> 00:05:14,240
implement some preventive measures. That

151
00:05:14,240 --> 00:05:16,460
way, we can either be informed that the

152
00:05:16,460 --> 00:05:18,430
problem is occurring, or we can provide

153
00:05:18,430 --> 00:05:20,569
alternatives that we can implement if

154
00:05:20,569 --> 00:05:23,719
that problem happens again. After the

155
00:05:23,719 --> 00:05:25,430
problem has been resolved, this is a

156
00:05:25,430 --> 00:05:27,439
perfect time to document the entire

157
00:05:27,439 --> 00:05:29,750
process from the very beginning to the

158
00:05:29,750 --> 00:05:31,639
very end. You'll, of course, want to

159
00:05:31,639 --> 00:05:33,379
provide as much information as possible.

160
00:05:33,379 --> 00:05:35,779
So, if somebody runs into this issue

161
00:05:35,779 --> 00:05:37,940
again, they can simply search your

162
00:05:37,940 --> 00:05:40,159
knowledge base, find that particular error

163
00:05:40,159 --> 00:05:42,050
that popped up, and know exactly the

164
00:05:42,050 --> 00:05:44,529
process you used to solve this last time.

165
00:05:44,529 --> 00:05:47,539
Many organizations have a help desk with

166
00:05:47,539 --> 00:05:49,250
case notes that they can reference, or

167
00:05:49,250 --> 00:05:50,930
you might have a separate knowledge base

168
00:05:50,930 --> 00:05:53,000
or wiki that you create where you're

169
00:05:53,000 --> 00:05:54,440
storing all of this important

170
00:05:54,440 --> 00:05:57,349
information for the future. A document

171
00:05:57,349 --> 00:05:58,969
that was created a number of years ago

172
00:05:58,969 --> 00:06:00,919
but still shows the importance of

173
00:06:00,919 --> 00:06:02,900
keeping this documentation over time is

174
00:06:02,900 --> 00:06:04,430
from Google Research, where they

175
00:06:04,430 --> 00:06:06,830
documented the failure trends in a large

176
00:06:06,830 --> 00:06:09,229
disk drive population. And because they

177
00:06:09,229 --> 00:06:11,449
were keeping extensive data over a long

178
00:06:11,449 --> 00:06:13,789
period of time, they were able to tell

179
00:06:13,789 --> 00:06:16,250
when a drive was starting to fail based

180
00:06:16,250 --> 00:06:17,990
on the types of errors that they were

181
00:06:17,990 --> 00:06:20,360
receiving. Being able to store all of

182
00:06:20,360 --> 00:06:22,039
this important information, being

183
00:06:22,039 --> 00:06:23,990
able to go back in time to see what

184
00:06:23,990 --> 00:06:26,270
happened, becomes a very important part

185
00:06:26,270 --> 00:06:28,629
of maintaining a network for the future.

186
00:06:28,629 --> 00:06:31,159
Let's summarize this troubleshooting

187
00:06:31,159 --> 00:06:33,050
methodology. We start with gathering as

188
00:06:33,050 --> 00:06:35,449
much information as possible, asking

189
00:06:35,449 --> 00:06:37,190
users about what they're seeing, and

190
00:06:37,190 --> 00:06:39,500
documenting any specific error messages.

191
00:06:39,500 --> 00:06:41,330
Then, we want to be able to create a

192
00:06:41,330 --> 00:06:42,050
number of

193
00:06:42,050 --> 00:06:43,789
theories that might solve this particular

194
00:06:43,789 --> 00:06:46,759
problem. And once we have this list, we

195
00:06:46,759 --> 00:06:48,379
want to be able to put it in the lab and

196
00:06:48,379 --> 00:06:50,389
try testing each one of these theories

197
00:06:50,389 --> 00:06:52,340
until we find the one that actually

198
00:06:52,340 --> 00:06:55,009
resolves the issue. From there, we can

199
00:06:55,009 --> 00:06:57,169
create a plan of action and document any

200
00:06:57,169 --> 00:06:59,599
possible problems that might occur. We

201
00:06:59,599 --> 00:07:01,280
can then get a time to implement the

202
00:07:01,280 --> 00:07:03,530
fix and put it into our production

203
00:07:03,530 --> 00:07:05,659
environment. And then we can verify and

204
00:07:05,659 --> 00:07:07,789
test and make sure that the entire

205
00:07:07,789 --> 00:07:10,729
system is now working as expected. And, of

206
00:07:10,729 --> 00:07:12,199
course, finally, we want to document

207
00:07:12,199 --> 00:07:14,780
everything that we did from the very

208
00:07:14,780 --> 00:07:16,729
beginning of our troubleshooting process

209
00:07:16,729 --> 00:07:19,387
all the way through to the end.