1 00:00:01,550 --> 00:00:03,260 When you're troubleshooting complex 2 00:00:03,260 --> 00:00:05,330 network problems, you may find that the 3 00:00:05,330 --> 00:00:07,460 resolution is not as obvious as you 4 00:00:07,460 --> 00:00:09,440 might hope. In this video, we're going to 5 00:00:09,440 --> 00:00:11,750 step through a methodology that should 6 00:00:11,750 --> 00:00:13,610 help you troubleshoot any problem you 7 00:00:13,610 --> 00:00:16,609 run into. This is the flowchart of that 8 00:00:16,609 --> 00:00:18,650 network troubleshooting methodology, and 9 00:00:18,650 --> 00:00:20,660 we're going to step through each section 10 00:00:20,660 --> 00:00:23,120 of this flow and describe how it can 11 00:00:23,120 --> 00:00:24,980 help you solve those really difficult 12 00:00:24,980 --> 00:00:27,619 problems. The first thing you want to do 13 00:00:27,619 --> 00:00:29,930 is identify the problem. This may not be 14 00:00:29,930 --> 00:00:32,270 as straightforward as you might think. 15 00:00:32,270 --> 00:00:33,890 We first need to collect as much 16 00:00:33,890 --> 00:00:36,170 information as possible about the issue 17 00:00:36,170 --> 00:00:38,270 that's occurring. In the best possible 18 00:00:38,270 --> 00:00:39,920 scenario, you'll be able to duplicate 19 00:00:39,920 --> 00:00:42,260 this problem on demand. This will help 20 00:00:42,260 --> 00:00:43,760 later as we go through a number of 21 00:00:43,760 --> 00:00:45,950 testing phases to make sure that we are 22 00:00:45,950 --> 00:00:48,290 able to resolve this issue. When a 23 00:00:48,290 --> 00:00:50,149 problem happens on the network, it 24 00:00:50,149 --> 00:00:52,610 usually affects more than one device, and 25 00:00:52,610 --> 00:00:55,130 sometimes it affects those devices in 26 00:00:55,130 --> 00:00:57,020 different ways. You want to be sure to 27 00:00:57,020 --> 00:00:59,149 document all of the symptoms that may be 28 00:00:59,149 --> 00:01:01,219 occurring. 
Even if they are very 29 00:01:01,219 --> 00:01:03,440 different between different devices, you 30 00:01:03,440 --> 00:01:05,180 may find that a single problem is 31 00:01:05,180 --> 00:01:07,009 causing all of these different symptoms 32 00:01:07,009 --> 00:01:09,799 across these different devices. Many 33 00:01:09,799 --> 00:01:11,719 times, these issues will be identified by 34 00:01:11,719 --> 00:01:13,609 the end users, so they may be able to 35 00:01:13,609 --> 00:01:15,979 provide you with a lot more detail about 36 00:01:15,979 --> 00:01:17,539 what's really happening. You should 37 00:01:17,539 --> 00:01:19,310 question your users to find out what 38 00:01:19,310 --> 00:01:21,049 they're seeing and if any error messages 39 00:01:21,049 --> 00:01:23,240 are appearing. In this course, we've 40 00:01:23,240 --> 00:01:24,979 already discussed the importance of the 41 00:01:24,979 --> 00:01:27,020 change control process and knowing 42 00:01:27,020 --> 00:01:28,670 exactly what is changing in your 43 00:01:28,670 --> 00:01:31,279 environment. Without some type of formal 44 00:01:31,279 --> 00:01:33,259 change control process, someone may be 45 00:01:33,259 --> 00:01:35,450 able to make an unscheduled change that 46 00:01:35,450 --> 00:01:37,310 would affect many different people. So, 47 00:01:37,310 --> 00:01:39,289 when an error or network problem occurs, 48 00:01:39,289 --> 00:01:41,539 you may want to find out what the 49 00:01:41,539 --> 00:01:43,609 last thing was that changed on this network 50 00:01:43,609 --> 00:01:45,229 that could have affected all of these 51 00:01:45,229 --> 00:01:47,659 users. There's also going to be times 52 00:01:47,659 --> 00:01:49,729 when you're examining a number of 53 00:01:49,729 --> 00:01:52,039 different problems that may not actually 54 00:01:52,039 --> 00:01:54,439 be related to each other. 
It's always 55 00:01:54,439 --> 00:01:56,149 best to separate all of these different 56 00:01:56,149 --> 00:01:58,399 issues out so that you can approach and 57 00:01:58,399 --> 00:02:01,179 try to resolve each issue individually. 58 00:02:01,179 --> 00:02:03,409 Now that you've collected as much 59 00:02:03,409 --> 00:02:04,939 information as possible, 60 00:02:04,939 --> 00:02:07,310 you can examine all of these details to 61 00:02:07,310 --> 00:02:09,680 begin establishing a theory of what you 62 00:02:09,680 --> 00:02:11,810 think might be going wrong. Since the 63 00:02:11,810 --> 00:02:14,030 simpler explanation is often the most 64 00:02:14,030 --> 00:02:15,110 likely reason 65 00:02:15,110 --> 00:02:17,540 for the issue, that may be a good place 66 00:02:17,540 --> 00:02:19,580 to start. But, of course, you'll want to 67 00:02:19,580 --> 00:02:21,890 consider every possible thing that might 68 00:02:21,890 --> 00:02:24,080 be causing this issue. Maybe even include 69 00:02:24,080 --> 00:02:26,000 things that aren't completely obvious. 70 00:02:26,000 --> 00:02:28,400 You could start from the top of the OSI 71 00:02:28,400 --> 00:02:30,350 model with the way the application is 72 00:02:30,350 --> 00:02:32,390 working and work your way to the bottom. 73 00:02:32,390 --> 00:02:34,280 Or, you may want to start with the bottom 74 00:02:34,280 --> 00:02:36,260 with the cabling and wiring in your 75 00:02:36,260 --> 00:02:38,330 infrastructure and work your way up from 76 00:02:38,330 --> 00:02:40,600 there. You'll want to list out every 77 00:02:40,600 --> 00:02:43,490 possible cause for this problem. Your 78 00:02:43,490 --> 00:02:45,170 list might start with the easy theories 79 00:02:45,170 --> 00:02:47,240 at the top, but of course, include all of 80 00:02:47,240 --> 00:02:49,160 the more complex theories in this list 81 00:02:49,160 --> 00:02:51,800 as well. 
Now that we have a list of 82 00:02:51,800 --> 00:02:54,110 theories on how to resolve this issue, we 83 00:02:54,110 --> 00:02:56,209 can now test those theories. We may want 84 00:02:56,209 --> 00:02:58,489 to go into a lab. And if we are able to 85 00:02:58,489 --> 00:03:00,920 recreate this problem in the lab, then we 86 00:03:00,920 --> 00:03:03,800 can apply each theory until we find the 87 00:03:03,800 --> 00:03:05,840 one that happens to resolve the issue. If 88 00:03:05,840 --> 00:03:08,150 you tried the first theory, you may want 89 00:03:08,150 --> 00:03:10,070 to reset everything and try the second 90 00:03:10,070 --> 00:03:12,050 theory or the third. And if you run out 91 00:03:12,050 --> 00:03:13,790 of theories, you may want to go back and 92 00:03:13,790 --> 00:03:15,500 think of other things that might be 93 00:03:15,500 --> 00:03:17,540 causing this problem. This might be a 94 00:03:17,540 --> 00:03:19,340 good time to bring in an expert who 95 00:03:19,340 --> 00:03:20,840 knows about the application or the 96 00:03:20,840 --> 00:03:22,700 infrastructure, and they can give some 97 00:03:22,700 --> 00:03:24,890 theories and possible resolutions to 98 00:03:24,890 --> 00:03:27,590 test in the lab. Once you've tested a 99 00:03:27,590 --> 00:03:29,299 theory and found that the theory is 100 00:03:29,299 --> 00:03:31,489 going to resolve this issue, you can then 101 00:03:31,489 --> 00:03:33,380 begin putting together a plan of action. 102 00:03:33,380 --> 00:03:35,600 This is how you would implement this fix 103 00:03:35,600 --> 00:03:38,060 into a production network. You want to be 104 00:03:38,060 --> 00:03:39,920 sure that you're able to do this with a 105 00:03:39,920 --> 00:03:41,780 minimum amount of impact to the 106 00:03:41,780 --> 00:03:43,880 production network. And sometimes, you 107 00:03:43,880 --> 00:03:45,860 have to do this after hours when nobody 108 00:03:45,860 --> 00:03:48,260 else is working on the network. 
You want 109 00:03:48,260 --> 00:03:49,640 to be able to implement this with a 110 00:03:49,640 --> 00:03:52,250 minimum amount of impact to production 111 00:03:52,250 --> 00:03:54,380 traffic. So often, you'll have to do this 112 00:03:54,380 --> 00:03:57,410 after hours. A best practice is to 113 00:03:57,410 --> 00:03:59,510 document the exact steps that will be 114 00:03:59,510 --> 00:04:00,830 required to solve this particular 115 00:04:00,830 --> 00:04:03,560 problem. If it's replacing a cable, then 116 00:04:03,560 --> 00:04:04,970 the process will be relatively 117 00:04:04,970 --> 00:04:06,680 straightforward. But if you're upgrading 118 00:04:06,680 --> 00:04:08,989 software in a switch, a router, or a 119 00:04:08,989 --> 00:04:11,750 firewall, there may be additional tasks 120 00:04:11,750 --> 00:04:13,579 involved in performing this plan of 121 00:04:13,579 --> 00:04:15,500 action. You'll also want some 122 00:04:15,500 --> 00:04:17,540 alternatives if your plan doesn't go as 123 00:04:17,540 --> 00:04:19,700 designed. For example, you may run into 124 00:04:19,700 --> 00:04:21,590 problems when upgrading the software in 125 00:04:21,590 --> 00:04:23,599 a firewall. So, you may need an additional 126 00:04:23,599 --> 00:04:26,150 firewall or a way to roll back to the 127 00:04:26,150 --> 00:04:27,250 previous version. 128 00:04:27,250 --> 00:04:28,580 Now that you've 129 00:04:28,580 --> 00:04:30,229 documented your plan of action, you can 130 00:04:30,229 --> 00:04:31,729 take that to your change control team, 131 00:04:31,729 --> 00:04:33,770 and they can give you a window when you 132 00:04:33,770 --> 00:04:35,900 can implement that change. 
The actual 133 00:04:35,900 --> 00:04:38,240 fixing of the issue is probably going to 134 00:04:38,240 --> 00:04:40,769 be during off hours, during non-production 135 00:04:40,769 --> 00:04:41,990 times, and you may need to 136 00:04:41,990 --> 00:04:43,729 bring in other people to assist, 137 00:04:43,729 --> 00:04:46,539 especially if your window is very small. 138 00:04:46,539 --> 00:04:48,860 Once you have executed on your plan of 139 00:04:48,860 --> 00:04:50,719 action, your job isn't done yet. 140 00:04:50,719 --> 00:04:52,520 We need to make sure that all of these 141 00:04:52,520 --> 00:04:55,460 changes actually resolve the problem. So, 142 00:04:55,460 --> 00:04:56,750 now that the changes have been 143 00:04:56,750 --> 00:04:59,000 implemented, we now need to perform some 144 00:04:59,000 --> 00:05:01,190 tests. We may want to bring in the end 145 00:05:01,190 --> 00:05:03,139 users who first experienced this problem 146 00:05:03,139 --> 00:05:05,719 so that they can run through exactly the 147 00:05:05,719 --> 00:05:07,789 same scenario to tell you if the problem 148 00:05:07,789 --> 00:05:10,530 is resolved or if the problem still exists. 149 00:05:10,530 --> 00:05:12,379 This might also be a good time to 150 00:05:12,379 --> 00:05:14,240 implement some preventive measures. That 151 00:05:14,240 --> 00:05:16,460 way, we can either be informed that the 152 00:05:16,460 --> 00:05:18,430 problem is occurring, or we can provide 153 00:05:18,430 --> 00:05:20,569 alternatives that we can implement if 154 00:05:20,569 --> 00:05:23,719 that problem happens again. After the 155 00:05:23,719 --> 00:05:25,430 problem has been resolved, this is a 156 00:05:25,430 --> 00:05:27,439 perfect time to document the entire 157 00:05:27,439 --> 00:05:29,750 process from the very beginning to the 158 00:05:29,750 --> 00:05:31,639 very end. You'll, of course, want to 159 00:05:31,639 --> 00:05:33,379 provide as much information as possible. 
160 00:05:33,379 --> 00:05:35,779 So, if somebody runs into this issue 161 00:05:35,779 --> 00:05:37,940 again, they can simply search your 162 00:05:37,940 --> 00:05:40,159 knowledge base, find that particular error 163 00:05:40,159 --> 00:05:42,050 that popped up, and know exactly the 164 00:05:42,050 --> 00:05:44,529 process you used to solve this last time. 165 00:05:44,529 --> 00:05:47,539 Many organizations have a help desk with 166 00:05:47,539 --> 00:05:49,250 case notes that they can reference, or 167 00:05:49,250 --> 00:05:50,930 you might have a separate knowledge base 168 00:05:50,930 --> 00:05:53,000 or wiki that you create where you're 169 00:05:53,000 --> 00:05:54,440 storing all of this important 170 00:05:54,440 --> 00:05:57,349 information for the future. A document 171 00:05:57,349 --> 00:05:58,969 that was created a number of years ago 172 00:05:58,969 --> 00:06:00,919 but still shows the importance of 173 00:06:00,919 --> 00:06:02,900 keeping this documentation over time is 174 00:06:02,900 --> 00:06:04,430 from Google Research, where they 175 00:06:04,430 --> 00:06:06,830 documented the failure trends in a large 176 00:06:06,830 --> 00:06:09,229 disk drive population. And because they 177 00:06:09,229 --> 00:06:11,449 were keeping extensive data over a long 178 00:06:11,449 --> 00:06:13,789 period of time, they were able to tell 179 00:06:13,789 --> 00:06:16,250 when a drive was starting to fail based 180 00:06:16,250 --> 00:06:17,990 on the types of errors that they were 181 00:06:17,990 --> 00:06:20,360 receiving. Being able to store all of 182 00:06:20,360 --> 00:06:22,039 this important information, being 183 00:06:22,039 --> 00:06:23,990 able to go back in time to see what 184 00:06:23,990 --> 00:06:26,270 happened, becomes a very important part 185 00:06:26,270 --> 00:06:28,629 of maintaining a network for the future. 186 00:06:28,629 --> 00:06:31,159 Let's summarize this troubleshooting 187 00:06:31,159 --> 00:06:33,050 methodology. 
We start with gathering as 188 00:06:33,050 --> 00:06:35,449 much information as possible, asking 189 00:06:35,449 --> 00:06:37,190 users about what they're seeing, and 190 00:06:37,190 --> 00:06:39,500 documenting any specific error messages. 191 00:06:39,500 --> 00:06:41,330 Then, we want to be able to create a 192 00:06:41,330 --> 00:06:42,050 number of 193 00:06:42,050 --> 00:06:43,789 theories that might solve this particular 194 00:06:43,789 --> 00:06:46,759 problem. And once we have this list, we 195 00:06:46,759 --> 00:06:48,379 want to be able to put it in the lab and 196 00:06:48,379 --> 00:06:50,389 try testing each one of these theories 197 00:06:50,389 --> 00:06:52,340 until we find the one that actually 198 00:06:52,340 --> 00:06:55,009 resolves the issue. From there, we can 199 00:06:55,009 --> 00:06:57,169 create a plan of action and document any 200 00:06:57,169 --> 00:06:59,599 possible problems that might occur. We 201 00:06:59,599 --> 00:07:01,280 can then get a time to implement the 202 00:07:01,280 --> 00:07:03,530 fix and put it into our production 203 00:07:03,530 --> 00:07:05,659 environment. And then we can verify and 204 00:07:05,659 --> 00:07:07,789 test and make sure that the entire 205 00:07:07,789 --> 00:07:10,729 system is now working as expected. And, of 206 00:07:10,729 --> 00:07:12,199 course, finally, we want to document 207 00:07:12,199 --> 00:07:14,780 everything that we did from the very 208 00:07:14,780 --> 00:07:16,729 beginning of our troubleshooting process 209 00:07:16,729 --> 00:07:19,387 all the way through to the end.
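If it helps to see the "test each theory until one resolves the issue" step written down, here is a minimal Python sketch of it. This is only an illustration, not part of the methodology itself: the names Theory, CaseRecord, and run_lab_tests are hypothetical, the sample symptom and theories are made up, and each lab test is a stand-in function that simply returns True or False.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Theory:
    """One possible cause, paired with a lab test (hypothetical stand-in)."""
    description: str
    resolves_issue: Callable[[], bool]  # True if applying this theory in the lab fixes the problem


@dataclass
class CaseRecord:
    """Notes kept from the beginning to the end, for the knowledge base."""
    symptoms: List[str]
    notes: List[str] = field(default_factory=list)
    confirmed_cause: Optional[str] = None


def run_lab_tests(record: CaseRecord, theories: List[Theory]) -> Optional[Theory]:
    """Try each theory in list order and stop at the first one that resolves the issue."""
    for theory in theories:
        worked = theory.resolves_issue()
        outcome = "resolved" if worked else "no change"
        record.notes.append(f"Tested: {theory.description} -> {outcome}")
        if worked:
            record.confirmed_cause = theory.description
            return theory
    # Out of theories: go back, gather more information, or bring in an expert.
    record.notes.append("No theory resolved the issue; establish new theories")
    return None


# Made-up example: easy theories first, more complex ones after.
record = CaseRecord(symptoms=["Users cannot reach the file server"])
theories = [
    Theory("Bad patch cable", lambda: False),
    Theory("Switch port in the wrong VLAN", lambda: True),
    Theory("Firewall software bug", lambda: True),
]
winner = run_lab_tests(record, theories)
```

In this sketch, the record keeps a note for every theory that was tried, including the ones that did not work, which is the kind of detail you would want in your knowledge base later. A real lab test would, of course, involve resetting the environment between theories rather than calling a function.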