WEBVTT 00:00:01.550 --> 00:00:03.260 When you're troubleshooting complex 00:00:03.260 --> 00:00:05.330 network problems, you may find that the 00:00:05.330 --> 00:00:07.460 resolution is not as obvious as you 00:00:07.460 --> 00:00:09.440 might hope. In this video, we're going to 00:00:09.440 --> 00:00:11.750 step through a methodology that should 00:00:11.750 --> 00:00:13.610 help you troubleshoot any problem you 00:00:13.610 --> 00:00:16.609 run into. This is the flowchart of that 00:00:16.609 --> 00:00:18.650 network troubleshooting methodology, and 00:00:18.650 --> 00:00:20.660 we're going to step through each section 00:00:20.660 --> 00:00:23.120 of this flow and describe how it can 00:00:23.120 --> 00:00:24.980 help you solve those really difficult 00:00:24.980 --> 00:00:27.619 problems. The first thing you want to do 00:00:27.619 --> 00:00:29.930 is identify the problem. This may not be 00:00:29.930 --> 00:00:32.270 as straightforward as you might think. 00:00:32.270 --> 00:00:33.890 We first need to collect as much 00:00:33.890 --> 00:00:36.170 information as possible about the issue 00:00:36.170 --> 00:00:38.270 that's occurring. In the best possible 00:00:38.270 --> 00:00:39.920 scenario, you'll be able to duplicate 00:00:39.920 --> 00:00:42.260 this problem on demand. This will help 00:00:42.260 --> 00:00:43.760 later as we go through a number of 00:00:43.760 --> 00:00:45.950 testing phases to make sure that we are 00:00:45.950 --> 00:00:48.290 able to resolve this issue. When a 00:00:48.290 --> 00:00:50.149 problem happens on the network, it 00:00:50.149 --> 00:00:52.610 usually affects more than one device, and 00:00:52.610 --> 00:00:55.130 sometimes it affects those devices in 00:00:55.130 --> 00:00:57.020 different ways. You want to be sure to 00:00:57.020 --> 00:00:59.149 document all of the symptoms that may be 00:00:59.149 --> 00:01:01.219 occurring. 
Even if they are very 00:01:01.219 --> 00:01:03.440 different between different devices, you 00:01:03.440 --> 00:01:05.180 may find that a single problem is 00:01:05.180 --> 00:01:07.009 causing all of these different symptoms 00:01:07.009 --> 00:01:09.799 across these different devices. Many 00:01:09.799 --> 00:01:11.719 times, these issues will be identified by 00:01:11.719 --> 00:01:13.609 the end users, so they may be able to 00:01:13.609 --> 00:01:15.979 provide you with a lot more detail about 00:01:15.979 --> 00:01:17.539 what's really happening. You should 00:01:17.539 --> 00:01:19.310 question your users to find out what 00:01:19.310 --> 00:01:21.049 they're seeing and if any error messages 00:01:21.049 --> 00:01:23.240 are appearing. In this course, we've 00:01:23.240 --> 00:01:24.979 already discussed the importance of the 00:01:24.979 --> 00:01:27.020 change control process and knowing 00:01:27.020 --> 00:01:28.670 exactly what is changing in your 00:01:28.670 --> 00:01:31.279 environment. Without some type of formal 00:01:31.279 --> 00:01:33.259 change control process, someone may be 00:01:33.259 --> 00:01:35.450 able to make an unscheduled change that 00:01:35.450 --> 00:01:37.310 would affect many different people. So, 00:01:37.310 --> 00:01:39.289 when an error or network problem occurs, 00:01:39.289 --> 00:01:41.539 you may want to find out what the 00:01:41.539 --> 00:01:43.609 last thing was that changed on this network 00:01:43.609 --> 00:01:45.229 that could have affected all of these 00:01:45.229 --> 00:01:47.659 users. There's also going to be times 00:01:47.659 --> 00:01:49.729 when you're examining a number of 00:01:49.729 --> 00:01:52.039 different problems that may not actually 00:01:52.039 --> 00:01:54.439 be related to each other. 
It's always 00:01:54.439 --> 00:01:56.149 best to separate all of these different 00:01:56.149 --> 00:01:58.399 issues out so that you can approach and 00:01:58.399 --> 00:02:01.179 try to resolve each issue individually. 00:02:01.179 --> 00:02:03.409 Now that you've collected as much 00:02:03.409 --> 00:02:04.939 information as possible, 00:02:04.939 --> 00:02:07.310 you can examine all of these details to 00:02:07.310 --> 00:02:09.680 begin establishing a theory of what you 00:02:09.680 --> 00:02:11.810 think might be going wrong. Since the 00:02:11.810 --> 00:02:14.030 simpler explanation is often the most 00:02:14.030 --> 00:02:15.110 likely reason 00:02:15.110 --> 00:02:17.540 for the issue, that may be a good place 00:02:17.540 --> 00:02:19.580 to start. But, of course, you'll want to 00:02:19.580 --> 00:02:21.890 consider every possible thing that might 00:02:21.890 --> 00:02:24.080 be causing this issue. Maybe start with 00:02:24.080 --> 00:02:26.000 things that aren't completely obvious. 00:02:26.000 --> 00:02:28.400 You could start from the top of the OSI 00:02:28.400 --> 00:02:30.350 model with the way the application is 00:02:30.350 --> 00:02:32.390 working and work your way to the bottom. 00:02:32.390 --> 00:02:34.280 Or, you may want to start with the bottom 00:02:34.280 --> 00:02:36.260 with the cabling and wiring in your 00:02:36.260 --> 00:02:38.330 infrastructure and work your way up from 00:02:38.330 --> 00:02:40.600 there. You'll want to list out every 00:02:40.600 --> 00:02:43.490 possible cause for this problem. Your 00:02:43.490 --> 00:02:45.170 list might start with the easy theories 00:02:45.170 --> 00:02:47.240 at the top, but of course, include all of 00:02:47.240 --> 00:02:49.160 the more complex theories in this list 00:02:49.160 --> 00:02:51.800 as well. Now that we have a list of 00:02:51.800 --> 00:02:54.110 theories on how to resolve this issue, we 00:02:54.110 --> 00:02:56.209 can now test those theories. 
We may want 00:02:56.209 --> 00:02:58.489 to go into a lab. And if we are able to 00:02:58.489 --> 00:03:00.920 recreate this problem in the lab, then we 00:03:00.920 --> 00:03:03.800 can apply each theory until we find the 00:03:03.800 --> 00:03:05.840 one that happens to resolve the issue. If 00:03:05.840 --> 00:03:08.150 you tried the first theory, you may want 00:03:08.150 --> 00:03:10.070 to reset everything and try the second 00:03:10.070 --> 00:03:12.050 theory or the third. And if you run out 00:03:12.050 --> 00:03:13.790 of theories, you may want to go back and 00:03:13.790 --> 00:03:15.500 think of other things that might be 00:03:15.500 --> 00:03:17.540 causing this problem. This might be a 00:03:17.540 --> 00:03:19.340 good time to bring in an expert who 00:03:19.340 --> 00:03:20.840 knows about the application or the 00:03:20.840 --> 00:03:22.700 infrastructure, and they can give some 00:03:22.700 --> 00:03:24.890 theories and possible resolutions to 00:03:24.890 --> 00:03:27.590 test in the lab. Once you've tested a 00:03:27.590 --> 00:03:29.299 theory and found that the theory is 00:03:29.299 --> 00:03:31.489 going to resolve this issue, you can then 00:03:31.489 --> 00:03:33.380 begin putting together a plan of action. 00:03:33.380 --> 00:03:35.600 This is how you would implement this fix 00:03:35.600 --> 00:03:38.060 into a production network. You want to be 00:03:38.060 --> 00:03:39.920 sure that you're able to do this with a 00:03:39.920 --> 00:03:41.780 minimum amount of impact to the 00:03:41.780 --> 00:03:43.880 production network. And sometimes, you 00:03:43.880 --> 00:03:45.860 have to do this after hours when nobody 00:03:45.860 --> 00:03:48.260 else is working on the network. You want 00:03:48.260 --> 00:03:49.640 to be able to implement this with a 00:03:49.640 --> 00:03:52.250 minimum amount of impact to production 00:03:52.250 --> 00:03:54.380 traffic. So often, you'll have to do this 00:03:54.380 --> 00:03:57.410 after hours. 
A best practice is to 00:03:57.410 --> 00:03:59.510 document the exact steps that will be 00:03:59.510 --> 00:04:00.830 required to solve this particular 00:04:00.830 --> 00:04:03.560 problem. If it's replacing a cable, then 00:04:03.560 --> 00:04:04.970 the process will be relatively 00:04:04.970 --> 00:04:06.680 straightforward. But if you're upgrading 00:04:06.680 --> 00:04:08.989 software in a switch, a router, or a 00:04:08.989 --> 00:04:11.750 firewall, there may be additional tasks 00:04:11.750 --> 00:04:13.579 involved in performing this plan of 00:04:13.579 --> 00:04:15.500 action. You'll also want some 00:04:15.500 --> 00:04:17.540 alternatives if your plan doesn't go as 00:04:17.540 --> 00:04:19.700 designed. For example, you may run into 00:04:19.700 --> 00:04:21.590 problems when upgrading the software in 00:04:21.590 --> 00:04:23.599 a firewall. So, you may need an additional 00:04:23.599 --> 00:04:26.150 firewall or a way to roll back to the 00:04:26.150 --> 00:04:27.250 previous version. 00:04:27.250 --> 00:04:28.580 Now that you've 00:04:28.580 --> 00:04:30.229 documented your plan of action, you can 00:04:30.229 --> 00:04:31.729 take that to your change control team, 00:04:31.729 --> 00:04:33.770 and they can give you a window when you 00:04:33.770 --> 00:04:35.900 can implement that change. The actual 00:04:35.900 --> 00:04:38.240 fixing of the issue is probably going to 00:04:38.240 --> 00:04:40.769 be during off hours, during non-production 00:04:40.769 --> 00:04:41.990 times, and you may need to 00:04:41.990 --> 00:04:43.729 bring in other people to assist, 00:04:43.729 --> 00:04:46.539 especially if your window is very small. 00:04:46.539 --> 00:04:48.860 Once you have executed on your plan of 00:04:48.860 --> 00:04:50.719 action, your job isn't done yet. 00:04:50.719 --> 00:04:52.520 We need to make sure that all of these 00:04:52.520 --> 00:04:55.460 changes actually resolve the problem. 
So, 00:04:55.460 --> 00:04:56.750 now that the changes have been 00:04:56.750 --> 00:04:59.000 implemented, we now need to perform some 00:04:59.000 --> 00:05:01.190 tests. We may want to bring in the end 00:05:01.190 --> 00:05:03.139 users who first experienced this problem 00:05:03.139 --> 00:05:05.719 so that they can run through exactly the 00:05:05.719 --> 00:05:07.789 same scenario to tell you if the problem 00:05:07.789 --> 00:05:10.530 is resolved or if the problem still exists. 00:05:10.530 --> 00:05:12.379 This might also be a good time to 00:05:12.379 --> 00:05:14.240 implement some preventive measures. That 00:05:14.240 --> 00:05:16.460 way, we can either be informed that the 00:05:16.460 --> 00:05:18.430 problem is occurring, or we can provide 00:05:18.430 --> 00:05:20.569 alternatives that we can implement if 00:05:20.569 --> 00:05:23.719 that problem happens again. After the 00:05:23.719 --> 00:05:25.430 problem has been resolved, this is a 00:05:25.430 --> 00:05:27.439 perfect time to document the entire 00:05:27.439 --> 00:05:29.750 process from the very beginning to the 00:05:29.750 --> 00:05:31.639 very end. You'll, of course, want to 00:05:31.639 --> 00:05:33.379 provide as much information as possible. 00:05:33.379 --> 00:05:35.779 So, if somebody runs into this issue 00:05:35.779 --> 00:05:37.940 again, they can simply search your 00:05:37.940 --> 00:05:40.159 knowledge base, find that particular error 00:05:40.159 --> 00:05:42.050 that popped up, and know exactly the 00:05:42.050 --> 00:05:44.529 process you used to solve this last time. 00:05:44.529 --> 00:05:47.539 Many organizations have a help desk with 00:05:47.539 --> 00:05:49.250 case notes that they can reference, or 00:05:49.250 --> 00:05:50.930 you might have a separate knowledge base 00:05:50.930 --> 00:05:53.000 or wiki that you create where you're 00:05:53.000 --> 00:05:54.440 storing all of this important 00:05:54.440 --> 00:05:57.349 information for the future. 
A document 00:05:57.349 --> 00:05:58.969 that was created a number of years ago 00:05:58.969 --> 00:06:00.919 but still shows the importance of 00:06:00.919 --> 00:06:02.900 keeping this documentation over time is 00:06:02.900 --> 00:06:04.430 from Google Research, where they 00:06:04.430 --> 00:06:06.830 documented the failure trends in a large 00:06:06.830 --> 00:06:09.229 disk drive population. And because they 00:06:09.229 --> 00:06:11.449 were keeping extensive data over a long 00:06:11.449 --> 00:06:13.789 period of time, they were able to tell 00:06:13.789 --> 00:06:16.250 when a drive was starting to fail based 00:06:16.250 --> 00:06:17.990 on the types of errors that they were 00:06:17.990 --> 00:06:20.360 receiving. Being able to store all of 00:06:20.360 --> 00:06:22.039 this important information, being 00:06:22.039 --> 00:06:23.990 able to go back in time to see what 00:06:23.990 --> 00:06:26.270 happened, becomes a very important part 00:06:26.270 --> 00:06:28.629 of maintaining a network for the future. 00:06:28.629 --> 00:06:31.159 Let's summarize this troubleshooting 00:06:31.159 --> 00:06:33.050 methodology. We start with gathering as 00:06:33.050 --> 00:06:35.449 much information as possible, asking 00:06:35.449 --> 00:06:37.190 users about what they're seeing, and 00:06:37.190 --> 00:06:39.500 documenting any specific error messages. 00:06:39.500 --> 00:06:41.330 Then, we want to be able to create a 00:06:41.330 --> 00:06:42.050 number of 00:06:42.050 --> 00:06:43.789 theories that might solve this particular 00:06:43.789 --> 00:06:46.759 problem. And once we have this list, we 00:06:46.759 --> 00:06:48.379 want to be able to put it in the lab and 00:06:48.379 --> 00:06:50.389 try testing each one of these theories 00:06:50.389 --> 00:06:52.340 until we find the one that actually 00:06:52.340 --> 00:06:55.009 resolves the issue. 
From there, we can 00:06:55.009 --> 00:06:57.169 create a plan of action and document any 00:06:57.169 --> 00:06:59.599 possible problems that might occur. We 00:06:59.599 --> 00:07:01.280 can then get a time to implement the 00:07:01.280 --> 00:07:03.530 fix and put it into our production 00:07:03.530 --> 00:07:05.659 environment. And then we can verify and 00:07:05.659 --> 00:07:07.789 test and make sure that the entire 00:07:07.789 --> 00:07:10.729 system is now working as expected. And, of 00:07:10.729 --> 00:07:12.199 course, finally, we want to document 00:07:12.199 --> 00:07:14.780 everything that we did from the very 00:07:14.780 --> 00:07:16.729 beginning of our troubleshooting process 00:07:16.729 --> 00:07:19.387 all the way through to the end.