Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI and machine learning, which is predictive maintenance.

I will be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each of them. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is essentially a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance domain problem. All right, so without any further ado, let's jump into it.

Let's start off with a quick overview of AI and machine learning. Well, AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own without explicit human intervention. The primary purpose of these algorithms is to optimize performance in a specific task, and the main task we want to optimize performance in is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine learning is about making predictions about the future, or what we call predictive analytics.
And there are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of the different kinds of algorithms and their use cases in various areas of industry. We have various domain use cases for all these different kinds of algorithms, and we can see that different algorithms are suited to different use cases.

Deep learning is an advanced form of machine learning that's based on something called an artificial neural network, or ANN for short. It essentially simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools that you have probably heard of today. I'm sure you have heard of ChatGPT, if you haven't been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and that's based on this technology called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, use this particular form of machine learning called deep learning.

So this is an example of an artificial neural network. Here I have an image of a bird that's fed into the network, and the output is a classification of this image into one of three potential categories. In this case, if the ANN has been trained properly, when we feed in this image, the ANN should correctly classify it as a bird.
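Here's a minimal sketch of what such an image classifier could look like in code, assuming TensorFlow/Keras is installed; the 32x32 input size, the class names, and the random stand-in "image" below are illustrative placeholders, not the actual network from the slide:

```python
# Minimal image-classification ANN sketch. Assumes TensorFlow/Keras;
# input size, class names, and the fake image are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

class_names = ["bird", "cat", "dog"]  # three potential output categories

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),        # a 32x32 RGB image goes in
    layers.Flatten(),                      # unroll the pixels into one vector
    layers.Dense(128, activation="relu"),  # a hidden layer of "neurons"
    layers.Dense(len(class_names), activation="softmax"),  # one probability per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With a real labelled image set, training would be:
# model.fit(train_images, train_labels, epochs=10)

# The network classifies a new image by taking the highest-probability class.
fake_image = np.random.rand(1, 32, 32, 3)  # stand-in for a real photo
probs = model.predict(fake_image)
print(class_names[int(np.argmax(probs))])  # a well-trained model fed a bird photo prints "bird"
```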
So the bird example above is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision. And just as in machine learning generally, there is a variety of algorithms available for deep learning under the categories of supervised learning and unsupervised learning.

All right, so this is how we can categorize all of this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a subspecialization of machine learning using a particular architecture called an artificial neural network. And generative AI: if we talk about ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI, they are basically large language models, and they are a further subcategory within the area of deep learning.

And there are many applications of machine learning in industry right now, so pick whichever industry you are involved in, and these are the specific areas of application. I'm going to guess that the vast majority of you watching this video are probably coming from the manufacturing industry, and in the manufacturing industry some of the standard use cases for machine learning and deep learning are predicting potential problems. Sometimes we call this predictive maintenance, where you want to predict when a problem is going to happen and address it before it happens. And then there's monitoring systems, automating your manufacturing assembly or production line, smart scheduling, and detecting anomalies on your production line.

Okay, so let's talk about the use case here, which is predictive maintenance. So what is predictive maintenance?
Well, here's the long definition: predictive maintenance is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. It uses advanced data models, analytics, and machine learning, whereby we can reliably assess when failures are more likely to occur, including which components are more likely to be affected on your production or assembly line.

So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that factories, or production and assembly lines in factories, tended to handle maintenance issues, say, 10 or 20 years ago. You would probably start off with the most basic mode, which is reactive maintenance: you just wait until your machine breaks down and then you repair it. The simplest approach, but of course, if you have worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline. Then you're going to have a backlog of orders and run into a lot of problems.

Okay, so we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem then is that sometimes you're doing too much maintenance that isn't really necessary, and it still doesn't totally prevent a failure of the machine that occurs outside of your planned maintenance. So a bit of an improvement, but not that much better.
And then these last two categories are where we bring in AI and machine learning. With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us to predict, with a reasonable level of accuracy, when the next failure is going to happen on your assembly or production line, for a specific component or a specific machine. You want to be able to predict to a high level of accuracy, maybe down to the specific day, even the specific hour or minute, when you expect that particular machine or component to fail.

All right, so these are the advantages of predictive maintenance. It minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data that you're getting.

Okay, so we're going to take a look at some real-life use cases. These are a bunch of links here, and if you navigate to them, you'll be able to get a look at some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases, so you can click on these links and follow up with them if you want to read more: waste management, manufacturing, building services, renewable energy, and also mining. So if you want to know more about these use cases, you can read up on them from this website.
And this next website is a pretty good one. I would really encourage you to look through it if you're interested in predictive maintenance. Here it tells you about an industry survey of predictive maintenance. We can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, and that predictive maintenance is essential for the manufacturing industry and will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: the vast majority of key industry players in the manufacturing sector consider predictive maintenance to be a very important activity that they want to incorporate into their workflow.

And we can see here the kind of ROI expected on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. And best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, Ecoplant. So you can jump over here and take a look at some of these use cases. Let me try to open one up, for example, Mondi. You can see Mondi has used this particular piece of software called MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning.
And you can study how they have used it: how it works, what their challenge was, the problems they were facing, the solution they built with MathWorks Consulting, and the data that they collected in an Oracle database. So using MATLAB from MathWorks, they were able to create a deep learning model to solve this particular issue for their domain. If you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases of predictive maintenance.

Now in this topic, I'm going to talk about machine learning basics: what is actually involved in machine learning. I'm going to give a very quick, conceptual, high-level overview of machine learning. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. And let's talk about the most common and widely used category of machine learning, which is called supervised learning. The particular use case that I'm going to be discussing here, predictive maintenance, is basically a form of supervised learning.

So how does supervised learning work? Well, in supervised learning, you create a machine learning model by providing what is called a labelled data set as input to a machine learning program or algorithm. This data set contains what are called independent or feature variables, which will be a set of variables.
And there will be one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable.

All right, that's quite a mouthful, so let's jump into a diagram that illustrates this more clearly. Let's say you have a data set here, an Excel spreadsheet, with a bunch of columns and a bunch of rows. These rows are what we call observations, or samples, or data points in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall. They've got all this information about the customers who purchase products at this mall: their gender, their age, their income, and their number of children. All this information about the customers, we call the independent or feature variables. And alongside all this information about each customer, we also record how much the customer spends. These numbers, we call the target variable or the dependent variable. So one single row, one single sample, one single data point, contains all the data for the feature variables and one single value for the label or target variable.
And the primary purpose of the machine learning model is to create a mapping from all your feature variables to your target variable. There's going to be a mathematical function that maps all the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. This whole training process is what we call fitting the model. And the target variable or label, this column here, is critical for providing the context to do the fitting or training of the model.

Once you've got a trained and fitted model, you can then use it to make an accurate prediction of the target values corresponding to new feature values that the model has yet to encounter, and this, as I've already said, is called predictive analytics. So let's see what's actually happening here. You take your training data, this whole data set consisting of a thousand rows, ten thousand rows of data, and you feed it into your machine learning algorithm, and a couple of hours later your machine learning algorithm comes up with a model. The model is essentially a function that maps all your feature variables, which are these four columns here, to your target variable, which is this one single column here. Once you have the model, you can put in a new data point. The new data point represents data about a new customer, a customer that you have never seen before.
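Here's a minimal sketch of that train-then-predict flow, assuming scikit-learn and pandas; the tiny mall data set below is invented for illustration, not the actual spreadsheet from the slides:

```python
# Train a model on labelled customer data, then predict spend for a new customer.
# Assumes scikit-learn and pandas; the data values here are invented.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Labelled data set: feature variables plus the target variable ("spend").
data = pd.DataFrame({
    "gender":   [0, 1, 0, 1, 1, 0],          # 0 = female, 1 = male
    "age":      [23, 50, 31, 44, 60, 35],
    "income":   [40, 18, 55, 30, 22, 48],
    "children": [0, 9, 2, 3, 1, 2],
    "spend":    [120, 25, 90, 60, 30, 100],  # target: how much they spent
})

X = data[["gender", "age", "income", "children"]]  # independent / feature variables
y = data["spend"]                                  # dependent / target variable

model = RandomForestRegressor(random_state=42).fit(X, y)  # "fitting" the model

# A brand-new data point: a customer the model has never seen before.
new_customer = pd.DataFrame([{"gender": 1, "age": 50, "income": 18, "children": 9}])
print(model.predict(new_customer))  # the model's predicted spend for this customer
```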
So let's say you've already got information about 10,000 customers who have visited this mall and how much each of them spent while they were there. Now a totally new customer comes into the mall, one who has never been here before, and what we know about this customer is that he is male, his age is 50, his income is 18, and he has nine children. When you take this data and feed it into your model, your model is going to make a prediction. It's going to say: hey, based on everything I have been trained on and the model I've developed, I predict that a customer who is male, aged 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want. Right here: that is the final output of your machine learning model. It makes a prediction about something it has never seen before. That is the core, this is essentially the core of machine learning: predictive analytics, making predictions about the future based on a historical data set.

Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label. For classification you can have either binary or multiclass. Binary would be just true or false, zero or one: whether your machine is going to fail or not, just two classes, two possible outcomes, or whether the customer is going to make a purchase or not.
We call this binary classification. And then there's multiclass, when there are more than two classes or types of values.

So, for example, this here would be a classification problem. You have a data set with information about your customers: the gender of the customer, the age of the customer, the salary of the customer, and you also have a record of whether the customer made a purchase or not. You can take this data set to train a classification model, and the classification model can then make a prediction about a new customer: it will predict zero, which means the customer won't make a purchase, or one, which means the customer will make a purchase.

And this is regression. Let's say you want to predict the wind speed, and you've got historical data for four other independent or feature variables: you have recorded the temperature, the pressure, the relative humidity, and the wind direction for the past 10 or 15 days. You train your machine learning model using this data set, and the target variable column, the label, is basically a number. This gives you a regression model, so now you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will predict the wind speed for that new data point. So that's a regression model.
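Here's a minimal sketch of both kinds of models, assuming scikit-learn; the customer and weather numbers below are invented for illustration:

```python
# One classification model and one regression model with scikit-learn.
# All data values here are invented toy numbers.
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a categorical label, purchase (1) or no purchase (0).
customers = pd.DataFrame({
    "gender":    [0, 1, 1, 0, 1, 0],
    "age":       [25, 47, 52, 33, 61, 29],
    "salary":    [30, 80, 65, 45, 90, 38],
    "purchased": [0, 1, 1, 0, 1, 0],   # the two possible classes
})
clf = LogisticRegression().fit(customers[["gender", "age", "salary"]],
                               customers["purchased"])
new_customer = pd.DataFrame([{"gender": 1, "age": 40, "salary": 70}])
print(clf.predict(new_customer))       # [1] means "will purchase"

# Regression: predict a numerical target, the wind speed.
weather = pd.DataFrame({
    "temperature": [28, 30, 25, 27, 32],
    "pressure":    [1010, 1005, 1018, 1012, 1002],
    "humidity":    [70, 80, 60, 65, 85],
    "wind_dir":    [90, 180, 270, 0, 135],
    "wind_speed":  [12.0, 18.5, 8.2, 10.1, 21.3],  # numeric target
})
reg = LinearRegression().fit(weather.drop(columns="wind_speed"),
                             weather["wind_speed"])
new_day = pd.DataFrame([{"temperature": 29, "pressure": 1008,
                         "humidity": 75, "wind_dir": 45}])
print(reg.predict(new_day))            # predicted wind speed for the new data point
```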
All right. So in this particular topic I'm going to talk about the workflow that's involved in machine learning. In the previous slides, I talked about developing the model, but that's just one part of the entire workflow. In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is that you need to get your data, then you need to clean your data, and then you need to explore your data: you need to see what's going on in your data set. Real-life data sets are not trivial; they are hundreds of rows, thousands of rows, sometimes millions or billions of rows. We're talking about millions or billions of data points, especially if you're using IoT sensors to get data in real time. So you've got all these super-large data sets; you need to clean them and explore them, and then you need to prepare them into the right format so that you can put them into the training process to create your machine learning model.

Subsequently, you check how good the model is: how accurate it is in terms of its ability to generate predictions for the future, how accurate the predictions coming out of your machine learning model are. That's validating or evaluating your model. Then, if you determine that your model is of adequate accuracy to meet your domain use case requirements: say the accuracy required for your domain use case is 85%, and if my machine learning model can give an 85% accuracy rate, I think it's good enough, then I'm going to deploy it into a real-world use case. Here the machine learning model gets deployed on a server, data is captured from other data sources, that data is pumped into the machine learning model, the model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other particular scenario.
And then you constantly monitor and update the model: you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell.

Here's another example of the same thing in a slightly different format. Again, you have your data collection and preparation. Here we talk more about the different kinds of algorithms that are available to create a model, and I'll talk about this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. Once you have chosen the appropriate algorithm, trained your model, and selected the best trained model from among the candidates: you're probably going to develop multiple models from multiple algorithms, evaluate them all, and then say, hey, after I've evaluated and tested them, I've chosen the best model, and I'm going to deploy it for real-life production use. Real-life sensor data is pumped into my model, my model generates predictions, the predicted data is used immediately, in real time, for real-life decision making, and then I monitor the results. Somebody is using the predictions from my model: if the predictions are lousy, the monitoring system captures that; if the predictions are fantastic, that is also captured by the monitoring system, and it gets fed back into the next cycle of my machine learning pipeline.

Okay, so that's the overall view, and here are the key phases of your workflow.
So one of the important phases is called EDA, exploratory data analysis, and in this particular phase you're going to do a lot of work, primarily just to understand your data set. Like I said, real-life data sets tend to be very complex, and they have various statistical properties; statistics is a very important component of machine learning. So EDA helps you get an overview of your data set and of any problems in it, such as missing data, as well as the statistical properties of your data set, its distribution, the statistical correlation between variables in your data set, and so on. A minimal sketch of what this looks like in code follows below.
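For instance, a first-pass EDA with pandas might look like this; the file name is an assumption, and any tabular data set loaded into a DataFrame works the same way:

```python
# First-pass EDA on a pandas DataFrame (file name is an assumption).
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

print(df.shape)           # how many rows (samples) and columns (variables)
df.info()                 # column types and non-null counts: spots missing data
print(df.describe())      # statistical properties: mean, std, min/max, quartiles
print(df.isnull().sum())  # missing values per column
print(df.select_dtypes("number").corr())  # correlation between numeric variables
```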
Then we have data cleaning, or sometimes you call it data cleansing. In this phase you primarily want to do things like remove duplicate records or rows in your table, make sure that your data points or samples have appropriate IDs, and, most importantly, make sure there are not too many missing values in your data set. What I mean by missing values is things like this: you have a data set, and for some reason there are some cells or locations in it that are missing values. If you have a lot of these missing values, you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set and how to handle them. Another important part of data cleansing is figuring out the outliers in your data set. Outliers are data points that are very far from the general trend of data points in your data set.

And so there are several ways to detect outliers in your data set, and several ways to handle them, just as there are several ways to handle missing values. Handling missing values and handling outliers are really the two key parts of data cleansing, and there are many, many techniques to handle them, so a data scientist needs to be acquainted with all of this.

All right, why do I need to do data cleansing? Well, here is the key point. If you have a very poor-quality data set, which means you've got a lot of outliers, which are errors in your data set, or a lot of missing values, then even though you've got a fantastic algorithm and a fantastic model, the predictions that your model gives are going to be absolute rubbish. It's kind of like putting water into the tank of a Mercedes-Benz. The Mercedes-Benz is a great car, but if you put water into it, it will just die; it can't run on water. On the other hand, a Myvi is just a lousy car, but if you take high-octane, good petrol and put it into the Myvi, the Myvi will go at 100 miles an hour. It will completely destroy the Mercedes-Benz in terms of performance. So it doesn't really matter what model you're using: you can be using the most fantastic model, the Mercedes-Benz of machine learning, but if your data is lousy quality, your predictions are also going to be rubbish.
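Here's a minimal data-cleansing sketch, assuming pandas; the file name and the torque column name follow the Kaggle data set used later in the demo and are otherwise assumptions:

```python
# Common data-cleansing steps (file and column names are assumptions).
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

df = df.drop_duplicates()  # remove duplicate records/rows

# Handle missing values: drop rows that are mostly empty, impute the rest.
df = df.dropna(thresh=int(0.5 * df.shape[1]))  # keep rows with >= 50% of values present
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())  # median imputation

# Detect outliers with the IQR rule: points far outside the general trend.
q1, q3 = df["Torque [Nm]"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["Torque [Nm]"] < q1 - 1.5 * iqr) |
              (df["Torque [Nm]"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} torque outliers found")  # inspect before dropping or capping
```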
So cleansing the data set is, in fact, probably the most important thing that data scientists need to do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithms, and so on: that's really a small portion of the actual machine learning workflow. In the actual machine learning workflow, the vast majority of the time goes into cleaning and organizing your data.

Then you have something called feature engineering, where you preprocess the feature variables of your original data set prior to using them to train the model, either through addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you need to transform categorical data into numeric data.

Just now, in the earlier slides, I showed you that you take your original data set, pump it into an algorithm, and a couple of hours later you get a machine learning model. You didn't do anything to the feature variables in your data set before pumping it into the machine learning algorithm; you just took the data set exactly as it was. But that's not what generally happens in real life. In real life, you're going to take all the original feature variables from your data set and transform them in some way. You can see here the columns of data from my original data set, and before I actually put all these data points into my algorithm to train and get my model, I will transform them. This transformation of the feature variable values is what we call feature engineering. And there are many, many techniques for feature engineering: one-hot encoding, scaling, log transformation, discretization, date extraction, boolean logic, and so on. A sketch of a few of these follows below.
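Here's a minimal sketch of three of these techniques, assuming pandas and scikit-learn; the file and column names follow the Kaggle predictive maintenance data set used later and may differ in your copy:

```python
# Feature-engineering sketch: log transformation, one-hot encoding, scaling.
# File and column names are assumptions based on the Kaggle data set.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("predictive_maintenance.csv")

# Log transformation: compress a skewed numeric variable (log1p handles zeros).
df["Tool wear [min]"] = np.log1p(df["Tool wear [min]"])

# One-hot encoding: turn the categorical "Type" column into numeric 0/1 columns,
# since many models can only work with numeric data.
df = pd.get_dummies(df, columns=["Type"])

# Scaling: put the numeric features on a comparable scale (mean 0, std 1).
num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
```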
Okay, then finally we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set and the other is called the test data set. The primary purpose of this is that when we feed and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you are not going to have just one possible model; there is a vast range of algorithms that you can use to create a model. Fundamentally you have a wide range of choices, like a wide range of cars. If you want to buy a car, you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a beamer: many, many different cars are available to you. Same thing with a machine learning model: there is a vast variety of algorithms that you can choose from in order to create a model. So once you create a model from a given algorithm, you need to ask: how accurate is this model that I've created from this algorithm? Different algorithms are going to create different models with different rates of accuracy. And so the primary purpose of the test data set is to evaluate the accuracy of the model, to see whether this model I've created using this algorithm is adequate for use in a real-life production use case. So that's what it's all about.
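Here's a minimal sketch of that split-train-evaluate loop, assuming scikit-learn; the column names follow the Kaggle CSV used in the demo, and the 80/20 split is just a common default:

```python
# Train-test split and model evaluation (column names assume the Kaggle CSV).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("predictive_maintenance.csv")
X = pd.get_dummies(df.drop(columns=["UDI", "Product ID", "Target", "Failure Type"]))
y = df["Target"]  # 0 = no failure, 1 = failure

# Break the original data set into a training set and a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Train on the training set only.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Evaluate on data the model has never seen, and compare against your
# domain requirement (e.g. deploy only if accuracy >= 85%).
print(accuracy_score(y_test, model.predict(X_test)))
```

Comparing algorithms is then just a matter of repeating the fit-and-score step with a different model class and keeping the best performer.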
Okay, so this is my 799 00:30:52,320 --> 00:30:54,279 original dataset, I break it into my 800 00:30:54,279 --> 00:30:56,559 feature dataset and 801 00:30:56,559 --> 00:30:58,519 also my target variable column, so my 802 00:30:58,519 --> 00:31:00,639 feature variable columns and the target 803 00:31:00,639 --> 00:31:02,200 variable column, and then I further break 804 00:31:02,200 --> 00:31:04,240 it into a training dataset and a test 805 00:31:04,240 --> 00:31:06,600 dataset. The training dataset is used 806 00:31:06,600 --> 00:31:08,320 to train, to create, the machine learning 807 00:31:08,320 --> 00:31:10,480 model. And then once the machine learning 808 00:31:10,480 --> 00:31:12,200 model is created, I then use the test 809 00:31:12,200 --> 00:31:15,080 dataset to evaluate the accuracy of the 810 00:31:15,080 --> 00:31:17,259 machine learning model. 811 00:31:17,259 --> 00:31:21,000 All right. And then finally we can 812 00:31:21,000 --> 00:31:23,200 see what the different parts or 813 00:31:23,200 --> 00:31:26,080 aspects are that go into a successful model, 814 00:31:26,080 --> 00:31:29,519 so EDA about 10%, data cleansing about 815 00:31:29,519 --> 00:31:32,360 20%, feature engineering about 816 00:31:32,360 --> 00:31:36,320 25%, selecting a specific algorithm about 817 00:31:36,320 --> 00:31:39,120 10%, and then training the model from 818 00:31:39,120 --> 00:31:41,639 that algorithm about 15%, and then 819 00:31:41,639 --> 00:31:43,679 finally evaluating the model, deciding 820 00:31:43,679 --> 00:31:45,960 which is the best model with the highest 821 00:31:45,960 --> 00:31:51,819 accuracy rate, that's about 20%. 822 00:31:54,080 --> 00:31:56,919 All right, so we have reached the 823 00:31:56,919 --> 00:31:58,880 most interesting part of this 824 00:31:58,880 --> 00:32:01,039 presentation, which is the demonstration 825 00:32:01,039 --> 00:32:03,760 of an end-to-end machine learning workflow 826 00:32:03,760 --> 00:32:06,080 on a real life dataset that 827 00:32:06,080 --> 00:32:10,080 demonstrates the use case of predictive 828 00:32:10,080 --> 00:32:13,519 maintenance. So for the data set for 829 00:32:13,519 --> 00:32:16,240 this particular use case, I've used a 830 00:32:16,240 --> 00:32:19,200 data set from Kaggle. So for those of you 831 00:32:19,200 --> 00:32:21,399 who are not aware of this, Kaggle is the 832 00:32:21,399 --> 00:32:24,880 world's largest open-source community 833 00:32:24,880 --> 00:32:28,080 for data science and AI, and they have a 834 00:32:28,080 --> 00:32:31,159 large collection of datasets from all 835 00:32:31,159 --> 00:32:34,440 various areas of industry and human 836 00:32:34,440 --> 00:32:37,039 endeavor, and they also have a large 837 00:32:37,039 --> 00:32:38,840 collection of models that have been 838 00:32:38,840 --> 00:32:42,880 developed using these data sets. So here 839 00:32:42,880 --> 00:32:47,039 we have a data set for the particular 840 00:32:47,039 --> 00:32:50,519 use case, predictive maintenance, okay? So 841 00:32:50,519 --> 00:32:52,919 this is some information about the data 842 00:32:52,919 --> 00:32:56,440 set, so in case you do not know how 843 00:32:56,440 --> 00:32:59,200 to get there, this is the URL to click 844 00:32:59,200 --> 00:33:02,240 on, okay, to get to that dataset.
So once 845 00:33:02,240 --> 00:33:05,120 you're at the dataset here, or the 846 00:33:05,120 --> 00:33:07,399 page for this dataset, you can see 847 00:33:07,399 --> 00:33:09,960 all the information about this data set, 848 00:33:09,960 --> 00:33:12,959 and you can download the data set in 849 00:33:12,959 --> 00:33:14,159 CSV format. 850 00:33:14,159 --> 00:33:16,360 Okay, so let's take a look at the 851 00:33:16,360 --> 00:33:19,559 dataset. So this dataset has a total of 852 00:33:19,559 --> 00:33:23,440 10,000 samples, okay? And these are the 853 00:33:23,440 --> 00:33:26,279 feature variables: the type, the product 854 00:33:26,279 --> 00:33:28,440 ID, the air temperature, process 855 00:33:28,440 --> 00:33:30,899 temperature, rotational speed, torque, tool 856 00:33:30,899 --> 00:33:34,799 wear, and this is the target variable, 857 00:33:34,799 --> 00:33:36,720 all right? So the target variable is what 858 00:33:36,720 --> 00:33:38,159 we are interested in, what we are 859 00:33:38,159 --> 00:33:40,960 interested in using to train the machine 860 00:33:40,960 --> 00:33:42,600 learning model, and also what we are 861 00:33:42,600 --> 00:33:45,279 interested in predicting, okay? So these are 862 00:33:45,279 --> 00:33:47,960 the feature variables, they describe or 863 00:33:47,960 --> 00:33:49,960 they provide information about this 864 00:33:49,960 --> 00:33:52,880 particular machine on the production 865 00:33:52,880 --> 00:33:55,080 line, on the assembly line, so you might 866 00:33:55,080 --> 00:33:56,799 know the product ID, the type, the air 867 00:33:56,799 --> 00:33:58,120 temperature, process temperature, 868 00:33:58,120 --> 00:34:00,480 rotational speed, torque, tool wear, right? So 869 00:34:00,480 --> 00:34:03,159 let's say you've got an IoT sensor system 870 00:34:03,159 --> 00:34:06,120 that's basically capturing all this data 871 00:34:06,120 --> 00:34:08,359 about a product or a machine on your 872 00:34:08,359 --> 00:34:10,679 production or assembly line, okay? And 873 00:34:10,679 --> 00:34:13,918 you've also captured information about 874 00:34:13,918 --> 00:34:17,199 whether, for a specific sample, 875 00:34:17,199 --> 00:34:19,839 that sample experienced a 876 00:34:19,839 --> 00:34:23,040 failure or not, okay? So a target value 877 00:34:23,040 --> 00:34:25,520 of zero, okay, indicates that there's no 878 00:34:25,520 --> 00:34:28,000 failure. So zero means no failure, and we 879 00:34:28,000 --> 00:34:30,199 can see that the vast majority of data 880 00:34:30,199 --> 00:34:32,520 points in this data set are no failure. 881 00:34:32,520 --> 00:34:34,000 And here we can see an example here 882 00:34:34,000 --> 00:34:36,719 where you have a case of a failure, so a 883 00:34:36,719 --> 00:34:40,159 failure is marked as a one, positive, and 884 00:34:40,159 --> 00:34:42,639 no failure is marked as zero, negative, 885 00:34:42,639 --> 00:34:44,879 all right? So here we have one type of 886 00:34:44,879 --> 00:34:47,040 failure, it's called a power failure. And 887 00:34:47,040 --> 00:34:49,000 if you scroll down the data set, you see 888 00:34:49,000 --> 00:34:50,399 there are also other kinds of failures 889 00:34:50,399 --> 00:34:52,839 like a tool wear 890 00:34:52,839 --> 00:34:56,960 failure, we have an overstrain failure 891 00:34:56,960 --> 00:34:58,680 here, for example, 892 00:34:58,680 --> 00:35:00,760 we also have a power failure again, 893 00:35:00,760 --> 00:35:02,200 and so on.
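In pandas, this same first look can be sketched in a few lines (the "Target" and "Failure Type" column names are assumptions based on the dataset page):

import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

print(df.head())                          # the first few of the 10,000 samples
print(df["Target"].value_counts())        # mostly 0 (no failure), a few 1 (failure)
print(df["Failure Type"].value_counts())  # power failure, tool wear failure, overstrain failure, ...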
So if you scroll down through 894 00:35:02,200 --> 00:35:04,160 these 10,000 data points, or if 895 00:35:04,160 --> 00:35:06,040 you're familiar with using Excel to 896 00:35:06,040 --> 00:35:08,839 filter out values in a column, you can 897 00:35:08,839 --> 00:35:12,280 see that in this particular column here, 898 00:35:12,280 --> 00:35:14,480 which is the so-called target variable 899 00:35:14,480 --> 00:35:16,960 column, you are going to have the vast 900 00:35:16,960 --> 00:35:18,920 majority of values as zero, which means 901 00:35:18,920 --> 00:35:22,760 no failure, and for some of the rows or 902 00:35:22,760 --> 00:35:24,040 data points you are going to have a 903 00:35:24,040 --> 00:35:26,359 value of one, and for those rows where you 904 00:35:26,359 --> 00:35:28,119 have a value of one, for example, 905 00:35:28,119 --> 00:35:31,280 here, you 906 00:35:31,280 --> 00:35:32,839 are going to have different types of 907 00:35:32,839 --> 00:35:34,640 failures, so like I said just now, power 908 00:35:34,640 --> 00:35:38,960 failure, tool wear failure, etc, etc. So we are 909 00:35:38,960 --> 00:35:40,640 going to go through the entire machine 910 00:35:40,640 --> 00:35:43,759 learning workflow process with this dataset. 911 00:35:43,759 --> 00:35:46,640 So to see an example of that, we are 912 00:35:46,640 --> 00:35:50,400 going to go to the 913 00:35:50,400 --> 00:35:52,280 code section here, all right, so if I 914 00:35:52,280 --> 00:35:54,280 click on the code section here. And right 915 00:35:54,280 --> 00:35:56,400 down here we can see what is called a 916 00:35:56,400 --> 00:35:59,359 dataset notebook. So this is basically a 917 00:35:59,359 --> 00:36:02,319 Jupyter notebook. Jupyter is basically a 918 00:36:02,319 --> 00:36:05,280 Python application which allows you to 919 00:36:05,280 --> 00:36:09,240 create a Python machine learning 920 00:36:09,240 --> 00:36:11,680 program that basically builds your 921 00:36:11,680 --> 00:36:14,520 machine learning model, assesses or 922 00:36:14,520 --> 00:36:16,480 evaluates its accuracy, and generates 923 00:36:16,480 --> 00:36:19,040 predictions from it, okay? So here we have 924 00:36:19,040 --> 00:36:21,680 a whole bunch of Jupyter notebooks that 925 00:36:21,680 --> 00:36:24,560 are available, and you can select any one 926 00:36:24,560 --> 00:36:26,000 of them. All these notebooks are 927 00:36:26,000 --> 00:36:28,720 essentially going to process the data 928 00:36:28,720 --> 00:36:31,720 from this particular dataset. So if I go 929 00:36:31,720 --> 00:36:34,720 to this code page here, I've actually 930 00:36:34,720 --> 00:36:37,319 selected a specific notebook that I'm 931 00:36:37,319 --> 00:36:39,960 going to run through to demonstrate an 932 00:36:39,960 --> 00:36:42,839 end-to-end machine learning workflow using 933 00:36:42,839 --> 00:36:45,560 various machine learning libraries from 934 00:36:45,560 --> 00:36:49,800 the Python programming language, okay? So 935 00:36:49,800 --> 00:36:52,440 the particular notebook I'm going to 936 00:36:52,440 --> 00:36:55,160 use is this particular notebook here, and 937 00:36:55,160 --> 00:36:57,160 you can also get the URL for that 938 00:36:57,160 --> 00:37:00,440 particular notebook from here. 939 00:37:00,440 --> 00:37:03,760 Okay, so let's do a quick 940 00:37:03,760 --> 00:37:05,974 revision again. What are we trying to do 941 00:37:05,974 --> 00:37:08,000 here?
We're trying to build a machine 942 00:37:08,000 --> 00:37:11,359 learning classification model, right? So 943 00:37:11,359 --> 00:37:12,960 we said there are two primary areas of 944 00:37:12,960 --> 00:37:14,560 supervised learning, one is regression, 945 00:37:14,560 --> 00:37:16,200 which is used to predict a numerical 946 00:37:16,200 --> 00:37:18,640 target variable, and the second kind of 947 00:37:18,640 --> 00:37:21,359 supervised learning is classification, 948 00:37:21,359 --> 00:37:23,079 which is what we're doing here. We're 949 00:37:23,079 --> 00:37:25,839 trying to predict a categorical target 950 00:37:25,839 --> 00:37:29,680 variable, okay? So in this particular 951 00:37:29,680 --> 00:37:32,119 example, we actually have two 952 00:37:32,119 --> 00:37:34,480 ways we can classify, either a binary 953 00:37:34,480 --> 00:37:36,679 classification or a multiclass 954 00:37:36,679 --> 00:37:39,520 classification. So for binary 955 00:37:39,520 --> 00:37:41,440 classification, we are only going to 956 00:37:41,440 --> 00:37:43,400 classify the product or machine as 957 00:37:43,400 --> 00:37:47,160 either it failed or it did not fail, okay? 958 00:37:47,160 --> 00:37:48,880 So if we go back to the dataset that I 959 00:37:48,880 --> 00:37:50,839 showed you just now, if you look at this 960 00:37:50,839 --> 00:37:52,680 target variable column, there are only 961 00:37:52,680 --> 00:37:54,520 two possible values here. They are either 962 00:37:54,520 --> 00:37:58,280 zero or one. Zero means there's no failure. 963 00:37:58,280 --> 00:38:01,240 One means there's a failure, okay? So this 964 00:38:01,240 --> 00:38:03,440 is an example of a binary classification. 965 00:38:03,440 --> 00:38:07,240 Only two possible outcomes, zero or one, 966 00:38:07,240 --> 00:38:10,119 didn't fail or failed, all right? Two 967 00:38:10,119 --> 00:38:13,079 possible outcomes. And then we can also, 968 00:38:13,079 --> 00:38:15,480 for the same dataset, we can extend it 969 00:38:15,480 --> 00:38:18,079 and make it a multiclass classification 970 00:38:18,079 --> 00:38:20,880 problem, all right? So if we want 971 00:38:20,880 --> 00:38:23,720 to drill down further, we can say that 972 00:38:23,720 --> 00:38:26,800 not only is there a failure, we can 973 00:38:26,800 --> 00:38:29,200 actually say there are different types of 974 00:38:29,200 --> 00:38:32,440 failures, okay? So we have one category or 975 00:38:32,440 --> 00:38:35,599 class that is basically no failure, okay? 976 00:38:35,599 --> 00:38:37,400 Then we have a category for each of the 977 00:38:37,400 --> 00:38:40,400 different types of failures, right? So you 978 00:38:40,400 --> 00:38:43,920 can have a power failure, you could have 979 00:38:43,920 --> 00:38:46,400 a tool wear failure, 980 00:38:46,400 --> 00:38:48,920 you could have, let's go down 981 00:38:48,920 --> 00:38:50,880 here, you could have an overstrain 982 00:38:50,880 --> 00:38:53,760 failure, etc, etc. So you can have 983 00:38:53,760 --> 00:38:57,160 multiple classes of failure in addition 984 00:38:57,160 --> 00:39:00,520 to the general overall or the majority 985 00:39:00,520 --> 00:39:04,319 class of no failure, and that would be a 986 00:39:04,319 --> 00:39:06,680 multiclass classification problem.
So 987 00:39:06,680 --> 00:39:08,400 with this data set, we are going to see 988 00:39:08,400 --> 00:39:11,040 how to make it a binary classification 989 00:39:11,040 --> 00:39:12,800 problem and also a multiclass 990 00:39:12,800 --> 00:39:15,079 classification problem. Okay, so let's 991 00:39:15,079 --> 00:39:16,880 look at the workflow. So let's say we've 992 00:39:16,880 --> 00:39:18,880 already got the data, so right now we do 993 00:39:18,880 --> 00:39:20,839 have the dataset. This is the dataset 994 00:39:20,839 --> 00:39:22,720 that we have, so let's assume we've 995 00:39:22,720 --> 00:39:24,560 somehow managed to get this dataset 996 00:39:24,560 --> 00:39:26,880 from some IoT sensors that are 997 00:39:26,880 --> 00:39:29,119 monitoring real-time data in our 998 00:39:29,119 --> 00:39:31,079 production environment. On the assembly 999 00:39:31,079 --> 00:39:32,800 line, on the production line, we've got 1000 00:39:32,800 --> 00:39:34,680 sensors reading data that gives us all 1001 00:39:34,680 --> 00:39:37,960 the data that we have in this CSV file. 1002 00:39:37,960 --> 00:39:40,079 Okay, so we've already got the data, we've 1003 00:39:40,079 --> 00:39:41,599 retrieved the data, now we're going to go 1004 00:39:41,599 --> 00:39:45,000 on to the cleaning and exploration part 1005 00:39:45,000 --> 00:39:47,520 of your machine learning life cycle. All 1006 00:39:47,520 --> 00:39:49,800 right, so let's look at the data cleaning 1007 00:39:49,800 --> 00:39:51,400 part. In the data cleaning part, we're 1008 00:39:51,400 --> 00:39:53,720 interested in checking for missing 1009 00:39:53,720 --> 00:39:56,200 values and maybe removing the rows with 1010 00:39:56,200 --> 00:39:58,079 missing values, okay? 1011 00:39:58,079 --> 00:39:59,760 So the kind of things we can do with missing 1012 00:39:59,760 --> 00:40:01,000 values: we can remove the rows with missing 1013 00:40:01,000 --> 00:40:02,880 values, or we can put in some new values, 1014 00:40:02,880 --> 00:40:05,839 some replacement values, which could be an 1015 00:40:05,839 --> 00:40:08,000 average of all the values in that 1016 00:40:08,000 --> 00:40:09,880 particular column, etc, etc (a quick sketch of 1017 00:40:09,880 --> 00:40:12,880 these options follows below); we could also try to 1018 00:40:12,880 --> 00:40:15,480 identify outliers in our data set, and 1019 00:40:15,480 --> 00:40:17,480 there are a variety of ways to deal 1020 00:40:17,480 --> 00:40:19,480 with those too. So this is called data 1021 00:40:19,480 --> 00:40:21,359 cleansing, which is a really important 1022 00:40:21,359 --> 00:40:23,319 part of your machine learning workflow, 1023 00:40:23,319 --> 00:40:25,520 right? So that's where we are now at, 1024 00:40:25,520 --> 00:40:26,839 we're doing cleansing, and then we're 1025 00:40:26,839 --> 00:40:27,939 going to follow up with 1026 00:40:27,939 --> 00:40:31,160 exploration. So let's look at the actual 1027 00:40:31,160 --> 00:40:33,160 code that does the cleansing here. So 1028 00:40:33,160 --> 00:40:35,800 here we are right at the start of the 1029 00:40:35,800 --> 00:40:38,248 machine learning life cycle here, so 1030 00:40:38,248 --> 00:40:40,839 this is a Jupyter notebook. So here we 1031 00:40:40,839 --> 00:40:43,359 have a brief description of the problem 1032 00:40:43,359 --> 00:40:45,920 statement, all right?
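Here is that quick sketch of the missing-value options in pandas (a minimal sketch; the file and column names are assumptions):

import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

# Count missing values per column
print(df.isna().sum())

# Option 1: drop the rows that contain missing values
df_dropped = df.dropna()

# Option 2: replace missing numeric values with the column average
df["Torque [Nm]"] = df["Torque [Nm]"].fillna(df["Torque [Nm]"].mean())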
So this dataset 1033 00:40:45,920 --> 00:40:47,640 reflects real life predictive 1034 00:40:47,640 --> 00:40:49,240 maintenance encountered in industry, with 1035 00:40:49,240 --> 00:40:50,480 measurements from real equipment. The 1036 00:40:50,480 --> 00:40:52,400 feature descriptions are taken directly 1037 00:40:52,400 --> 00:40:54,520 from the dataset source. So here we have 1038 00:40:54,520 --> 00:40:57,400 a description of the six key features in 1039 00:40:57,400 --> 00:40:59,599 our dataset: the type, which is the quality 1040 00:40:59,599 --> 00:41:02,520 of the product, the air temperature, the 1041 00:41:02,520 --> 00:41:04,680 process temperature, the rotational speed, 1042 00:41:04,680 --> 00:41:06,599 the torque, and the tool wear, all right? So 1043 00:41:06,599 --> 00:41:08,880 these are the six feature variables, and 1044 00:41:08,880 --> 00:41:11,319 there are the two target variables, so, 1045 00:41:11,319 --> 00:41:13,119 as I showed you just now, there's 1046 00:41:13,119 --> 00:41:15,119 one target variable which only has two 1047 00:41:15,119 --> 00:41:17,440 possible values, either zero or one, okay? 1048 00:41:17,440 --> 00:41:20,079 Zero or one means no failure or failure, 1049 00:41:20,079 --> 00:41:23,079 so that will be this column here, right? 1050 00:41:23,079 --> 00:41:24,880 So let me go all the way back up to here. 1051 00:41:24,880 --> 00:41:26,640 So this column here, we already saw it 1052 00:41:26,640 --> 00:41:29,440 only has two possible values, it's either zero or 1053 00:41:29,440 --> 00:41:32,680 one. And then we also have this column 1054 00:41:32,680 --> 00:41:35,040 here, and this column here is basically 1055 00:41:35,040 --> 00:41:38,079 the failure type. And, as I 1056 00:41:38,079 --> 00:41:40,800 already demonstrated just now, we do have 1057 00:41:40,800 --> 00:41:43,440 several categories of types of 1058 00:41:43,440 --> 00:41:45,560 failure, and so here we call this 1059 00:41:45,560 --> 00:41:46,235 multiclass 1060 00:41:46,235 --> 00:41:50,000 classification. So we can either build a 1061 00:41:50,000 --> 00:41:51,839 binary classification model for this 1062 00:41:51,839 --> 00:41:53,520 problem domain, or we can build a 1063 00:41:53,520 --> 00:41:54,491 multiclass 1064 00:41:54,491 --> 00:41:58,119 classification model, all right. So this 1065 00:41:58,119 --> 00:41:59,839 Jupyter notebook is going to demonstrate 1066 00:41:59,839 --> 00:42:02,319 both approaches to us. So first step, we 1067 00:42:02,319 --> 00:42:04,800 are going to write all this Python code 1068 00:42:04,800 --> 00:42:06,880 that's going to import all the libraries 1069 00:42:06,880 --> 00:42:09,079 that we need to use, okay? So this is 1070 00:42:09,079 --> 00:42:12,319 basically Python code, okay, and it's 1071 00:42:12,319 --> 00:42:15,119 importing the relevant 1072 00:42:15,119 --> 00:42:17,960 machine learning libraries related to 1073 00:42:17,960 --> 00:42:20,599 our domain use case, okay? Then we load in 1074 00:42:20,599 --> 00:42:23,520 our dataset, okay, so this is our dataset. 1075 00:42:23,520 --> 00:42:26,440 We describe it, we get some quick 1076 00:42:26,440 --> 00:42:28,319 insights into the dataset. And then 1077 00:42:28,319 --> 00:42:30,920 we just take a look at all the 1078 00:42:30,920 --> 00:42:32,839 feature variables, 1079 00:42:32,839 --> 00:42:36,000 etc, and so on.
1080 00:42:36,000 --> 00:42:38,000 What we're doing now is just 1081 00:42:38,000 --> 00:42:39,800 doing a quick overview of the dataset, 1082 00:42:39,800 --> 00:42:41,559 so all this Python code here that 1083 00:42:41,559 --> 00:42:43,760 we're writing is allowing us, the data 1084 00:42:43,760 --> 00:42:45,359 scientist, to get a quick overview of our 1085 00:42:45,359 --> 00:42:48,209 dataset, right, okay, like 1086 00:42:48,209 --> 00:42:50,240 how many rows are there, how many columns 1087 00:42:50,240 --> 00:42:51,760 are there, what are the data types of the 1088 00:42:51,760 --> 00:42:53,440 columns, what are the names of the columns, 1089 00:42:53,440 --> 00:42:57,359 etc, etc. Okay, then we zoom in on the 1090 00:42:57,359 --> 00:42:58,839 target variables. So we look at the 1091 00:42:58,839 --> 00:43:02,000 target variables, how many counts 1092 00:43:02,000 --> 00:43:04,520 there are of each target variable value, and 1093 00:43:04,520 --> 00:43:06,440 so on. How many different types of 1094 00:43:06,440 --> 00:43:08,240 failures there are. Then you want to 1095 00:43:08,240 --> 00:43:09,000 check whether there are any 1096 00:43:09,000 --> 00:43:10,760 inconsistencies between the target and 1097 00:43:10,760 --> 00:43:13,559 the failure type, etc. Okay, so when you do 1098 00:43:13,559 --> 00:43:15,119 all this checking, you're going to 1099 00:43:15,119 --> 00:43:16,960 discover there are some discrepancies in 1100 00:43:16,960 --> 00:43:20,280 your dataset, so using some specific Python 1101 00:43:20,280 --> 00:43:21,839 code to do the checking, you're going to say, 1102 00:43:21,839 --> 00:43:23,480 hey, you know what? There are some errors 1103 00:43:23,480 --> 00:43:25,000 here, right? There are nine values that are 1104 00:43:25,000 --> 00:43:26,599 classified as failure in the target variable, 1105 00:43:26,599 --> 00:43:28,200 but as no failure in the failure type 1106 00:43:28,200 --> 00:43:29,720 variable, so that means there's a 1107 00:43:29,720 --> 00:43:33,200 discrepancy in your data points, right? 1108 00:43:33,200 --> 00:43:34,760 So these are all the ones that 1109 00:43:34,760 --> 00:43:36,359 are discrepancies, because the target 1110 00:43:36,359 --> 00:43:39,000 variable says one, and we already know 1111 00:43:39,000 --> 00:43:41,240 that target variable one is supposed to 1112 00:43:41,240 --> 00:43:43,099 mean there is a failure, right? Target 1113 00:43:43,099 --> 00:43:44,880 variable one is supposed to mean there is 1114 00:43:44,880 --> 00:43:47,119 a failure, so we are kind of expecting to 1115 00:43:47,119 --> 00:43:49,680 see the failure classification, but some 1116 00:43:49,680 --> 00:43:51,400 rows actually say there's no failure 1117 00:43:51,400 --> 00:43:53,800 although the target variable is one. Well, here 1118 00:43:53,800 --> 00:43:55,920 is a classic example of an error that 1119 00:43:55,920 --> 00:43:58,640 can very well occur in a dataset, so now 1120 00:43:58,640 --> 00:44:00,559 the question is what do you do with 1121 00:44:00,559 --> 00:44:04,720 these errors in your dataset, right?
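A minimal sketch of that overview and consistency check (the "Target" and "Failure Type" column names and the "No Failure" label are assumptions based on the dataset page):

import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

print(df.shape)   # number of rows and columns
print(df.dtypes)  # data type of each column

# Rows flagged as a failure in the target but "No Failure" in the failure type
mask = (df["Target"] == 1) & (df["Failure Type"] == "No Failure")
print(mask.sum(), "inconsistent rows found")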
So 1122 00:44:04,720 --> 00:44:06,240 here the data scientist says, I think it 1123 00:44:06,240 --> 00:44:07,520 would make sense to remove those 1124 00:44:07,520 --> 00:44:09,920 instances, and so they write some code 1125 00:44:09,920 --> 00:44:12,680 then to remove those instances or those 1126 00:44:12,680 --> 00:44:14,920 rows or data points from the overall 1127 00:44:14,920 --> 00:44:17,280 data set, and same thing, we can, again, 1128 00:44:17,280 --> 00:44:19,240 check for other issues. So we find there's 1129 00:44:19,240 --> 00:44:21,160 another issue here with our data set which 1130 00:44:21,160 --> 00:44:24,079 is another warning, so, again, we can 1131 00:44:24,079 --> 00:44:26,240 possibly remove them. So you're going to 1132 00:44:26,240 --> 00:44:31,280 remove 27 instances or rows from your 1133 00:44:31,280 --> 00:44:34,440 overall data set. So your data set has 1134 00:44:34,440 --> 00:44:37,079 10,000 rows or data points. You're 1135 00:44:37,079 --> 00:44:40,160 removing 27, which is only 0.27% of the 1136 00:44:40,160 --> 00:44:42,240 entire dataset. And these were the 1137 00:44:42,240 --> 00:44:45,720 reasons why you removed them, okay? So if 1138 00:44:45,720 --> 00:44:48,160 you're just removing 0.27% of the 1139 00:44:48,160 --> 00:44:50,800 entire dataset, no big deal, right? Still 1140 00:44:50,800 --> 00:44:53,079 okay, but you needed to remove them 1141 00:44:53,079 --> 00:44:55,000 because these errors, right, these 1142 00:44:55,000 --> 00:44:58,040 27 1143 00:44:58,040 --> 00:45:00,559 erroneous data points in 1144 00:45:00,559 --> 00:45:02,960 your dataset, could really affect the 1145 00:45:02,960 --> 00:45:05,000 training of your machine learning model. 1146 00:45:05,000 --> 00:45:08,640 So we need to do our data cleansing, 1147 00:45:08,640 --> 00:45:11,720 right? So we are now actually cleansing 1148 00:45:11,720 --> 00:45:15,200 data that is 1149 00:45:15,200 --> 00:45:17,520 incorrect or erroneous in your original 1150 00:45:17,520 --> 00:45:21,440 dataset. Okay, so then we go on to the 1151 00:45:21,440 --> 00:45:23,839 next part, which is called EDA, exploratory data analysis, right? So 1152 00:45:23,839 --> 00:45:28,880 EDA is where we kind of explore our data, 1153 00:45:28,880 --> 00:45:31,720 and we want to, kind of, get a visual 1154 00:45:31,720 --> 00:45:34,240 overview of our data as a whole, and also 1155 00:45:34,240 --> 00:45:35,880 take a look at the statistical 1156 00:45:35,880 --> 00:45:38,160 properties of our data. The statistical 1157 00:45:38,160 --> 00:45:40,480 distribution of the data in all the 1158 00:45:40,480 --> 00:45:43,079 various columns, the correlation between 1159 00:45:43,079 --> 00:45:44,640 the variables, between the different feature 1160 00:45:44,640 --> 00:45:46,680 variable columns, and also between the 1161 00:45:46,680 --> 00:45:48,599 feature variables and the target variable. 1162 00:45:48,599 --> 00:45:52,040 So all of this is called EDA, and EDA in 1163 00:45:52,040 --> 00:45:54,079 a machine learning workflow is typically 1164 00:45:54,079 --> 00:45:57,160 done through visualization, 1165 00:45:57,160 --> 00:45:58,839 all right? So let's go back here and take 1166 00:45:58,839 --> 00:46:00,599 a look, right?
So, for example, here we are 1167 00:46:00,599 --> 00:46:03,400 looking at correlation, so we plot the 1168 00:46:03,400 --> 00:46:05,680 values of all the various feature 1169 00:46:05,680 --> 00:46:07,599 variables against each other and look 1170 00:46:07,599 --> 00:46:10,800 for potential correlations and patterns 1171 00:46:10,800 --> 00:46:13,359 and so on. And all the different shapes 1172 00:46:13,359 --> 00:46:17,280 that you see here in this pair plot, okay, 1173 00:46:17,280 --> 00:46:18,400 will have different meanings, 1174 00:46:18,400 --> 00:46:20,000 statistical meanings, and so the data 1175 00:46:20,000 --> 00:46:21,800 scientist has to, kind of, visually 1176 00:46:21,800 --> 00:46:23,760 inspect this pair plot, make some 1177 00:46:23,760 --> 00:46:25,559 interpretations of these different 1178 00:46:25,559 --> 00:46:27,680 patterns that he sees here, all right. So 1179 00:46:27,680 --> 00:46:30,480 these are some of the insights that 1180 00:46:30,480 --> 00:46:32,839 can be deduced from looking at these 1181 00:46:32,839 --> 00:46:34,319 patterns, so, for example, the torque and 1182 00:46:34,319 --> 00:46:36,280 rotational speed are highly correlated, 1183 00:46:36,280 --> 00:46:38,040 the process temperature and air 1184 00:46:38,040 --> 00:46:39,920 temperature are also highly correlated, and 1185 00:46:39,920 --> 00:46:41,559 failures occur for extreme values of 1186 00:46:41,559 --> 00:46:44,520 some features, etc, etc. Then you can plot 1187 00:46:44,520 --> 00:46:45,960 certain kinds of charts. This is called a 1188 00:46:45,960 --> 00:46:48,480 violin chart, to, again, get new insights. 1189 00:46:48,480 --> 00:46:49,839 For example, regarding the torque and 1190 00:46:49,839 --> 00:46:51,480 rotational speed, we can see, again, that 1191 00:46:51,480 --> 00:46:53,119 most failures are triggered at much 1192 00:46:53,119 --> 00:46:55,119 lower or much higher values than the 1193 00:46:55,119 --> 00:46:57,400 mean of the non-failing samples. So all 1194 00:46:57,400 --> 00:47:00,720 these visualizations, they are there, and 1195 00:47:00,720 --> 00:47:02,480 a trained data scientist can look at 1196 00:47:02,480 --> 00:47:05,079 them, inspect them, and make some kind of 1197 00:47:05,079 --> 00:47:08,400 insightful deductions from them, okay? 1198 00:47:08,400 --> 00:47:11,079 Percentage of failure, right? The 1199 00:47:11,079 --> 00:47:13,640 correlation heat map, okay, between all 1200 00:47:13,640 --> 00:47:15,559 these different feature variables, and 1201 00:47:15,559 --> 00:47:16,430 also the target 1202 00:47:16,430 --> 00:47:19,599 variable, okay? The product types, 1203 00:47:19,599 --> 00:47:21,079 percentage of product types, percentage 1204 00:47:21,079 --> 00:47:23,160 of failure with respect to the product 1205 00:47:23,160 --> 00:47:25,720 type, so we can also kind of visualize 1206 00:47:25,720 --> 00:47:27,800 that as well. So certain products have a 1207 00:47:27,800 --> 00:47:29,839 higher ratio of failure compared to other 1208 00:47:29,839 --> 00:47:33,240 product types, etc. Or, for example, M 1209 00:47:33,240 --> 00:47:35,800 products tend to fail more than H products, etc, 1210 00:47:35,800 --> 00:47:38,880 etc. So we can create a vast variety of 1211 00:47:38,880 --> 00:47:41,319 visualizations in the EDA stage, so you 1212 00:47:41,319 --> 00:47:43,960 can see here.
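A minimal sketch of these EDA plots with seaborn (column names assumed as before):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("predictive_maintenance.csv")
num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# Pair plot of the numeric features, colored by failure / no failure
sns.pairplot(df, vars=num_cols, hue="Target")

# Violin plot of torque for failed vs. non-failed samples
plt.figure()
sns.violinplot(data=df, x="Target", y="Torque [Nm]")

# Correlation heat map between the numeric columns and the target
plt.figure()
sns.heatmap(df[num_cols + ["Target"]].corr(), annot=True, cmap="coolwarm")
plt.show()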
And, again, the idea of this 1213 00:47:43,960 --> 00:47:46,359 visualization is just to give us some 1214 00:47:46,359 --> 00:47:49,680 insight, some preliminary insight into 1215 00:47:49,680 --> 00:47:52,520 our dataset that helps us to model it 1216 00:47:52,520 --> 00:47:54,119 more correctly. So these are some more insights 1217 00:47:54,119 --> 00:47:56,200 that we get into our data set from all 1218 00:47:56,200 --> 00:47:57,599 this visualization. 1219 00:47:57,599 --> 00:47:59,559 Then we can plot the distribution, so we 1220 00:47:59,559 --> 00:48:00,720 can see whether it's a normal 1221 00:48:00,720 --> 00:48:02,789 distribution or some other kind of 1222 00:48:02,789 --> 00:48:05,640 distribution. We can have a box plot 1223 00:48:05,640 --> 00:48:07,760 to see whether there are any outliers in 1224 00:48:07,760 --> 00:48:10,400 your data set and so on, right? So from 1225 00:48:10,400 --> 00:48:11,640 the box plots, we can see 1226 00:48:11,640 --> 00:48:14,599 that rotational speed does have outliers. So we 1227 00:48:14,599 --> 00:48:16,880 already saw outliers are basically a 1228 00:48:16,880 --> 00:48:18,800 problem that you may need to kind of 1229 00:48:18,800 --> 00:48:22,520 tackle, right? So outliers are an issue, 1230 00:48:22,520 --> 00:48:24,800 it's a part of data cleansing. And 1231 00:48:24,800 --> 00:48:26,960 so you may need to tackle this, so we may 1232 00:48:26,960 --> 00:48:28,880 have to check, okay, well, where are the 1233 00:48:28,880 --> 00:48:31,319 potential outliers, so we can analyze 1234 00:48:31,319 --> 00:48:35,319 them from the box plot, okay? But then 1235 00:48:35,319 --> 00:48:37,079 we can say, well, they are outliers, but 1236 00:48:37,079 --> 00:48:38,800 maybe they're not really horrible 1237 00:48:38,800 --> 00:48:40,760 outliers, so we can tolerate them, or 1238 00:48:40,760 --> 00:48:42,880 maybe we want to remove them. So we can 1239 00:48:42,880 --> 00:48:44,920 see the mean and maximum values for 1240 00:48:44,920 --> 00:48:46,720 all of these with respect to product type, 1241 00:48:46,720 --> 00:48:49,680 how many of them lie above the thresholds, and 1242 00:48:49,680 --> 00:48:51,440 how that relates to the product type in 1243 00:48:51,440 --> 00:48:54,240 terms of the maximum and minimum, okay, 1244 00:48:54,240 --> 00:48:56,960 and so on. So the insight is, well, 1245 00:48:56,960 --> 00:48:59,599 4.87% of the instances are outliers, 1246 00:48:59,599 --> 00:49:02,559 and maybe 4.87% is not really that much, 1247 00:49:02,559 --> 00:49:04,920 the outliers are not horrible, so we just 1248 00:49:04,920 --> 00:49:06,960 leave them in the dataset. Now for a 1249 00:49:06,960 --> 00:49:08,520 different dataset, the data scientist 1250 00:49:08,520 --> 00:49:10,280 could come to a different conclusion, so 1251 00:49:10,280 --> 00:49:12,280 then they would do whatever they've 1252 00:49:12,280 --> 00:49:15,400 deemed appropriate to, kind of, cleanse 1253 00:49:15,400 --> 00:49:18,079 the dataset; that kind of 1254 00:49:18,079 --> 00:49:20,000 box-plot outlier check is 1255 00:49:20,000 --> 00:49:23,160 only a few lines of 1256 00:49:23,160 --> 00:49:26,200 code, as 1257 00:49:26,200 --> 00:49:28,760 sketched 1258 00:49:28,760 --> 00:49:31,280 just 1259 00:49:31,280 --> 00:49:32,960 below.
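A minimal sketch of that outlier check, using the standard 1.5 x IQR box-plot rule (the column name is an assumption):

import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

col = df["Rotational speed [rpm]"]
q1, q3 = col.quantile(0.25), col.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
outliers = df[(col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)]
print(f"{len(outliers) / len(df):.2%} of the instances are outliers")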
Okay, so now that we have done all the EDA, the next thing we're going to do is what is called feature engineering. These are our 1260 00:49:32,960 --> 00:49:35,040 original feature variables, and we are 1261 00:49:35,040 --> 00:49:37,760 going to transform them, all right? We're 1262 00:49:37,760 --> 00:49:40,319 going to transform them, in some sense, 1263 00:49:40,319 --> 00:49:43,760 into some other form before we feed them 1264 00:49:43,760 --> 00:49:45,640 for training into our machine learning 1265 00:49:45,640 --> 00:49:48,599 algorithm, all right? So let's say these 1266 00:49:48,599 --> 00:49:51,599 are examples from an 1267 00:49:51,599 --> 00:49:55,200 original data set, right? And these are 1268 00:49:55,200 --> 00:49:56,839 some examples, 1269 00:49:56,839 --> 00:49:58,040 you don't have to use all of them, but 1270 00:49:58,040 --> 00:49:59,440 these are some of the examples of what we 1271 00:49:59,440 --> 00:50:00,839 call feature engineering, with which you can 1272 00:50:00,839 --> 00:50:03,559 transform the original values in 1273 00:50:03,559 --> 00:50:05,280 your feature variables to all these 1274 00:50:05,280 --> 00:50:07,920 transformed values here. So we're going to 1275 00:50:07,920 --> 00:50:09,680 pretty much do that here, so we have an 1276 00:50:09,680 --> 00:50:12,599 ordinal encoding, we do scaling of the 1277 00:50:12,599 --> 00:50:14,839 data so the dataset is scaled, we use 1278 00:50:14,839 --> 00:50:18,240 MinMax scaling, and then finally, we come 1279 00:50:18,240 --> 00:50:21,720 to the modeling. So we have to split our 1280 00:50:21,720 --> 00:50:24,359 dataset into a training dataset and a 1281 00:50:24,359 --> 00:50:28,640 test dataset. So coming back to here again, 1282 00:50:28,640 --> 00:50:32,160 we said that before you train your 1283 00:50:32,160 --> 00:50:33,799 model, 1284 00:50:33,799 --> 00:50:35,599 you have to take your original dataset, 1285 00:50:35,599 --> 00:50:37,319 which is now a feature-engineered dataset, 1286 00:50:37,319 --> 00:50:38,839 and we're going to break it into two or 1287 00:50:38,839 --> 00:50:40,839 more subsets, okay. So one is called the 1288 00:50:40,839 --> 00:50:42,400 training dataset, which we use to feed 1289 00:50:42,400 --> 00:50:44,000 and train a machine learning model. The 1290 00:50:44,000 --> 00:50:45,920 second is the test dataset, to evaluate the 1291 00:50:45,920 --> 00:50:47,960 accuracy of the model, okay? So we've got 1292 00:50:47,960 --> 00:50:50,939 this training dataset, your test dataset, 1293 00:50:50,939 --> 00:50:52,720 and we also need 1294 00:50:52,720 --> 00:50:56,160 to sample. So from our original data set 1295 00:50:56,160 --> 00:50:57,400 we need to sample some points 1296 00:50:57,400 --> 00:50:58,839 that go into your training dataset, and some 1297 00:50:58,839 --> 00:51:00,559 points that go into your test dataset (but first, 1298 00:51:00,559 --> 00:51:02,720 here is a quick sketch of the encoding and scaling step we just did).
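A minimal sketch of that encoding and scaling (the L < M < H ordering is an assumption; in practice the scaler should be fit on the training portion only, to avoid leaking information from the test set):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("predictive_maintenance.csv")

# Ordinal encoding of the product quality grade (order assumed to be L < M < H)
df["Type"] = df["Type"].map({"L": 0, "M": 1, "H": 2})

num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# MinMax scaling squeezes every numeric column into the [0, 1] range
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])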
So there are many ways to do sampling. One 1299 00:51:02,720 --> 00:51:04,920 way is to do stratified sampling, where 1300 00:51:04,920 --> 00:51:06,720 we ensure the same proportion of data 1301 00:51:06,720 --> 00:51:09,000 from each stratum or class, because right 1302 00:51:09,000 --> 00:51:10,960 now we have a multiclass classification 1303 00:51:10,960 --> 00:51:12,319 problem, so you want to make sure the 1304 00:51:12,319 --> 00:51:13,960 proportion of data from each stratum or 1305 00:51:13,960 --> 00:51:15,839 class is kept the same in the 1306 00:51:15,839 --> 00:51:17,920 training and test datasets as in the 1307 00:51:17,920 --> 00:51:20,119 original dataset, which is very useful 1308 00:51:20,119 --> 00:51:21,640 for dealing with what is called an 1309 00:51:21,640 --> 00:51:24,319 imbalanced dataset. So here we have an 1310 00:51:24,319 --> 00:51:25,839 example of what is called an imbalanced 1311 00:51:25,839 --> 00:51:29,520 dataset, in the sense that the 1312 00:51:29,520 --> 00:51:32,760 vast majority of data points in your 1313 00:51:32,760 --> 00:51:34,960 data set are going to have the 1314 00:51:34,960 --> 00:51:37,480 value of zero for their target variable 1315 00:51:37,480 --> 00:51:40,200 column. So only an extremely small 1316 00:51:40,200 --> 00:51:43,443 minority of the data points in your dataset 1317 00:51:43,443 --> 00:51:45,319 will actually have the value of one 1318 00:51:45,319 --> 00:51:48,720 for their target variable column, okay? So 1319 00:51:48,720 --> 00:51:51,040 a situation where you have your class or 1320 00:51:51,040 --> 00:51:52,520 your target variable column where the 1321 00:51:52,520 --> 00:51:54,480 vast majority of values are from one 1322 00:51:54,480 --> 00:51:58,119 class and a tiny minority are from 1323 00:51:58,119 --> 00:52:00,520 another class, we call this an imbalanced 1324 00:52:00,520 --> 00:52:02,720 dataset. And for an imbalanced dataset, 1325 00:52:02,720 --> 00:52:04,319 typically we will have a specific 1326 00:52:04,319 --> 00:52:05,920 technique to do the train-test split, 1327 00:52:05,920 --> 00:52:08,119 which is called stratified sampling, and 1328 00:52:08,119 --> 00:52:09,599 that's exactly what's happening here. 1329 00:52:09,599 --> 00:52:12,000 We're doing a stratified split here, so 1330 00:52:12,000 --> 00:52:14,839 we are doing a train-test split here, 1331 00:52:14,839 --> 00:52:17,520 and we are doing a stratified split. 1332 00:52:17,520 --> 00:52:20,359 And then now we actually develop the 1333 00:52:20,359 --> 00:52:23,359 models. So now we've got the train-test 1334 00:52:23,359 --> 00:52:25,480 split, now here is where we actually 1335 00:52:25,480 --> 00:52:27,079 train the models. 1336 00:52:27,079 --> 00:52:29,920 Now in terms of classification there are 1337 00:52:29,920 --> 00:52:31,299 a whole bunch of 1338 00:52:31,299 --> 00:52:35,400 possibilities, right, that you can use. 1339 00:52:35,400 --> 00:52:38,480 There are many, many different algorithms 1340 00:52:38,480 --> 00:52:41,000 that we can use to create a 1341 00:52:41,000 --> 00:52:42,839 classification model. These are 1342 00:52:42,839 --> 00:52:45,079 examples of some of the more common ones: 1343 00:52:45,079 --> 00:52:47,480 logistic regression, support vector machines, decision 1344 00:52:47,480 --> 00:52:49,520 trees, random forests, bagging, balanced 1345 00:52:49,520 --> 00:52:52,720 bagging, boosting, ensembles.
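A minimal sketch of this stage, putting the stratified split together with a few of those candidate algorithms (BalancedBaggingClassifier comes from the separate imbalanced-learn package, an assumption about the notebook's tooling; X and y are the feature matrix and target from the earlier sketches):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.metrics import f1_score

# stratify=y keeps the failure / no-failure proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Random forest": RandomForestClassifier(random_state=42),
    "Bagging": BaggingClassifier(random_state=42),
    "Balanced bagging": BalancedBaggingClassifier(random_state=42),
    "AdaBoost (boosting)": AdaBoostClassifier(random_state=42),
}

# Train every candidate on the same training data and compare a first score
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, f1_score(y_test, model.predict(X_test)))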
So all 1346 00:52:52,720 --> 00:52:55,040 these are different algorithms, which 1347 00:52:55,040 --> 00:52:57,760 will create different kinds of models, 1348 00:52:57,760 --> 00:53:01,599 which will result in different accuracy 1349 00:53:01,599 --> 00:53:05,400 measures, okay? So it's the goal of the 1350 00:53:05,400 --> 00:53:08,920 data scientist to find the best model, 1351 00:53:08,920 --> 00:53:11,520 the one that gives the best accuracy for 1352 00:53:11,520 --> 00:53:14,119 training on that 1353 00:53:14,119 --> 00:53:16,880 given dataset. So let's head back, again, 1354 00:53:16,880 --> 00:53:19,760 to our machine learning workflow. So 1355 00:53:19,760 --> 00:53:21,520 here basically what I'm doing is I'm 1356 00:53:21,520 --> 00:53:23,520 creating a whole bunch of models here, 1357 00:53:23,520 --> 00:53:25,520 all right? So one is a random forest, one 1358 00:53:25,520 --> 00:53:27,160 is balanced bagging, one is a boosting 1359 00:53:27,160 --> 00:53:29,520 classifier, one's an ensemble classifier, 1360 00:53:29,520 --> 00:53:32,760 and using all of these, I am going to 1361 00:53:32,760 --> 00:53:35,319 basically feed or train my model using 1362 00:53:35,319 --> 00:53:37,440 all these algorithms. And then I'm going 1363 00:53:37,440 --> 00:53:39,799 to evaluate them, okay? I'm going to 1364 00:53:39,799 --> 00:53:42,480 evaluate how good each of these models 1365 00:53:42,480 --> 00:53:45,760 is. And here you can see your 1366 00:53:45,760 --> 00:53:48,839 evaluation data, right? Okay, and this is 1367 00:53:48,839 --> 00:53:50,839 the confusion matrix, which is another 1368 00:53:50,839 --> 00:53:54,280 way of evaluating. So now we come to, 1369 00:53:54,280 --> 00:53:56,319 kind of, the key part here, which 1370 00:53:56,319 --> 00:53:58,520 is how do I distinguish between 1371 00:53:58,520 --> 00:54:00,079 all these models, right? I've got all 1372 00:54:00,079 --> 00:54:01,400 these different models which are built 1373 00:54:01,400 --> 00:54:03,040 with different algorithms which I'm 1374 00:54:03,040 --> 00:54:05,359 using to train on the same dataset, how 1375 00:54:05,359 --> 00:54:07,359 do I distinguish between all these 1376 00:54:07,359 --> 00:54:10,359 models, okay? And so for 1377 00:54:10,359 --> 00:54:13,880 that we actually have a whole bunch of 1378 00:54:13,880 --> 00:54:16,200 common evaluation metrics for 1379 00:54:16,200 --> 00:54:18,319 classification, right? So these evaluation 1380 00:54:18,319 --> 00:54:22,240 metrics tell us how good a model is in 1381 00:54:22,240 --> 00:54:24,319 terms of its accuracy in 1382 00:54:24,319 --> 00:54:27,000 classification. So in terms of 1383 00:54:27,000 --> 00:54:29,440 accuracy, we actually have many different 1384 00:54:29,440 --> 00:54:31,680 measures, 1385 00:54:31,680 --> 00:54:33,440 right? You might think, well, accuracy is 1386 00:54:33,440 --> 00:54:35,400 just accuracy, right? It's 1387 00:54:35,400 --> 00:54:36,880 just either accurate or not 1388 00:54:36,880 --> 00:54:39,319 accurate, right? But actually it's not 1389 00:54:39,319 --> 00:54:41,359 that simple. There are many different 1390 00:54:41,359 --> 00:54:43,839 ways to measure the accuracy of a 1391 00:54:43,839 --> 00:54:45,480 classification model, and these are some 1392 00:54:45,480 --> 00:54:48,280 of the more common ones.
So, for example, 1393 00:54:48,280 --> 00:54:51,000 the confusion matrix tells us how many 1394 00:54:51,000 --> 00:54:54,000 true positives there are, meaning the value is 1395 00:54:54,000 --> 00:54:55,880 positive and the prediction is positive; how 1396 00:54:55,880 --> 00:54:57,520 many false positives, which means the 1397 00:54:57,520 --> 00:54:59,040 value is negative but the machine learning 1398 00:54:59,040 --> 00:55:01,839 model predicts positive; how many false 1399 00:55:01,839 --> 00:55:03,839 negatives, which means that the machine 1400 00:55:03,839 --> 00:55:05,559 learning model predicts negative, but 1401 00:55:05,559 --> 00:55:07,480 it's actually positive; and how many true 1402 00:55:07,480 --> 00:55:09,359 negatives there are, which means that the 1403 00:55:09,359 --> 00:55:11,240 machine learning model 1404 00:55:11,240 --> 00:55:12,880 predicts negative and the true value is 1405 00:55:12,880 --> 00:55:14,760 also negative. So this is called a 1406 00:55:14,760 --> 00:55:16,920 confusion matrix. This is one way we 1407 00:55:16,920 --> 00:55:19,480 assess or evaluate the performance of a 1408 00:55:19,480 --> 00:55:20,520 classification model, 1409 00:55:20,520 --> 00:55:23,319 okay? This is for binary 1410 00:55:23,319 --> 00:55:24,680 classification; we can also have a 1411 00:55:24,680 --> 00:55:26,880 multiclass confusion matrix, 1412 00:55:26,880 --> 00:55:29,000 and then we can also measure things like 1413 00:55:29,000 --> 00:55:31,720 accuracy. So accuracy is the true 1414 00:55:31,720 --> 00:55:34,079 positives plus the true negatives, which 1415 00:55:34,079 --> 00:55:35,440 is the total number of correct 1416 00:55:35,440 --> 00:55:37,839 predictions made by the model, divided by 1417 00:55:37,839 --> 00:55:39,839 the total number of data points in your 1418 00:55:39,839 --> 00:55:42,599 dataset. And then you also have other 1419 00:55:42,599 --> 00:55:43,150 kinds of 1420 00:55:43,150 --> 00:55:46,599 measures such as recall. And this is the 1421 00:55:46,599 --> 00:55:49,160 formula for recall, this is the formula for 1422 00:55:49,160 --> 00:55:51,480 the F1 score, okay? And then there's 1423 00:55:51,480 --> 00:55:55,559 something called the ROC curve, right? So 1424 00:55:55,559 --> 00:55:57,039 without going too much into the detail of 1425 00:55:57,039 --> 00:55:59,000 what each of these entails, essentially 1426 00:55:59,000 --> 00:56:00,640 these are all different ways, these are 1427 00:56:00,640 --> 00:56:03,280 different KPIs, right? Just like if you 1428 00:56:03,280 --> 00:56:06,119 work in a company, you have different KPIs, 1429 00:56:06,119 --> 00:56:08,079 right? Certain employees have certain KPIs 1430 00:56:08,079 --> 00:56:11,280 that measure how good or how, you 1431 00:56:11,280 --> 00:56:13,200 know, efficient or how effective a 1432 00:56:13,200 --> 00:56:15,502 particular employee is, right? So the 1433 00:56:15,502 --> 00:56:19,880 KPIs for your machine learning models 1434 00:56:19,880 --> 00:56:24,240 are the ROC curve, F1 score, recall, accuracy, 1435 00:56:24,240 --> 00:56:26,599 okay, and your confusion matrix 1436 00:56:26,599 --> 00:56:29,839 (a tiny worked example of these 1437 00:56:29,839 --> 00:56:33,359 metrics follows 1438 00:56:33,359 --> 00:56:35,240 below).
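Here is that tiny worked example of these metrics, on made-up labels, so the formulas are concrete:

from sklearn.metrics import confusion_matrix, accuracy_score, recall_score, f1_score

y_true = [0, 0, 0, 1, 1, 0, 1, 0]  # actual labels: 1 = failure, 0 = no failure
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]  # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                  # 4 true negatives, 1 false positive, 1 false negative, 2 true positives
print(accuracy_score(y_true, y_pred))  # (tp + tn) / total = 6/8 = 0.75
print(recall_score(y_true, y_pred))    # tp / (tp + fn) = 2/3, the share of real failures we caught
print(f1_score(y_true, y_pred))        # harmonic mean of precision and recall, here 2/3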
So fundamentally, after I've built these four 1439 00:56:35,240 --> 00:56:37,640 different models, I'm going to check and 1440 00:56:37,640 --> 00:56:39,680 evaluate them using all those different 1441 00:56:39,680 --> 00:56:42,440 metrics like, for example, the F1 score, 1442 00:56:42,440 --> 00:56:44,839 the precision score, the recall score, all 1443 00:56:44,839 --> 00:56:47,319 right. So for this model, I can check out 1444 00:56:47,319 --> 00:56:50,039 the ROC score, the F1 score, the precision 1445 00:56:50,039 --> 00:56:52,119 score, the recall score. Then for this 1446 00:56:52,119 --> 00:56:54,799 model, this is the ROC score, the F1 score, 1447 00:56:54,799 --> 00:56:56,839 the precision score, the recall score. 1448 00:56:56,839 --> 00:56:59,680 Then for this model, and so on. So for 1449 00:56:59,680 --> 00:57:03,240 every single model I've created using my 1450 00:57:03,240 --> 00:57:05,839 training data set, I will have my whole set 1451 00:57:05,839 --> 00:57:08,000 of evaluation metrics that I can use to 1452 00:57:08,000 --> 00:57:11,839 evaluate how good this model is, okay? 1453 00:57:11,839 --> 00:57:13,119 Same thing here, I've got a confusion 1454 00:57:13,119 --> 00:57:15,079 matrix here, right, so I can use that, 1455 00:57:15,079 --> 00:57:18,119 again, to evaluate between all these four 1456 00:57:18,119 --> 00:57:20,200 different models, and then I, kind of, 1457 00:57:20,200 --> 00:57:22,240 summarize it up here. So we can see from 1458 00:57:22,240 --> 00:57:25,440 this summary here the top 1459 00:57:25,440 --> 00:57:27,599 two models, and, 1460 00:57:27,599 --> 00:57:29,440 as a data scientist, I'm now 1461 00:57:29,440 --> 00:57:31,119 going to just focus on these two models. 1462 00:57:31,119 --> 00:57:33,440 So these two models are the bagging 1463 00:57:33,440 --> 00:57:36,000 classifier and the random forest classifier. 1464 00:57:36,000 --> 00:57:38,480 They have the highest values of the F1 score 1465 00:57:38,480 --> 00:57:40,480 and the highest values of the ROC curve 1466 00:57:40,480 --> 00:57:42,640 score, okay? So we can say these are the 1467 00:57:42,640 --> 00:57:45,839 top two models in terms of accuracy, okay, 1468 00:57:45,839 --> 00:57:48,920 using the F1 evaluation metric and the 1469 00:57:48,920 --> 00:57:53,720 ROC AUC evaluation metric, okay?
So these 1470 00:57:53,720 --> 00:57:57,480 results are, kind of, summarized here, and 1471 00:57:57,480 --> 00:57:59,079 then we use different sampling 1472 00:57:59,079 --> 00:58:00,880 techniques, okay? So just now I talked 1473 00:58:00,880 --> 00:58:03,680 about different kinds of sampling 1474 00:58:03,680 --> 00:58:06,400 techniques, and the idea of the different 1475 00:58:06,400 --> 00:58:08,319 kinds of sampling techniques is to just 1476 00:58:08,319 --> 00:58:11,319 get a different feel for the different 1477 00:58:11,319 --> 00:58:13,720 distributions of the data in different 1478 00:58:13,720 --> 00:58:16,359 areas of your data set, so that you 1479 00:58:16,359 --> 00:58:20,000 can make sure that your 1480 00:58:20,000 --> 00:58:22,799 evaluation of accuracy is actually 1481 00:58:22,799 --> 00:58:27,079 statistically sound, right? So we can 1482 00:58:27,079 --> 00:58:29,599 do what is called oversampling and 1483 00:58:29,599 --> 00:58:30,880 undersampling, which is very useful when 1484 00:58:30,880 --> 00:58:32,280 you're working with an imbalanced data 1485 00:58:32,280 --> 00:58:35,039 set, so this is an example of doing that. And 1486 00:58:35,039 --> 00:58:37,240 then here we again check out the 1487 00:58:37,240 --> 00:58:38,799 results for all these different 1488 00:58:38,799 --> 00:58:41,680 techniques. We use the F1 score and the AUC 1489 00:58:41,680 --> 00:58:43,599 score, all right, these are the two key 1490 00:58:43,599 --> 00:58:46,760 measures of accuracy, right? And then 1491 00:58:46,760 --> 00:58:47,920 we can check out the scores for the 1492 00:58:47,920 --> 00:58:50,480 different approaches, okay? So we can see, 1493 00:58:50,480 --> 00:58:53,119 oh well, overall the models have a lower 1494 00:58:53,119 --> 00:58:55,720 ROC AUC score, but they have a much 1495 00:58:55,720 --> 00:58:58,280 higher F1 score; the bagging classifier 1496 00:58:58,280 --> 00:59:00,839 had the highest ROC AUC score, 1497 00:59:00,839 --> 00:59:04,119 but its F1 score was too low, okay? Then, in 1498 00:59:04,119 --> 00:59:06,520 the data scientist's opinion, the random 1499 00:59:06,520 --> 00:59:08,520 forest with this particular technique of 1500 00:59:08,520 --> 00:59:10,760 sampling has an equilibrium between the F1 1501 00:59:10,760 --> 00:59:14,480 score and the ROC AUC score. So takeaway one 1502 00:59:14,480 --> 00:59:16,680 is that the macro F1 score improves 1503 00:59:16,680 --> 00:59:18,480 dramatically using the sampling 1504 00:59:18,480 --> 00:59:20,160 techniques, so these models might be better 1505 00:59:20,160 --> 00:59:22,440 compared to the balanced ones, all right? 1506 00:59:22,440 --> 00:59:26,280 So based on all this evaluation, the 1507 00:59:26,280 --> 00:59:27,680 data scientist says they're going to 1508 00:59:27,680 --> 00:59:29,920 continue to work with these two models, 1509 00:59:29,920 --> 00:59:31,440 all right, and the balanced bagging one, 1510 00:59:31,440 --> 00:59:33,079 and then continue to make further 1511 00:59:33,079 --> 00:59:35,039 comparisons, all right? So then we 1512 00:59:35,039 --> 00:59:37,079 continue to keep refining our 1513 00:59:37,079 --> 00:59:38,599 evaluation work. Here we're going to 1514 00:59:38,599 --> 00:59:41,000 train the models one more time, so 1515 00:59:41,000 --> 00:59:43,039 we again do a train-test split, and 1516 00:59:43,039 --> 00:59:44,799 then we do that for this 1517 00:59:44,799 --> 00:59:47,039 particular 1518 00:59:47,039 --> 00:59:48,200 model.
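As an aside, a minimal sketch of that oversampling and undersampling, using the imbalanced-learn package (an assumption about the notebook's tooling; X_train and y_train come from the earlier split):

from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import TomekLinks

# Oversampling: synthesize extra minority-class (failure) points near the class boundary
X_over, y_over = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)

# Undersampling: remove Tomek links, i.e. ambiguous majority-class points at the boundary
X_under, y_under = TomekLinks().fit_resample(X_train, y_train)

print(len(y_train), len(y_over), len(y_under))  # compare the resampled sizes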
And then we print out what is called a 1519 00:59:48,200 --> 00:59:50,960 classification report, and this is 1520 00:59:50,960 --> 00:59:53,400 basically a summary of all those metrics 1521 00:59:53,400 --> 00:59:55,359 that I talked about just now. So just now, 1522 00:59:55,359 --> 00:59:57,520 remember, I said there were 1523 00:59:57,520 --> 00:59:59,680 several evaluation metrics, right? So 1524 00:59:59,680 --> 01:00:01,480 we had the confusion matrix, the 1525 01:00:01,480 --> 01:00:04,119 accuracy, the precision, the recall, the AUC 1526 01:00:04,119 --> 01:00:08,119 score. So here, with the classification 1527 01:00:08,119 --> 01:00:09,880 report, I can get a summary of all of 1528 01:00:09,880 --> 01:00:11,760 that, so I can see all the values here, 1529 01:00:11,760 --> 01:00:14,640 okay, for this particular model, bagging with 1530 01:00:14,640 --> 01:00:17,160 Tomek links. And then I can do that for 1531 01:00:17,160 --> 01:00:18,640 another model, the random forest with 1532 01:00:18,640 --> 01:00:20,599 borderline SMOTE, and then I can do that 1533 01:00:20,599 --> 01:00:22,200 for another model, which is the balanced 1534 01:00:22,200 --> 01:00:25,160 bagging. So again, we see a lot of 1535 01:00:25,160 --> 01:00:27,079 comparison between different models, 1536 01:00:27,079 --> 01:00:28,640 trying to figure out what all these 1537 01:00:28,640 --> 01:00:30,720 evaluation metrics are telling us, all 1538 01:00:30,720 --> 01:00:32,960 right? Then again, we have a confusion 1539 01:00:32,960 --> 01:00:35,880 matrix, so we generate a confusion matrix 1540 01:00:35,880 --> 01:00:38,880 for the bagging with the Tomek links 1541 01:00:38,880 --> 01:00:40,720 undersampling, for the random forest 1542 01:00:40,720 --> 01:00:42,680 with the borderline SMOTE oversampling, 1543 01:00:42,680 --> 01:00:44,960 and for just balanced bagging by itself. Then, 1544 01:00:44,960 --> 01:00:47,720 again, we compare between these three 1545 01:00:47,720 --> 01:00:50,799 models using the confusion matrix 1546 01:00:50,799 --> 01:00:52,599 evaluation metric, and then we can kind 1547 01:00:52,599 --> 01:00:55,680 of come to some conclusions, all right? So 1548 01:00:55,680 --> 01:00:58,160 now we've looked at all the data, 1549 01:00:58,160 --> 01:01:01,200 then we move on and look at 1550 01:01:01,200 --> 01:01:03,160 another kind of evaluation metric, which 1551 01:01:03,160 --> 01:01:06,720 is the ROC score, right? So this is one of 1552 01:01:06,720 --> 01:01:08,680 the other evaluation metrics I talked 1553 01:01:08,680 --> 01:01:11,200 about. So this one is a kind of curve; 1554 01:01:11,200 --> 01:01:12,520 you look at it to see the area 1555 01:01:12,520 --> 01:01:14,359 underneath the curve; this is called the 1556 01:01:14,359 --> 01:01:18,079 ROC AUC, the 1557 01:01:18,079 --> 01:01:19,880 area under the curve, all right? So the 1558 01:01:19,880 --> 01:01:21,839 area under the curve 1559 01:01:21,839 --> 01:01:24,319 score will give us some idea about the 1560 01:01:24,319 --> 01:01:25,599 threshold that we're going to use for 1561 01:01:25,599 --> 01:01:27,680 classification. So we can examine this 1562 01:01:27,680 --> 01:01:29,200 for the bagging classifier, for the 1563 01:01:29,200 --> 01:01:30,960 random forest classifier, for the balanced 1564 01:01:30,960 --> 01:01:33,599 bagging classifier, okay? Then we can also 1565 01:01:33,599 --> 01:01:36,200 again do that. Finally, we can check 1566 01:01:36,200 --> 01:01:37,880 the classification report of this 1567 01:01:37,880 --> 01:01:39,680 particular model.
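A minimal sketch of printing that report and the ROC AUC score for one trained model (model, X_test, and y_test as in the earlier sketches):

from sklearn.metrics import classification_report, roc_auc_score

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))  # precision, recall, F1 per class in one table

# ROC AUC is computed from the predicted probability of the failure class
y_proba = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, y_proba))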
So we keep doing this 1568 01:01:39,680 --> 01:01:43,200 over and over again, evaluating the 1569 01:01:43,200 --> 01:01:45,720 metrics, the accuracy metrics, the 1570 01:01:45,720 --> 01:01:46,880 evaluation metrics, for all these 1571 01:01:46,880 --> 01:01:48,880 different models. So we keep doing this 1572 01:01:48,880 --> 01:01:50,520 over and over again for different 1573 01:01:50,520 --> 01:01:53,440 decision thresholds for classification, and so, 1574 01:01:53,440 --> 01:01:56,880 as we keep drilling into these, we kind 1575 01:01:56,880 --> 01:02:00,839 of get more and more understanding of 1576 01:02:00,839 --> 01:02:02,799 all these different models, and which one is 1577 01:02:02,799 --> 01:02:04,760 the best one that gives the best 1578 01:02:04,760 --> 01:02:08,520 performance for our data set, okay? So 1579 01:02:08,520 --> 01:02:11,440 finally we come to this conclusion: this 1580 01:02:11,440 --> 01:02:13,520 particular model is not able to push 1581 01:02:13,520 --> 01:02:15,279 the recall on failures beyond 1582 01:02:15,279 --> 01:02:17,520 95.8%; on the other hand, balanced bagging 1583 01:02:17,520 --> 01:02:19,400 with a decision threshold of 0.6 is able 1584 01:02:19,400 --> 01:02:21,520 to get a better recall, 1585 01:02:21,520 --> 01:02:25,319 etc. So finally, after having done all of 1586 01:02:25,319 --> 01:02:27,480 these evaluations, 1587 01:02:27,480 --> 01:02:31,119 okay, this is the conclusion. 1588 01:02:31,119 --> 01:02:33,960 So right now we 1589 01:02:33,960 --> 01:02:35,279 have gone through all the steps of the 1590 01:02:35,279 --> 01:02:37,760 machine learning life cycle, which 1591 01:02:37,760 --> 01:02:40,240 means we have right now, or the data 1592 01:02:40,240 --> 01:02:41,960 scientist right now has, gone through all 1593 01:02:41,960 --> 01:02:43,000 these 1594 01:02:43,000 --> 01:02:47,079 steps: we have done the 1595 01:02:47,079 --> 01:02:48,640 validation, we have done the cleaning, 1596 01:02:48,640 --> 01:02:50,559 exploration, preparation, transformation, 1597 01:02:50,559 --> 01:02:52,599 the feature engineering, we have developed 1598 01:02:52,599 --> 01:02:54,359 and trained multiple models, we have 1599 01:02:54,359 --> 01:02:56,480 evaluated all these different models. So 1600 01:02:56,480 --> 01:02:58,599 right now we have reached this stage, and 1601 01:02:58,599 --> 01:03:02,720 at this stage we, as the data scientist, 1602 01:03:02,720 --> 01:03:05,480 have, kind of, completed our job, so we've 1603 01:03:05,480 --> 01:03:08,119 come to some very useful conclusions 1604 01:03:08,119 --> 01:03:09,640 which we can now share with our 1605 01:03:09,640 --> 01:03:13,240 colleagues, all right? And based on these 1606 01:03:13,240 --> 01:03:15,400 conclusions or recommendations, 1607 01:03:15,400 --> 01:03:17,160 somebody is going to choose an 1608 01:03:17,160 --> 01:03:19,160 appropriate model, and that model is 1609 01:03:19,160 --> 01:03:22,640 going to get deployed for real-time use 1610 01:03:22,640 --> 01:03:25,319 in a real life production environment, 1611 01:03:25,319 --> 01:03:27,240 okay? And that decision is going to be 1612 01:03:27,240 --> 01:03:29,359 made based on the recommendations coming 1613 01:03:29,359 --> 01:03:30,880 from the data scientist at the end of 1614 01:03:30,880 --> 01:03:33,480 this phase, okay? So at the end of this 1615 01:03:33,480 --> 01:03:35,079 phase, the data scientist is going to 1616 01:03:35,079 --> 01:03:36,880 come up with these conclusions. 1617
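A minimal sketch of moving that decision threshold away from the default 0.5 (the 0.6 value mirrors the one mentioned above; model, X_test, and y_test as before):

from sklearn.metrics import precision_score, recall_score

# Classify as "failure" only when the predicted failure probability exceeds 0.6
y_proba = model.predict_proba(X_test)[:, 1]
y_pred_custom = (y_proba >= 0.6).astype(int)

print("precision:", precision_score(y_test, y_pred_custom))
print("recall:", recall_score(y_test, y_pred_custom))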
The data scientist's conclusions look like this. If the engineering team is looking for the highest possible failure detection rate, they should go with this particular model. If they want a balance between precision and recall, they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold. But if they don't care so much about predicting every single failure and instead want the highest precision possible, they should opt for the bagging with Tomek links classifier with a somewhat higher decision threshold. This is the key deliverable from the data scientist, the key takeaway, the end result of the entire machine learning life cycle. The data scientist tells the engineering team: which is more important to you, point A, point B, or point C? Make your decision. The engineering team will then discuss among themselves and say: what we want is the highest failure detection rate possible, because any failure of a machine or of the product on the assembly line is really going to hurt us badly. We are looking for the model that gives us the highest failure detection rate; we don't care so much about precision, but we want to be sure that if there is a failure, we are going to catch it. And so the data scientist will say: go for the balanced bagging model. The data scientist then saves this model, and once it is saved, you can go right ahead and deploy it to production.
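That "saves this model" step is typically a one-liner with joblib. A minimal sketch, assuming the balanced bagging model from earlier (the file name and synthetic training data are illustrative):

```python
# Sketch: persisting the chosen model so it can be deployed later.
import joblib
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedBaggingClassifier

X, y = make_classification(n_samples=5000, weights=[0.97], random_state=42)
model = BalancedBaggingClassifier(random_state=42).fit(X, y)

# Serialize the fitted model to disk...
joblib.dump(model, "balanced_bagging_failure_model.joblib")

# ...and later, e.g. on the production server, load it back:
restored = joblib.load("balanced_bagging_failure_model.joblib")
print(restored.predict(X[:5]))
```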
If we want to continue, we can actually take this modeling problem further. Just now I modeled this problem as a binary classification problem, meaning the outcome is either zero or one, fail or not fail. But we can also model it as a multiclass classification problem, because, as I said earlier, the failure type column actually contains multiple kinds of failures: for example, a power failure, a tool wear failure, or an overstrain failure. So we can model the problem slightly differently, as a multiclass classification problem, and then go through the same end-to-end process as before. We create different models and test them out, but now the confusion matrix is for a multiclass classification setting. We again try different algorithms and models, for example a plain random forest and a balanced random forest tuned with a grid search, do the train-test split, train the models using what is called hyperparameter tuning, and get the same evaluation scores as before. We check those scores, compare the models, and generate a confusion matrix, this time a multiclass confusion matrix, and then come to a final conclusion. So if you are interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist: the data scientist will say, I am going to pick this particular model, the balanced bagging classifier, and here are all the reasons given as the rationale for selecting this particular model.
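A hedged sketch of that multiclass variant is below. The failure-type labels and the parameter grid are illustrative, but the pattern, grid-search hyperparameter tuning followed by a multiclass confusion matrix, matches the process described:

```python
# Sketch: the same workflow reframed as multiclass classification,
# with grid-search hyperparameter tuning.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Stand-in for the failure-type target, e.g. 0 = no failure,
# 1 = power failure, 2 = tool wear failure, 3 = overstrain failure.
X, y = make_classification(n_samples=5000, n_classes=4, n_informative=6,
                           weights=[0.91, 0.03, 0.03, 0.03],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="f1_macro",  # macro-averaged F1 treats all classes equally
    cv=5,
)
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)  # uses the best model found by the search
print("best params:", grid.best_params_)
print(classification_report(y_test, y_pred, digits=3))
print(confusion_matrix(y_test, y_pred))  # now a 4x4 multiclass matrix
```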
Once that's done, you save the model, and that's it. The machine learning model can now be put live, running on a server, ready to work, which means ready to generate predictions; that is the main job of a machine learning model. You have picked the best model, with the best evaluation metrics for whatever accuracy goal you are trying to achieve, and now you run it on a server. All the real-time data coming from your sensors is pumped into the machine learning model, the model pumps out a whole bunch of predictions, and you use those predictions in real time for real-world decision making. You might say: I am predicting that this machine is going to fail on Thursday at 5:00 p.m., so you had better get your service folks in to service it on Thursday at 2:00 p.m.
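Reduced to its bare bones, that serving step might look like the sketch below: load the saved model and score each incoming sensor reading against the agreed decision threshold. The feature names and the threshold are illustrative, and the training is faked inline with synthetic data only so that the sketch runs end to end.

```python
# Sketch of the serving side: saved model + incoming sensor readings
# -> failure predictions.
import joblib
import pandas as pd
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedBaggingClassifier

# Training and saving would normally have happened earlier in the
# notebook; we fake it here so this sketch is self-contained.
FEATURES = ["air_temp", "process_temp", "rotational_speed",
            "torque", "tool_wear"]  # illustrative feature names
X, y = make_classification(n_samples=2000, n_features=5,
                           weights=[0.95], random_state=42)
model = BalancedBaggingClassifier(random_state=42).fit(
    pd.DataFrame(X, columns=FEATURES), y)
joblib.dump(model, "failure_model.joblib")

# --- production server side ---
model = joblib.load("failure_model.joblib")
THRESHOLD = 0.6  # decision threshold agreed with the engineering team

def predict_failure(reading: dict) -> bool:
    """Return True if the model predicts an imminent failure."""
    row = pd.DataFrame([reading], columns=FEATURES)
    return model.predict_proba(row)[0, 1] >= THRESHOLD

# Called for each new reading streaming in from the machines:
reading = dict(zip(FEATURES, [300.1, 310.2, 1500.0, 40.2, 108.0]))
if predict_failure(reading):
    print("Failure predicted - schedule maintenance before it happens.")
```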
Or whenever; the point is that you can make decisions about when to do your maintenance, and make the best decisions to optimize the cost of maintenance, and so on. Then, based on the results coming out of the predictions, which may be good, lousy, or just average, we are constantly monitoring how good and how useful the predictions generated by this real-time model running on the server actually are. Based on that monitoring, we take in some new data and repeat the entire life cycle again. So this is an iterative workflow: the data scientist is constantly taking in new data points, refining the model, perhaps picking a new model, deploying the new model onto the server, and so on. And that, in a nutshell, is your machine learning workflow. For this particular project we used a number of Python data science libraries: pandas, the most basic data science library, which provides all the tools for working with raw data; NumPy, a high-performance library for complex array and matrix operations; Matplotlib and Seaborn, which are used during the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data; and scikit-learn, the machine learning library that implements all the core machine learning algorithms.
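To recap that stack in one place, a typical import block for a project like this looks something like the following (the aliases are the conventional ones):

```python
import pandas as pd              # working with raw tabular data
import numpy as np               # high-performance array/matrix operations
import matplotlib.pyplot as plt  # plotting during exploratory data analysis
import seaborn as sns            # statistical plots on top of Matplotlib
from sklearn.ensemble import RandomForestClassifier  # core ML algorithms...
from sklearn.metrics import classification_report    # ...and their evaluation
```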
We did not use any deep learning libraries here, because this is not a deep learning problem; but if you are working on a deep learning problem, such as image classification, image recognition, object detection, or natural language processing and text classification, then you would use the Python libraries TensorFlow and PyTorch. And lastly, the whole data science project that you just saw was developed in something called a Jupyter notebook: all of the Python code, along with all of the observations from the data scientist, was written and run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects. So that brings me to the end of this presentation. I hope you found it useful and that you can now appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. Thank you all so much for watching.