Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI, or machine learning, which is predictive maintenance. I will be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each of them. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is essentially a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance domain problem. All right, so without any further ado, let's jump into it.

Let's start off with a quick preview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so really, when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention. The primary purpose of these algorithms is to optimize performance in a specific task, and the primary task you want to optimize performance in is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine learning is about making predictions about the future, or what we call predictive analytics. And there are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning.
And here we can see some of the different kinds of algorithms and their use cases in various areas of industry. We have various domain use cases for all these different kinds of algorithms, and we can see that different algorithms are suited to different use cases.

Deep learning is an advanced form of machine learning that's based on something called an artificial neural network, or ANN for short, which essentially simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools that you have probably heard of today. I'm sure you have heard of ChatGPT if you haven't been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and that's based on this technology called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, use this particular form of machine learning called deep learning. So this is an example of an artificial neural network. Here I have an image of a bird that's fed into this artificial neural network, and the output from the network is a classification of this image into one of these three potential categories. In this case, if the ANN has been trained properly and we feed in this image, the ANN should correctly classify it as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision. And just as in the case of machine learning, there is a variety of algorithms available for deep learning, under the categories of supervised learning and unsupervised learning.

All right, so this is how we can categorize all of this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a subspecialization of machine learning using a particular architecture called an artificial neural network. And generative AI, so if we talk about ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI, they are basically large language models, and they are a further subcategory within the area of deep learning. And there are many applications of machine learning in industry right now, so pick whichever industry you are involved in, and these are the specific areas of application. I'm going to guess the vast majority of you watching this video are probably coming from the manufacturing industry, and in the manufacturing industry some of the standard use cases for machine learning and deep learning are predicting potential problems, which we sometimes call predictive maintenance, where you want to predict when a problem is going to happen and address it before it happens; monitoring systems; automating your manufacturing assembly line or production line; smart scheduling; and detecting anomalies on your production line.

Okay, so let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: it is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. It uses advanced data models, analytics, and machine learning so that we can reliably assess when failures are more likely to occur, including which components are more likely to be affected on your production or assembly line. So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that factories, or the production and assembly lines in factories, tended to handle maintenance issues, say, 10 or 20 years ago.
What you would probably start off with is the most basic mode, which is reactive maintenance: you just wait until your machine breaks down and then you repair it. The simplest, but of course, if you have worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline. Then you're going to have a backlog of orders and you're going to run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem then is that sometimes you're doing more maintenance than is really necessary, and it still doesn't totally prevent a machine failure that occurs outside of your planned maintenance. So a bit of an improvement, but not that much better. And then these last two categories are where we bring in AI and machine learning. With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us to predict, with a reasonable level of accuracy, when the next failure is going to happen on your assembly or production line for a specific component or a specific machine. You want to predict it to a high level of accuracy, maybe down to the specific day, even the specific hour or minute, when you expect that particular product or machine to fail.

All right, so these are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data that you're getting.

Okay, so we're going to take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them, you'll be able to see some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases, so you can click on these links and follow up with them if you want to read more: waste management, manufacturing, building services, renewable energy, and also mining. If you want to know more about these use cases, you can read up on them from this website. And this next website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance. Here it describes an industry survey of predictive maintenance. We can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, and that predictive maintenance is essential for the manufacturing industry and will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: the vast majority of key industry players in the manufacturing sector consider predictive maintenance to be a very important activity that they want to incorporate into their workflow. And we can see here the kind of return we can expect on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. And best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes.
So we have PepsiCo with Frito-Lay, General Motors, Mondi, and Ecoplant. You can jump over here and take a look at some of these use cases. Let me try to open one up, for example, Mondi. You can see that Mondi has used this particular piece of software called MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. You can study how they have used it and how it works: what their challenge was, the problems they were facing, the solution they built with MathWorks Consulting, and the data they collected in an Oracle database. Using MATLAB from MathWorks, they were able to create a deep learning model to solve this particular issue for their domain. So if you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases of predictive maintenance.

Now in this topic, I'm going to talk about machine learning basics, what is actually involved in machine learning, and I'm going to give a very quick, conceptual, high-level overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category of machine learning, which is called supervised learning. The particular use case I'm discussing here, predictive maintenance, is basically a form of supervised learning. So how does supervised learning work? Well, in supervised learning, you create a machine learning model by providing what is called a labelled data set as input to a machine learning program or algorithm. This data set contains what are called independent or feature variables, so this will be a set of variables.
And there will be one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process that I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's jump to a diagram that illustrates this more clearly. Let's say you have a data set here, an Excel spreadsheet, and this spreadsheet has a bunch of columns and a bunch of rows. These rows are what we call observations, or samples, or data points in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall, so they've got all this information about the customers who purchase products at this mall. Some of the information they have about the customers is their gender, their age, their income, and their number of children. All this information about the customers is what we call the independent or feature variables. And based on all this information about each customer, we also record how much the customer spends. These numbers are what we call the target variable or the dependent variable. So a single row, one single sample or data point, contains all the values for the feature variables and one single value for the label or target variable. And the primary purpose of the machine learning model is to create a mapping from all your feature variables to your target variable: somehow there's going to be a mathematical function that maps all the values of your feature variables to the value of your target variable.
In other words, this function represents the relationship between your feature variables and your target variable. This whole training process is what we call fitting the model. And the target variable or label, this column here, is critical for providing the context to do the fitting or training of the model. Once you've got a trained and fitted model, you can then use it to make an accurate prediction of the target values corresponding to new feature values that the model has yet to encounter, and this, as I've said earlier, is called predictive analytics. So let's see what's actually happening here. You take your training data, this whole data set consisting of a thousand or ten thousand rows, you feed the entire thing into your machine learning algorithm, and a couple of hours later the algorithm comes up with a model. The model is essentially a function that maps all your feature variables, which are these four columns here, to your target variable, which is this one single column here. Once you have the model, you can put in a new data point. The new data point represents a new customer, a customer you have never seen before. Let's say you've already got information about 10,000 customers who have visited this mall and how much each of them spent while they were there. Now a totally new customer comes into the mall, someone who has never been here before, and what we know about this customer is that he is male, the age is 50, the income is 18, and he has nine children. When you take this data and feed it into your model, your model is going to make a prediction. It's going to say: based on everything I have been trained on and the model I've developed, I predict that a male customer of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want. That is the final output of your machine learning model: it makes a prediction about something it has never seen before. That is essentially the core of machine learning: predictive analytics, making predictions about the future based on a historical data set.

Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label. For classification you can have either binary or multiclass. Binary would be just true or false, zero or one: whether your machine is going to fail or not, just two classes, two possible outcomes, or whether the customer is going to make a purchase or not. We call this binary classification. Multiclass is when there are more than two classes or types of values. So, for example, this here would be a classification problem: you have a data set with information about your customers, the gender of the customer, the age of the customer, the salary of the customer, and you also have a record of whether the customer made a purchase or not. You can take this data set to train a classification model, and the classification model can then make a prediction about a new customer: it will predict zero, which means the customer didn't make a purchase, or one, which means the customer made a purchase.
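To make that train-then-predict loop concrete, here is a minimal sketch in Python using pandas and scikit-learn. The table, column names, and values are made up purely for illustration; they are not the actual data set from the slides.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A tiny labelled data set: three feature variables and one target variable ("purchased").
data = pd.DataFrame({
    "gender":    [0, 1, 0, 1, 1, 0, 1, 0],          # 0 = female, 1 = male (already encoded as numbers)
    "age":       [25, 50, 33, 41, 22, 60, 38, 29],
    "salary":    [3000, 8000, 4500, 7000, 2500, 9000, 5200, 3100],
    "purchased": [0, 1, 0, 1, 0, 1, 1, 0],           # the label: 1 = made a purchase, 0 = did not
})

X = data[["gender", "age", "salary"]]                # feature variables
y = data["purchased"]                                # target variable (label)

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)                                      # "fitting" / training the model

# A brand-new customer the model has never seen before.
new_customer = pd.DataFrame({"gender": [1], "age": [45], "salary": [6000]})
print(model.predict(new_customer))                   # e.g. [1] -> predicted to make a purchase
```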
And here is regression. Let's say you want to predict the wind speed, and you've got historical data for four other independent or feature variables: you have recorded the temperature, the pressure, the relative humidity, and the wind direction for the past 10 or 15 days. Now you are going to train your machine learning model using this data set, and the target variable column, the label, is basically a number. With a numerical target like this you get a regression model, and now you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will predict the wind speed for that new data point. So that's a regression model.

All right. In this topic I'm going to talk about the workflow that's involved in machine learning. In the previous slides, I talked about developing the model, but that's just one part of the entire workflow. In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is that you need to get your data, then you need to clean your data, and then you need to explore your data; you need to see what's going on in your data set. Real-life data sets are not trivial: they are hundreds, thousands, sometimes millions or even billions of rows, especially if you're using IoT sensors to capture data in real time. So you've got all these very large data sets, you need to clean them and explore them, and then you need to prepare them into the right format so that you can put them into the training process to create your machine learning model. Subsequently you check how good the model is: how accurate is the model in terms of its ability to generate predictions for the future, how accurate are the predictions coming out of your machine learning model. That's validating or evaluating your model. Then, if you determine that your model is of adequate accuracy to meet your domain use case requirements, say the required accuracy for your domain is 85%, and your machine learning model can give an 85% accuracy rate, you decide it's good enough and you deploy it into a real-world use case. Here the machine learning model gets deployed on a server, data is captured from other data sources and pumped into the machine learning model, the model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other particular scenario. And then you constantly monitor and update the model, you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell.

Here's another view of the same thing in a slightly different format. Again, you have your data collection and preparation. Here we talk more about the different kinds of algorithms that are available to create a model, and I'll cover this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. You are probably going to develop multiple models from multiple algorithms, evaluate them all, and then say: after I've evaluated and tested them, I've chosen the best model, and I'm going to deploy that model for real-life production use. Real-life sensor data is pumped into my model, my model generates predictions, the predicted data is used immediately in real time for real-life decision making, and then I monitor the results. Somebody is using the predictions from my model: if the predictions are lousy, the monitoring system captures that; if the predictions are fantastic, that is also captured by the monitoring system, and that gets fed back into the next cycle of my machine learning pipeline.
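As a rough sketch of that deploy-and-reuse step, here is one common pattern in Python: persist the chosen model to disk, then load it later in the service that receives live readings. The use of joblib, the file name, and the sensor columns are assumptions for illustration, not part of the original demo.

```python
import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the historical, prepared training data (illustrative values only).
X_train = pd.DataFrame({
    "rotational_speed": [1500, 1400, 1700, 1300],
    "torque":           [40, 60, 35, 70],
    "tool_wear":        [20, 210, 15, 230],
})
y_train = pd.Series([0, 1, 0, 1])                      # 0 = no failure, 1 = failure

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
joblib.dump(model, "maintenance_model.joblib")         # persist the chosen model

# ... later, in the production service that receives live sensor readings:
deployed = joblib.load("maintenance_model.joblib")
incoming = pd.DataFrame({"rotational_speed": [1650], "torque": [68], "tool_wear": [225]})
print(deployed.predict(incoming))                      # e.g. [1] -> flag this machine for maintenance
```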
Okay, so that's the overall view, and here are the key phases of the workflow. One of the important phases is called EDA, exploratory data analysis. In this phase you're going to do a lot of work, primarily just to understand your data set. Like I said, real-life data sets tend to be very complex and have various statistical properties; statistics is a very important component of machine learning. EDA helps you get an overview of your data set and of any problems in it, such as missing data, as well as its statistical properties, its distributions, the statistical correlations between variables, and so on.

Then we have data cleaning, or data cleansing. In this phase you primarily want to do things like remove duplicate records or rows in your table, make sure that your data points or samples have appropriate IDs, and, most importantly, make sure there are not too many missing values in your data set. What I mean by missing values is something like this: you have a data set, and for some reason there are some cells or locations in it with missing values. If you have a lot of these missing values, you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set and how to handle them. Another important part of data cleansing is finding the outliers in your data set. Outliers are data points that lie very far from the general trend of the data points in your data set. There are several ways to detect outliers and several ways to handle them, and similarly there are several ways to handle missing values. Handling missing values and handling outliers are really the two key concerns of data cleansing, and there are many, many techniques for this, so a data scientist needs to be acquainted with all of them.

All right, why do I need to do data cleansing? Here is the key point: if you have a very poor-quality data set, meaning you've got a lot of outliers that are errors, or a lot of missing values, then even with a fantastic algorithm and a fantastic model, the predictions your model gives will be absolutely rubbish. It's like pouring water into the tank of a Mercedes-Benz. The Mercedes-Benz is a great car, but if you put water into it, it will just die; it can't run on water. On the other hand, a Myvi is a cheap, basic car, but if you fill it with good, high-octane petrol, the Myvi will go at a hundred miles an hour and completely outrun the Mercedes-Benz. So it doesn't really matter which model you're using: you can be using the most fantastic model, the Mercedes-Benz of machine learning, but if your data is of lousy quality, your predictions are also going to be rubbish. Cleansing the data set is in fact probably the most important thing that data scientists need to do, and it's what they spend most of their time doing. Building the model, training the model, choosing the right algorithms and so on is really a small portion of the actual machine learning workflow; the vast majority of the time goes into cleaning and organizing your data.
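For a feel of what those cleaning checks look like in code, here is a short pandas sketch on a hypothetical sensor table; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sensor readings: one missing value per column, one suspicious reading.
df = pd.DataFrame({
    "temperature": [300.2, 301.0, None, 299.8, 305.5, 2999.0],
    "torque":      [40.1, 39.5, 41.2, None, 38.9, 40.3],
})

# Missing values: count them, then either drop the affected rows
# or replace them with the column average.
print(df.isna().sum())
dropped = df.dropna()
filled = df.fillna(df.mean(numeric_only=True))

# Outliers: a simple interquartile-range (IQR) rule on one column;
# other approaches (z-scores, isolation forests, ...) also exist.
q1, q3 = dropped["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (dropped["temperature"] < q1 - 1.5 * iqr) | (dropped["temperature"] > q3 + 1.5 * iqr)
print(dropped[is_outlier])   # the 2999.0 reading is flagged here
```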
Then you have something called feature engineering, where you pre-process the feature variables of your original data set prior to using them to train the model, either through the addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you need to transform categorical data into numeric data. In the earlier slides I showed you that you take your original data set, pump it into an algorithm, and a couple of hours later you get a machine learning model; you didn't do anything to the feature variables in your data set before feeding it into the machine learning algorithm. But that's not what generally happens in real life. In real life, you're going to take the original feature variables from your data set and transform them in some way. So you can see here, these are the columns of data from my original data set, and before I actually put these data points into my algorithm to train and get my model, I will transform them. The transformation of these feature variable values is what we call feature engineering, and there are many, many techniques for it: one-hot encoding, scaling, log transformation, discretization, date extraction, Boolean logic, and so on.
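Here is a brief sketch of a few of those transformations (one-hot encoding, a log transform, and scaling) using pandas and scikit-learn; the columns are placeholders loosely modelled on the kind of machine data discussed later, not the actual data set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# An illustrative table with one categorical and two numeric feature variables.
df = pd.DataFrame({
    "type":             ["L", "M", "H", "M"],        # categorical feature
    "rotational_speed": [1500, 1420, 1680, 1550],
    "tool_wear":        [10, 120, 45, 200],
})

# One-hot encoding: turn the categorical 'type' column into numeric 0/1 columns.
df = pd.get_dummies(df, columns=["type"])

# Log transformation: applied to the raw, positive-valued column to reduce skew.
df["log_tool_wear"] = np.log1p(df["tool_wear"])

# Scaling: put the numeric columns on a comparable scale (mean 0, standard deviation 1).
scaler = StandardScaler()
df[["rotational_speed", "tool_wear"]] = scaler.fit_transform(df[["rotational_speed", "tool_wear"]])

print(df)
```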
Then finally we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set and the other is called the test data set. The primary purpose of this is that when we fit and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you are not going to have only one possible model: there is a vast range of algorithms you can use to create a model, so fundamentally you have a wide range of choices. It's like the wide range of cars available if you want to buy a car: you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a BMW, many different cars. It's the same thing with a machine learning model: there is a vast variety of algorithms you can choose from in order to create a model. And once you create a model from a given algorithm, you need to ask how accurate that model is, and different algorithms are going to create different models with different rates of accuracy. So the primary purpose of the test data set is to evaluate the accuracy of the model, to see whether the model I've created using this algorithm is adequate for me to use in a real-life production use case. So that's what it's all about: this is my original data set, I break it into my feature variable columns and my target variable column, and then I further break it into a training data set and a test data set. The training data set is used to train and create the machine learning model, and once the machine learning model is created, I then use the test data set to evaluate the accuracy of the machine learning model.
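A minimal sketch of that split-train-evaluate loop with scikit-learn might look like the following; the data here is synthetic, purely to show the mechanics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # 200 samples, 4 feature variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # a made-up 0/1 target variable

# Hold back 20% of the data purely for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                      # train on the training set only

y_pred = model.predict(X_test)                   # evaluate on data the model has never seen
print("accuracy:", accuracy_score(y_test, y_pred))
```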
All right, and then finally we can see the different parts or aspects that go into a successful model: EDA is about 10%, data cleansing about 20%, feature engineering about 25%, selecting a specific algorithm about 10%, training the model from that algorithm about 15%, and finally evaluating the models and deciding which is the best model with the highest accuracy rate, that's about 20%.

All right, so we have reached the most interesting part of this presentation, which is the demonstration of an end-to-end machine learning workflow on a real-life data set that demonstrates the use case of predictive maintenance. For the data set for this particular use case, I've used a data set from Kaggle. For those of you who are not aware of it, Kaggle is the world's largest open-source community for data science and AI, and they have a large collection of data sets from various areas of industry and human endeavour, as well as a large collection of models that have been developed using these data sets. So here we have a data set for our particular use case, predictive maintenance. This is some information about the data set, and in case you do not know how to get there, this is the URL to click on. Once you are at the page for this data set, you can see all the information about it, and you can download the data set in CSV format.

So let's take a look at the data set. This data set has a total of 10,000 samples, and these are the feature variables: the type, the product ID, the air temperature, process temperature, rotational speed, torque, and tool wear, and this is the target variable. The target variable is what we are interested in: it is what we use to train the machine learning model and also what we are interested in predicting. The feature variables describe, or provide information about, a particular machine on the production or assembly line, so you might know the product ID, the type, the air temperature, process temperature, rotational speed, torque, and tool wear. Let's say you've got an IoT sensor system that's capturing all this data about a product or a machine on your production or assembly line, and you've also captured, for each specific sample, whether that sample experienced a failure or not. A target value of zero indicates that there's no failure, and we can see that the vast majority of data points in this data set are no-failure cases. Here we can see an example of a failure: a failure is marked as a one (positive) and no failure is marked as a zero (negative). Here we have one type of failure called a power failure, and if you scroll down the data set you see there are also other kinds of failures, like a tool wear failure, an overstrain failure, another power failure, and so on. So if you scroll through these 10,000 data points, or if you're familiar with using Excel to filter values in a column, you can see that in this particular column, the so-called target variable column, the vast majority of values are zero, which means no failure, and some of the rows have a value of one. And for those rows with a value of one, you are going to have different types of failure, like I said just now: power failure, tool wear failure, and so on.
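If you want to poke at the downloaded CSV yourself, a first look with pandas might resemble the following. The file name and the exact column names ("Target", "Failure Type", and so on) depend on the version of the Kaggle CSV you download, so treat them as placeholders.

```python
import pandas as pd

# Load the CSV downloaded from the Kaggle data set page.
df = pd.read_csv("predictive_maintenance.csv")

print(df.shape)          # roughly 10,000 rows and a handful of columns
print(df.head())         # product ID, type, temperatures, speed, torque, tool wear, target
print(df.isna().sum())   # quick check for missing values

# Class balance: the vast majority of rows should be "no failure" (0).
print(df["Target"].value_counts())
```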
Jupyter is a Python application that lets you create a Python machine learning program which builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it. Here we have a whole collection of Jupyter notebooks available, and you can select any one of them; all of these notebooks essentially process the data from this particular data set. On this Code page I have selected a specific notebook that I am going to walk through to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python programming language. The notebook I am going to use is this one here, and you can also get its URL from here.

Let's quickly do a quick revision of what we are trying to do: we are trying to build a machine learning classification model. We said there are two primary areas of supervised learning. One is regression, which is used to predict a numerical target variable, and the second is classification, which is what we are doing here: predicting a categorical target variable. In this particular example there are actually two ways we can frame the classification, either as binary classification or as multiclass classification. For binary classification we only classify the product or machine as either "it failed" or "it did not fail". If we go back to the data set I showed you just now and look at this target variable column, there are only two possible values: zero or one. Zero means no failure and one means failure. So this is an example of binary classification: only two possible outcomes, zero or one, didn't fail or failed. For the same data set we can also extend this into a multiclass classification problem. If we want to drill down further, we can say that not only is there a failure, there are actually different types of failure. So we have one class that is basically no failure, and then we have separate classes for the
different types of failure. You can have a power failure, you could have a tool wear failure, and, if we go down here, you could have an overstrain failure, and so on. So you can have multiple classes of failure in addition to the overall majority class of no failure, and that would be a multiclass classification problem. With this data set we are going to see how to frame it both as a binary classification problem and as a multiclass classification problem.

Okay, so let's look at the workflow. Let's say we have already got the data; this is the data set that we have. Let's assume we somehow managed to get this data from some IoT sensors that are monitoring real-time data in our production environment, on the assembly line: sensors reading data that gives us everything we have in this CSV file. So we have retrieved the data, and now we move on to the cleaning and exploration part of the machine learning life cycle. In the data cleaning part we are interested in checking for missing values and deciding how to handle them. We can remove the rows with missing values, or we can fill in replacement values, which could be, for example, the average of all the values in that particular column. We also try to identify outliers in our data set, and again there are a variety of ways to deal with those. This is called data cleansing, and it is a really important part of your machine learning workflow. So that is where we are now: we are doing cleansing, and then we will follow up with exploration.
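As a minimal sketch of the kind of cleansing steps described here, assuming the DataFrame df from the earlier snippet, missing values could be checked and handled roughly like this; the choice between dropping rows and filling with a column mean is illustrative, not necessarily what this particular notebook does, and the column name is an assumed example.

```python
# Count missing values per column to see whether cleansing is needed.
print(df.isna().sum())

# Option 1: drop any rows that contain missing values.
df_dropped = df.dropna()

# Option 2: fill missing numeric values with the column mean instead
# (the column name 'Air temperature [K]' is an assumed example).
df_filled = df.copy()
df_filled["Air temperature [K]"] = df_filled["Air temperature [K]"].fillna(
    df_filled["Air temperature [K]"].mean()
)
```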
So let's look at the actual code that does the cleansing. Here we are right at the start of the machine learning life cycle, in the Jupyter notebook, and at the top we have a brief description of the problem statement: this data set reflects real-life predictive maintenance as encountered in industry, with measurements from real equipment, and the feature descriptions are taken directly from the data source. Here we have a description of the six key features in our data set: the type, which is the quality of the product, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. Those are the six feature variables, and then there are the two target variables. As I showed you just now, one target variable has only two possible values, zero or one, meaning no failure or failure; that is this column here (let me scroll all the way back up: this column, as we already saw, only has two values, zero or one). Then we also have this other column, which is the failure type, and as I demonstrated just now it has several categories, or types, of failure, which is what we call multiclass classification. So we can either build a binary classification model for this problem domain or a multiclass classification model, and this Jupyter notebook is going to demonstrate both approaches.

As a first step, we write the Python code that imports all the libraries we need: it imports the relevant machine learning libraries for our domain use case. Then we load in our data set, describe it, and get some quick insights; we take a look at the feature variables and so on. What we are doing at this point is just a quick overview of the data set. All of this Python code allows us, the data scientists, to get a quick overview: how many rows there are, how many columns, what the data types of the columns are, what the column names are, and so on. Then we zoom in on the target variables: we look at how many counts there are of each target value and how many different types of failure there are.
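A rough sketch of that overview step, again assuming the DataFrame df from before and the column names Target and Failure Type used on the Kaggle page:

```python
# Quick structural overview: row/column counts, dtypes, basic statistics.
df.info()
print(df.describe())

# Zoom in on the two target columns: class counts for the binary target
# and for the failure-type categories (column names are assumptions).
print(df["Target"].value_counts())
print(df["Failure Type"].value_counts())
```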
Then you want to check whether there are any inconsistencies between the Target and the Failure Type columns. When you do all this checking you are going to discover that there are some discrepancies in the data set. Using a specific bit of Python code to do the checking, you find that there are some errors here: there are nine rows that are classified as a failure in the Target variable but as "no failure" in the Failure Type variable, which means there is a discrepancy in those data points. These are the rows with discrepancies: the target variable says one, and we already know that a target of one is supposed to mean a failure, so we expect to see a failure classification, yet some rows say there is no failure even though the target is one. This is a classic example of an error that can very well occur in a data set, and the question is what you do with these errors. Here the data scientist decides it makes sense to remove those instances, so they write some code to remove those rows, or data points, from the overall data set. We then check for other issues in the same way; we find another issue, another warning, and again we can remove those rows. In total you remove 27 instances, or rows, from the overall data set. The data set has 10,000 rows, so removing 27 is only about 0.27% of the entire data set, and these were the reasons they were removed. Removing roughly 0.27% of the data is no big deal, but you did need to remove them, because these 27 erroneous data points could really affect the training of your machine learning model. So this is data cleansing: we are cleansing out data that is incorrect or erroneous in the original data set.
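As a hedged sketch of how that consistency check and removal might look in pandas (the exact logic in the notebook may differ, and the column names are again the assumed Kaggle names):

```python
# Rows flagged as a failure in the binary target but labelled
# 'No Failure' in the failure-type column are inconsistent.
inconsistent = df[(df["Target"] == 1) & (df["Failure Type"] == "No Failure")]
print(f"Found {len(inconsistent)} inconsistent rows")

# Drop the inconsistent rows; at roughly 27 rows out of 10,000 (~0.27%)
# this barely changes the data set but avoids confusing the model.
df = df.drop(index=inconsistent.index).reset_index(drop=True)
```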
Then we go on to the next part, which is called EDA, exploratory data analysis. EDA is where we explore our data: we want a visual overview of the data as a whole, and we also look at its statistical properties, the statistical distribution of the data in the various columns, the correlation between the different feature variables, and the correlation between the feature variables and the target variable. All of this is called EDA, and in a machine learning workflow it is typically done through visualization. So let's go back and take a look. For example, here we are looking at correlation: we plot the values of the various feature variables against each other and look for potential correlations and patterns. All the different shapes that you see in this pair plot have different statistical meanings, and the data scientist has to visually inspect the pair plot and make some interpretations of the patterns they see. These are some of the insights that can be deduced from looking at those patterns: the torque and the rotational speed are highly correlated, the process temperature and the air temperature are highly correlated, and failures occur at extreme values of some features, and so on. Then you can plot other kinds of charts, such as a violin plot, to get new insights; for example, for torque and rotational speed you can see that most failures are triggered at values much lower or much higher than the mean of the non-failing samples. All of these visualizations are there, and a trained data scientist can inspect them and make insightful deductions from them: the percentage of failures, the correlation heat map between all the feature variables and the target variable, the product types, the percentage of each product type, and the percentage of failures with respect to product type. We can visualize that as well, and we see that certain product types have a higher ratio of failures than others; for example, M-type products tend to fail more than H-type products.
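A minimal sketch of the kind of EDA plots being described, using seaborn and matplotlib; the hue and column choices are assumptions, and the plots in the notebook may be configured differently.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pair plot of the numeric features, coloured by the binary target,
# to eyeball correlations and where the failures sit.
sns.pairplot(df, hue="Target", corner=True)
plt.show()

# Correlation heat map across the numeric columns.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```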
So we can create a vast variety of visualizations at the EDA stage, and the idea behind all of them is simply to give us some preliminary insight into our data set that helps us model it more correctly. We get some more insights into the data from all this visualization. Then we can plot the distributions, so we can see whether a feature follows a normal distribution or some other kind of distribution, and we can use box plots to see whether there are any outliers in the data set. From the box plots we can see that the rotational speed, among other features, has outliers, and we already saw that outliers are a problem you may need to tackle; dealing with them is part of data cleansing. So we may have to check where the potential outliers are, and we can analyse them from the box plot. But then we might say: yes, they are outliers, but maybe they are not really terrible outliers, so we can tolerate them, or maybe we do want to remove them. We can look at the mean and maximum values of these features with respect to product type, and how the extremes relate to the product type. The insight here is that about 4.87% of the instances are outliers, which is not really that much, and the outliers are not horrible, so we just leave them in the data set. For a different data set the data scientist could come to a different conclusion, and then they would do whatever they deem appropriate to cleanse the data set.
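For completeness, here is one common way to quantify outliers with the interquartile range; this is a generic sketch of the idea, not necessarily the exact rule the notebook applies, and the column name is an assumption.

```python
# Flag outliers in a numeric column using the 1.5 * IQR rule.
col = "Rotational speed [rpm]"   # assumed column name
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df[col] < lower) | (df[col] > upper)]
print(f"{len(outliers) / len(df) * 100:.2f}% of rows are outliers in {col}")
```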
Now that we have done all the EDA, the next thing we are going to do is what is called feature engineering. We are going to transform our original feature variables (these are the original feature variables) into some other form before we feed them into our machine learning algorithm for training. So, say this is an example of an original data set: these are some examples, and you don't have to use all of them, of what we call feature engineering, where you transform the original values of your feature variables into these transformed values. That is pretty much what we do here: we apply ordinal encoding and we scale the data, using min-max scaling. And then finally we come to the modelling, where we have to split our data set into a training data set and a test data set. Coming back to what we said earlier: before you train your model, you take your original data set, which is now a feature-engineered data set, and break it into two or more subsets. One is the training data set, which we use to fit and train a machine learning model; the second is the test data set, which we use to evaluate the accuracy of the model. So we have a training data set and a test data set, and we also need to sample: from the original data set we sample some points that go into the training set and some points that go into the test set. There are many ways to do this sampling. One way is stratified sampling, where we ensure that the same proportion of data from each stratum, or class, appears in the training and test data sets as in the original data set (remember that we have a multiclass classification problem), which is very useful for dealing with what is called an imbalanced data set. And here we do have an example of an imbalanced data set, in the sense that the vast majority of data points have the value zero for their target variable, and only an extremely small minority of the data points have the value one.
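Here is a minimal sketch of those preparation steps with scikit-learn: ordinal-encode the categorical Type column, min-max scale the numeric features, and make a stratified train/test split. The column lists, the L/M/H category order, and the 80/20 split are assumptions for illustration; the notebook's exact choices may differ.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Assumed feature/target column names from the Kaggle data set.
numeric_cols = ["Air temperature [K]", "Process temperature [K]",
                "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]
X = df[["Type"] + numeric_cols].copy()
y = df["Target"]

# Ordinal-encode the product quality (assumed categories L/M/H) and
# min-max scale the numeric columns (for simplicity the scaler is fitted
# on all rows here; in practice fit it on the training split only).
X["Type"] = OrdinalEncoder(categories=[["L", "M", "H"]]).fit_transform(X[["Type"]]).ravel()
X[numeric_cols] = MinMaxScaler().fit_transform(X[numeric_cols])

# Stratified split keeps the failure ratio the same in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```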
A situation where the vast majority of values in your class, or target variable, column come from one class and only a tiny minority come from another class is what we call an imbalanced data set, and for an imbalanced data set we typically use a specific technique for the train/test split called stratified sampling. That is exactly what is happening here: we are doing a train/test split, and we are doing it as a stratified split.

And now we actually develop the models. We have our train/test split, and here is where we train the models. In terms of classification there is a whole bunch of possibilities: there are many different algorithms we can use to create a classification model, and these are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, and ensembles. All of these different algorithms will create different kinds of models, which will result in different accuracy measures, and it is the goal of the data scientist to find the model that gives the best accuracy when trained on the given data set. So let's head back to our machine learning workflow. Here I am basically creating a whole bunch of models: one is a random forest, one is balanced bagging, one is a boosting classifier, and one is an ensemble classifier. Using each of these algorithms I am going to fit, or train, a model, and then I am going to evaluate how good each of these models is. Here you can see the evaluation data, and this is the confusion matrix, which is another way of evaluating.
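A compact sketch of what training and comparing several classifiers like this could look like; the particular estimators and settings here are illustrative stand-ins rather than the exact configuration used in the notebook (BalancedBaggingClassifier comes from the separate imbalanced-learn package).

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.ensemble import BalancedBaggingClassifier

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "balanced_bagging": BalancedBaggingClassifier(random_state=42),
    "boosting": GradientBoostingClassifier(random_state=42),
}
# A simple voting ensemble over the individual models.
models["ensemble"] = VotingClassifier(
    estimators=list(models.items()), voting="soft"
)

for name, model in models.items():
    model.fit(X_train, y_train)                   # train on the training split
    proba = model.predict_proba(X_test)[:, 1]     # predicted failure probabilities
    preds = model.predict(X_test)
    print(name,
          "F1:", round(f1_score(y_test, preds), 3),
          "ROC AUC:", round(roc_auc_score(y_test, proba), 3))
```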
Now we come to the key part, which is how I distinguish between all of these models. I have these different models, built with different algorithms, all trained on the same data set; how do I tell them apart? For that we have a whole set of common evaluation metrics for classification, which tell us how good a model is in terms of its classification accuracy. And when it comes to accuracy we actually have many different measures. You might think accuracy is just accuracy, it is either accurate or it is not, but it is not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. The confusion matrix, for example, tells us how many true positives there are (the value is positive and the prediction is positive), how many false positives (the value is negative but the model predicts positive), how many false negatives (the model predicts negative but the value is actually positive), and how many true negatives (the model predicts negative and the true value is also negative). This is called a confusion matrix, and it is one way we assess or evaluate the performance of a classification model. This one is for binary classification; we can also have a multiclass confusion matrix. Then we can measure things like accuracy, which is the true positives plus the true negatives, in other words the total number of correct predictions made by the model, divided by the total number of data points. And then you have other measures as well, such as recall, with its own formula, the F1 score, and something called the ROC curve. Without going too much into the detail of each of these, they are essentially different KPIs. Just as employees in a company have KPIs that measure how efficient or effective they are, the KPIs for your machine learning models are the ROC curve, the F1 score, recall, accuracy, and the confusion matrix.
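To make the formulas concrete, here is a small sketch that derives accuracy, precision, recall and F1 from the entries of a binary confusion matrix, using the predictions of one of the models above (which model is used is arbitrary, purely for illustration).

```python
from sklearn.metrics import confusion_matrix

preds = models["random_forest"].predict(X_test)

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                    # of predicted failures, how many were real
recall    = tp / (tp + fn)                    # of real failures, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```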
So fundamentally, after I have built my four different models, I am going to check and evaluate them using all of those different metrics, such as the F1 score, the precision score, and the recall score. For this model I can check the ROC score, the F1 score, the precision score, and the recall score; then for the next model the same set of scores, and so on. For every single model I have created from my training data set I have a set of evaluation metrics I can use to judge how good that model is. The same goes for the confusion matrix: I have one for each model, so I can use those to compare the four models as well, and then I summarise it all here. From this summary we can see that the top two models, which the data scientist will now focus on, are the bagging classifier and the random forest classifier: they have the highest F1 scores and the highest ROC curve scores, so we can say these are the top two models in terms of accuracy, using the F1 evaluation metric and the ROC AUC evaluation metric. These results are summarised here.

Then we try different sampling techniques. I talked just now about different kinds of sampling; the idea is to account for how the data is distributed across the different classes so that our evaluation of accuracy is statistically sound. We can do what is called oversampling and undersampling, which is very useful when you are working with an imbalanced data set, and this is an example of doing exactly that.
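The notebook's references to Tomek links and borderline SMOTE suggest the imbalanced-learn library; a minimal sketch of over- and undersampling with it might look like this (which sampler is paired with which classifier is an assumption made purely for illustration).

```python
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import TomekLinks

# Oversampling: synthesise extra minority-class (failure) samples
# near the class boundary.
X_over, y_over = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)

# Undersampling: drop majority-class samples that form Tomek links
# with minority samples, cleaning up the class boundary.
X_under, y_under = TomekLinks().fit_resample(X_train, y_train)

print(y_train.value_counts(), y_over.value_counts(), y_under.value_counts(), sep="\n")
```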
We then check the results for all of these different techniques using the F1 score and the ROC AUC score, the two key measures of accuracy here, and compare the scores across the different approaches. We can see that overall these models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC score, but its F1 score was too low. In the data scientist's opinion, the random forest with this particular sampling technique strikes an equilibrium between the F1 and ROC AUC scores. The takeaway is that the macro F1 score improves dramatically using the sampling techniques, so these models might be better compared to the balanced ones. So, based on all of this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons.

We keep refining the evaluation work: we train the models one more time, again doing a train/test split, and for each approach we print out what is called a classification report. This is basically a summary of all the metrics I talked about just now; remember I said there were several evaluation metrics, the confusion matrix, the accuracy, the precision, the recall, the ROC AUC score. With the classification report I get a summary of all of that, and I can see all the values here for this particular model, the bagging classifier with Tomek links; then I can do the same for another model, the random forest with borderline SMOTE, and then for another, the balanced bagging. So again we see a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us. Then again we have confusion matrices: we generate one for the bagging with Tomek links undersampling, one for the random forest with borderline SMOTE oversampling, and one for the balanced bagging by itself, and we compare these three models using the confusion matrices and come to some conclusions.
1:01:08.680,1:01:11.200 about so this one is a kind of a curve 1:01:11.200,1:01:12.520 you look at it to see the area 1:01:12.520,1:01:14.359 underneath the curve this is called AOC 1:01:14.359,1:01:18.079 R area under the curve sorry Au Au R 1:01:18.079,1:01:19.880 area under the curve all right so the 1:01:19.880,1:01:21.839 area under the curve uh 1:01:21.839,1:01:24.319 score will give us some idea about the 1:01:24.319,1:01:25.599 threshold that we're going to use for 1:01:25.599,1:01:27.680 classif ification so we can examine this 1:01:27.680,1:01:29.200 for the bagging classifier for the 1:01:29.200,1:01:30.960 random forest classifier for the balance 1:01:30.960,1:01:33.599 bagging classifier okay then we can also 1:01:33.599,1:01:36.200 again do that uh finally we can check 1:01:36.200,1:01:37.880 the classification report of this 1:01:37.880,1:01:39.680 particular model so we keep doing this 1:01:39.680,1:01:43.200 over and over again evaluating this m 1:01:43.200,1:01:45.720 The Matrix the the accuracy Matrix the 1:01:45.720,1:01:46.880 evaluation Matrix for all these 1:01:46.880,1:01:48.880 different models so we keep doing this 1:01:48.880,1:01:50.520 over and over again for different 1:01:50.520,1:01:53.440 thresholds or for classification and so 1:01:53.440,1:01:56.880 as we keep drilling into these we kind 1:01:56.880,1:02:00.839 of get more and more understanding of 1:02:00.839,1:02:02.799 all these different models which one is 1:02:02.799,1:02:04.760 the best one that gives the best 1:02:04.760,1:02:08.520 performance for our data set okay so 1:02:08.520,1:02:11.440 finally we come to this conclusion this 1:02:11.440,1:02:13.520 particular model is not able to reduce 1:02:13.520,1:02:15.279 the record on failure test than 1:02:15.279,1:02:17.520 95.8% on the other hand balance begging 1:02:17.520,1:02:19.400 with a decision thresold of 0.6 is able 1:02:19.400,1:02:21.520 to have a better recall blah blah blah 1:02:21.520,1:02:25.319 Etc so finally after having done all of 1:02:25.319,1:02:27.480 this evalu ations 1:02:27.480,1:02:31.119 okay this is the conclusion 1:02:31.119,1:02:33.960 so after having gone so right now we 1:02:33.960,1:02:35.279 have gone through all the steps of the 1:02:35.279,1:02:37.760 Machining learning life cycle and which 1:02:37.760,1:02:40.240 means we have right now or the data 1:02:40.240,1:02:41.960 scientist right now has gone through all 1:02:41.960,1:02:43.000 these 1:02:43.000,1:02:47.079 steps uh which is now we have done this 1:02:47.079,1:02:48.640 validation so we have done the cleaning 1:02:48.640,1:02:50.559 exploration preparation transformation 1:02:50.559,1:02:52.599 the future engineering we have developed 1:02:52.599,1:02:54.359 and trained multiple models we have 1:02:54.359,1:02:56.480 evaluated all these different models so 1:02:56.480,1:02:58.599 right now we have reached this stage so 1:02:58.599,1:03:02.720 at this stage we as the data scientist 1:03:02.720,1:03:05.480 kind of have completed our job so we've 1:03:05.480,1:03:08.119 come to some very useful conclusions 1:03:08.119,1:03:09.640 which we now can share with our 1:03:09.640,1:03:13.240 colleagues all right and based on this 1:03:13.240,1:03:15.400 uh conclusions or recommendations 1:03:15.400,1:03:17.160 somebody is going to choose a 1:03:17.160,1:03:19.160 appropriate model and that model is 1:03:19.160,1:03:22.640 going to get deployed for realtime use 1:03:22.640,1:03:25.319 in a real life production environment 1:03:25.319,1:03:27.240 okay and that 
So finally, after having done all of these evaluations, we come to the conclusion. At this point we have gone through all the steps of the machine learning life cycle: the data scientist has done the cleaning, exploration, preparation and transformation, the feature engineering, has developed and trained multiple models, and has evaluated all of those models, so we have now reached this stage. At this stage the data scientist has essentially completed the job and has arrived at some very useful conclusions that can now be shared with colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision is going to be made based on the recommendations coming from the data scientist at the end of this phase, and these are the conclusions the data scientist comes up with. If the engineering team is looking for the highest failure detection rate possible, then they should go with this particular model. If they want a balance between precision and recall, then they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold. But if they don't care so much about catching every single failure and instead want the highest precision possible, then they should opt for the bagging Tomek links classifier with a somewhat higher decision threshold. This is the key thing the data scientist delivers; it is the key takeaway, the end result of the entire machine learning life cycle. The data scientist now asks the engineering team: which is more important to you, point A, point B or point C? Make your decision. The engineering team will discuss among themselves and say: what we want is the highest failure detection rate possible, because any failure of that machine or product on the assembly line is really going to hurt us. We are not so worried about precision, but we want to make sure that if there is a failure, we catch it. So the data scientist will say: go for the balanced bagging model. The data scientist then saves this model, and once it is saved you can go right ahead and deploy it to production.
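Saving the chosen model for deployment is typically a one-liner; here is a sketch with joblib (the file name is an arbitrary choice for illustration).

```python
import joblib

# Persist the chosen model so the serving environment can load it later.
joblib.dump(models["balanced_bagging"], "predictive_maintenance_model.joblib")

# Later, on the production server:
model = joblib.load("predictive_maintenance_model.joblib")
```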
And if we want to, we can actually take this modelling exercise further. Just now I modelled the problem as a binary classification problem, which means the outcome is either zero or one, either fail or not fail. But we can also model it as a multiclass classification problem, because, as I said earlier, the failure type column actually contains multiple kinds of failure: for example you may have a power failure, a tool wear failure, or an overstrain failure. So we model the problem slightly differently, as a multiclass classification problem, and then we go through the same end-to-end process we went through just now: we create different models and test them out, but now the confusion matrix is for a multiclass classification problem. We again try different algorithms or models, for example a random forest and a balanced random forest, again do the train/test split on these different models, and train them using what is called hyperparameter tuning, with a grid search. Then we get the same kinds of evaluation scores, compare them, generate a confusion matrix, which is now a multiclass confusion matrix, and come to the final conclusion.
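A sketch of what hyperparameter tuning with a grid search could look like for the multiclass variant, using the failure type as the target; the parameter grid and the scoring choice are illustrative assumptions, not the notebook's exact settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Multiclass target: the failure type rather than the 0/1 flag
# (column name is the assumed Kaggle name).
y_multi = df["Failure Type"]

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",   # macro F1 treats every failure class equally
    cv=5,
)
search.fit(X, y_multi)
print(search.best_params_, search.best_score_)
```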
So if you are interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist. The data scientist will say: I am going to pick this particular model, the balanced bagging classifier, and these are all the reasons given as the rationale for selecting it. Once that is done, you save the model, and that's it. The machine learning model can now go live: you run it on the server and it is ready to work, which means it is ready to generate predictions, and that is the main job of the machine learning model. You have picked the best model, with the best evaluation metrics for whatever accuracy goal you are trying to achieve, and now you run it on a server. All the real-time data coming from your sensors is pumped into the machine learning model, the model pumps out a whole bunch of predictions, and you use those predictions in real time to make real-time, real-world decisions. You might say: I am predicting that this machine is going to fail on Thursday at 5:00 p.m., so you had better get your service folks in to service it on Thursday at 2:00 p.m. In other words, you can decide when you want to do your maintenance and make the best decisions to optimise the cost of maintenance, and so on.
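A minimal sketch of that real-time scoring step: load the saved model and score an incoming sensor reading. The reading here is a made-up example, and in practice it would need the same encoding and scaling applied to the training data before prediction.

```python
import pandas as pd
import joblib

model = joblib.load("predictive_maintenance_model.joblib")

# One incoming sensor reading (values already encoded/scaled the same
# way as the training features; this is a hypothetical example).
reading = pd.DataFrame([{
    "Type": 1.0,                     # ordinal-encoded product quality
    "Air temperature [K]": 0.55,
    "Process temperature [K]": 0.60,
    "Rotational speed [rpm]": 0.20,
    "Torque [Nm]": 0.85,
    "Tool wear [min]": 0.90,
}])

failure_probability = model.predict_proba(reading)[0, 1]
print(f"Predicted probability of failure: {failure_probability:.2f}")
```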
Then, based on the results coming back from the predictions (they may be good, they may be poor, they may be average), we are constantly monitoring how good or how useful the predictions generated by this real-time model running on the server are, and based on that monitoring we take in some new data and repeat this entire life cycle again. So this is fundamentally an iterative workflow: the data scientist is constantly taking in new data points, refining the model, perhaps picking a new model, and deploying the new model onto the server, and so on. And that is basically your machine learning workflow in a nutshell.

For this particular approach we have used a number of data science libraries from Python. We used pandas, the most basic data science library, which provides all the tools for working with raw data; NumPy, a high-performance library for complex array and matrix operations; matplotlib and seaborn, which are used in the EDA, the exploratory data analysis phase of machine learning, where you visualise your data; and scikit-learn, the machine learning library used to implement all the core machine learning algorithms. We have not used the deep learning libraries, because this is not a deep learning problem, but if you are working on a deep learning problem such as image classification, image recognition, object detection, natural language processing or text classification, then you would use TensorFlow and also PyTorch.

And lastly, this whole data science project that you just saw was developed in something called a Jupyter Notebook: all of this Python code, along with all the observations from the data scientist, was written and run in a Jupyter Notebook, which is the most widely used tool for interactively developing and presenting data science projects. So that brings me to the end of this presentation. I hope you found it useful and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. Thank you all so much for watching.