Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI and machine learning, which is predictive maintenance.

I will be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each of them. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is essentially a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance domain problem. All right, so without any further ado, let's jump into it.

Let's start off with a quick overview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention. The primary purpose of these algorithms is to optimize performance on a specific task, and the main task you want to optimize performance on is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine learning is about making predictions about the future, or what we call predictive analytics.
There are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of the different kinds of algorithms and their use cases in various areas of industry. We have various domain use cases for all these different kinds of algorithms, and we can see that different algorithms are suited to different use cases.

Deep learning is an advanced form of machine learning that's based on something called an artificial neural network, or ANN for short. It essentially simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools you have probably heard of today. I'm sure you have heard of ChatGPT, unless you've been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and that's based on this technology called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, use this particular form of machine learning called deep learning. So here is an example of an artificial neural network. I have an image of a bird that's fed into this artificial neural network, and the output from the network is a classification of this image into one of three potential categories. If the ANN has been trained properly, when we feed in this image, it should correctly classify it as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision.
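As an editorial aside, here is a minimal sketch of what such a tiny three-class image classifier might look like in Python with Keras. This is my own illustration, not code from the presentation; the input size, layer widths, and the three class labels are all assumptions.

```python
# Minimal sketch: a tiny neural network that maps a 32x32 RGB image
# to one of three categories (e.g. bird / cat / dog). Illustrative only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),           # the image pixels go in here
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # layers of interconnected "neurons"
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),     # one probability per category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_images, train_labels, epochs=5)   # train on labelled images
# model.predict(bird_image[None, ...])              # highest probability wins
```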
And just as with classical machine learning, there are a variety of algorithms available for deep learning, under the categories of supervised learning and also unsupervised learning.

All right, so this is how we can categorize all of this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a sub-specialization of machine learning using a particular architecture called an artificial neural network. And generative AI (ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI are basically large language models) is a further subcategory within the area of deep learning.

There are many applications of machine learning in industry right now, so pick whichever industry you are involved in, and these are the specific areas of application. I'm going to guess that the vast majority of you watching this video come from the manufacturing industry, and in manufacturing some of the standard use cases for machine learning and deep learning are predicting potential problems, sometimes called predictive maintenance, where you want to predict when a problem is going to happen and then address it before it happens. There is also monitoring systems, automating your manufacturing assembly or production line, smart scheduling, and detecting anomalies on your production line.

Okay, so let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: it is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance.
This uses advanced data models, analytics, and machine learning, whereby we can reliably assess when failures are more likely to occur, including which components on your production or assembly line are more likely to be affected.

So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that factories, or the production and assembly lines in factories, tended to handle maintenance issues, say, 10 or 20 years ago. You would probably start off with the most basic mode, which is reactive maintenance: you just wait until your machine breaks down and then you repair it. The simplest approach, but if you have worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline. Then you're going to have a backlog of orders and run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem is that sometimes you're doing maintenance that's not really necessary, and it still doesn't totally prevent a failure of the machine that occurs outside of your planned maintenance. So it's a bit of an improvement, but not that much better. And then the last two categories are where we bring in AI and machine learning.
With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us predict, with a reasonable level of accuracy, when the next failure is going to happen on your assembly or production line, for a specific component or a specific machine. You want to predict to a high level of accuracy, maybe to the specific day, even the specific hour or minute, when you expect that particular product or machine to fail.

These are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data you're getting.

Okay, so we're going to take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them, you'll be able to see some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases, so you can click on these links and follow up with them if you want to read more: waste management, manufacturing, building services, renewable energy, and also mining. And this next website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance. It tells you about an industry survey of predictive maintenance.
We can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, and that predictive maintenance is essential for the manufacturing industry and will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: the vast majority of key industry players in the manufacturing sector consider predictive maintenance a very important activity that they want to incorporate into their workflow. And we can see here the kind of ROI expected on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. Best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, Ecoplant. You can jump over here and take a look at some of these use cases. Let me open one up, for example, Mondi. You can see Mondi has used this particular piece of software called MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. You can study how they have used it and how it works: what their challenge was, the problems they were facing, the solution they built with MathWorks Consulting, and the data they collected in an Oracle database. Using MATLAB from MathWorks, they were able to create a deep learning model to solve this particular issue for their domain.
So if you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases of predictive maintenance.

Now in this topic, I'm going to talk about machine learning basics, what is actually involved in machine learning, and give a very quick, conceptual, high-level overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category of machine learning, which is called supervised learning. The particular use case I'm discussing here, predictive maintenance, is basically a form of supervised learning.

So how does supervised learning work? In supervised learning, you create a machine learning model by providing what is called a labelled data set as input to a machine learning program or algorithm. This data set contains what are called the independent or feature variables, which will be a set of variables, and one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's jump into a diagram that illustrates this more clearly. Let's say you have a data set here, an Excel spreadsheet, and this spreadsheet has a bunch of columns and a bunch of rows.
These rows are what we call observations, or samples, or data points in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall. They've got all this information about the customers who purchase products at this mall: their gender, their age, their income, and their number of children. All this information about the customers, we call the independent or feature variables. And alongside all this information about each customer, we also record how much the customer spends. These numbers, we call the target variable, or the dependent variable. So a single row, one single sample or data point, contains all the data for the feature variables and one single value for the label or target variable.

The primary purpose of the machine learning model is to create a mapping from all your feature variables to your target variable. There's going to be a mathematical function that maps the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. This whole training process is what we call fitting the model. And the target variable or label, this column here, is critical for providing the context to do the fitting or training of the model.
Once you've got a trained and fitted model, you can then use the model to make an accurate prediction of the target values corresponding to new feature values that the model has yet to encounter, and this, as I said earlier, is called predictive analytics. So let's see what's actually happening here. You take your training data, this whole data set consisting of a thousand or ten thousand rows, you feed the entire data set into your machine learning algorithm, and a couple of hours later your machine learning algorithm comes up with a model. The model is essentially a function that maps all your feature variables, these four columns here, to your target variable, this one single column here.

Once you have the model, you can put in a new data point. The new data point represents data about a new customer, a customer you have never seen before. Let's say you've already got information about 10,000 customers who have visited this mall and how much each of them spent while they were there. Now a totally new customer comes into the mall, a customer who has never been here before, and what we know about this customer is that he is male, his age is 50, his income is 18, and he has nine children. When you take this data and feed it into your model, your model is going to make a prediction. It's going to say: hey, based on everything I have been trained on and the model I've developed, I predict that a male customer of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want, right here: the final output of your machine learning model, a prediction about something it has never seen before. That is the core of machine learning: predictive analytics, making predictions about the future based on a historical data set.
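As an editorial aside, here is a minimal sketch of that train-then-predict flow in Python with scikit-learn. The tiny made-up table stands in for the 10,000-row customer data set, and the column names mirror the slide; none of this is the presenter's actual code.

```python
# Minimal sketch: train on historical customers, predict for a new one.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the mall data set (gender encoded as 0 = female, 1 = male).
df = pd.DataFrame({
    "gender":   [0, 1, 0, 1, 0],
    "age":      [25, 50, 33, 41, 60],
    "income":   [30, 18, 55, 22, 70],
    "children": [1, 9, 2, 0, 3],
    "spend":    [40, 25, 80, 30, 95],   # target: amount spent (ringgit)
})
X = df[["gender", "age", "income", "children"]]   # feature variables
y = df["spend"]                                   # target variable / label

model = RandomForestRegressor(random_state=0).fit(X, y)   # "fitting" the model

# A brand-new customer: male, age 50, income 18, nine children.
new_customer = pd.DataFrame([[1, 50, 18, 9]], columns=X.columns)
print(model.predict(new_customer))    # the model's predicted spend for this customer
```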
Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label. Classification can be either binary or multiclass. Binary means just two possible outcomes: true or false, zero or one. Is your machine going to fail or is it not going to fail? Is the customer going to make a purchase or not? We call this binary classification. Multiclass is when there are more than two classes or types of values. For example, this here is a classification problem. You have a data set with information about your customers, the gender, age, and salary of each customer, and you also have a record of whether the customer made a purchase or not. You can take this data set to train a classification model, and the model can then make a prediction about a new customer: it will predict zero, which means the customer won't make a purchase, or one, which means the customer will make a purchase. And this is regression: let's say you want to predict the wind speed, and you've got historical data on four other independent or feature variables, so you have recorded the temperature, pressure, relative humidity, and wind direction for the past 10 or 15 days. You then train your machine learning model using this data set, where the target variable column, the label, is basically a number. This is a regression model, so now you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will predict the wind speed for that new data point. So that's a regression model.
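Here is a minimal sketch of both flavours in scikit-learn, again with invented toy numbers rather than real data: a binary classifier for the purchase example and a regressor for the wind-speed example.

```python
# Minimal sketch: binary classification vs. regression.
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: did the customer purchase (1) or not (0)? Features: [age, salary].
X_cls = [[25, 30], [50, 18], [33, 55], [45, 60]]
y_cls = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[40, 45]]))        # -> [0] or [1], a class label

# Regression: predict wind speed. Features: [temp, pressure, humidity, wind_dir].
X_reg = [[30, 1012, 70, 180], [28, 1008, 80, 90], [26, 1015, 60, 270]]
y_reg = [12.5, 8.0, 15.2]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[29, 1010, 75, 135]]))   # -> a number, the predicted wind speed
```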
All right. In this topic I'm going to talk about the workflow involved in machine learning. In the previous slides, I talked about developing the model, but that's just one part of the entire workflow. In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is that you need to get your data, then you need to clean your data, and then you need to explore your data: you need to see what's going on in your data set. Real-life data sets are not trivial; they are hundreds of rows, thousands of rows, sometimes millions or billions of rows, especially if you're using IoT sensors to collect data in real time. So you've got all these very large data sets; you need to clean them and explore them, and then you need to prepare them into the right format so that you can feed them into the training process to create your machine learning model. Subsequently you check how good the model is: how accurate is it in terms of its ability to generate predictions for the future? How accurate are the predictions coming out of your machine learning model? That's validating, or evaluating, your model. And then, subsequently, you determine whether your model is of adequate accuracy to meet whatever your domain use case requirements are.
Let's say the accuracy required for your domain use case is 85%. If my machine learning model can give an 85% accuracy rate, I think it's good enough, and I'm going to deploy it for real-world use. So here the machine learning model gets deployed on a server, data is captured from various other sources and pumped into the model, the model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other particular scenario. And then you constantly monitor and update the model, you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell.

Here's another view of the same thing in a slightly different format. Again, you have your data collection and preparation. Here we talk more about the different kinds of algorithms that are available to create a model, and I'll cover this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. Once you have chosen the appropriate algorithm, trained your model, and selected the best trained model among the candidates (you are probably going to develop multiple models from multiple algorithms and evaluate them all), you're going to say: after evaluating and testing, I've chosen the best model, and I'm going to deploy it for real-life production use. Real-life sensor data is pumped into my model, my model generates predictions, the predicted data is used immediately in real time for real-life decision making, and then I monitor the results.
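To make that "good enough to deploy?" decision concrete, here is a minimal sketch, assuming scikit-learn plus joblib and a model and test set from the earlier steps; the 85% threshold comes from the talk's example, and the file name is my own placeholder.

```python
# Minimal sketch: evaluate the model, deploy only if it meets the domain threshold.
from joblib import dump
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, model.predict(X_test))   # model, X_test, y_test assumed

if acc >= 0.85:                                       # domain requirement (85%)
    dump(model, "predictive_maintenance_model.joblib")  # ship this file to the server
    print(f"Deployed at {acc:.0%} accuracy")
else:
    print(f"Only {acc:.0%}: gather more data or try another algorithm")
```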
So somebody is using the predictions from my model. If the predictions are lousy, the monitoring system captures that; if the predictions are fantastic, that is also captured by the monitoring system, and that gets fed back into the next cycle of my machine learning pipeline. Okay, so that's the overall view, and here are the key phases of your workflow.

One of the important phases is called EDA, exploratory data analysis. In this phase, you're going to do a lot of work, primarily just to understand your data set. Like I said, real-life data sets tend to be very complex, and they have various statistical properties; statistics is a very important component of machine learning. EDA helps you get an overview of your data set and of any problems in it, like missing data, as well as the statistical properties of your data set: its distribution, the statistical correlation between variables in your data set, and so on.
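In practice, a first EDA pass is often just a handful of pandas one-liners. Here is a minimal sketch, assuming the data sits in a CSV file (the file name is a placeholder):

```python
# Minimal EDA sketch with pandas.
import pandas as pd

df = pd.read_csv("data.csv")
print(df.shape)                      # how many rows (samples) and columns
df.info()                            # column types and non-null counts
print(df.describe())                 # statistical summary of numeric columns
print(df.isnull().sum())             # missing values per column
print(df.corr(numeric_only=True))    # correlation between numeric variables
df.hist(figsize=(10, 8))             # distribution of each column (needs matplotlib)
```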
Then we have data cleaning, or sometimes you call it data cleansing. In this phase, you want to do things like remove duplicate records or rows in your table, make sure your data points or samples have appropriate IDs, and, most importantly, make sure there are not too many missing values in your data set. By missing values I mean things like this: you've got a data set, and for some reason some cells or locations in it are missing their values. If you have a lot of these missing values, you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set and how to handle them. Another important part of data cleansing is figuring out the outliers in your data set. Outliers are data points that are very far from the general trend of data points in your data set. There are several ways to detect outliers in your data set, and several ways to handle them; similarly, there are several ways to handle missing values. Handling missing values and handling outliers are the two key concerns of data cleansing, and there are many, many techniques for this, so a data scientist needs to be acquainted with all of them.

Why do I need to do data cleansing? Well, here is the key point. If you have a very poor-quality data set, which means you've got a lot of outliers that are errors in your data set, or a lot of missing values, then even if you've got a fantastic algorithm and a fantastic model, the predictions your model gives will be absolute rubbish. It's kind of like putting water into the tank of a Mercedes-Benz. The Mercedes-Benz is a great car, but if you put water into it, it will just die; it can't run on water. On the other hand, a Myvi is a cheap, no-frills car, but if you put high-octane, good petrol into a Myvi, it will fly along at 100 miles an hour and completely outperform the water-filled Mercedes-Benz. So it doesn't really matter how fantastic the model you're using is: if your data is lousy quality, your predictions are also going to be rubbish.
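Here is a minimal sketch of the three cleaning moves just described (duplicates, missing values, outliers) using pandas; the column names are illustrative, and the 1.5 × IQR rule is one common outlier convention among several:

```python
# Minimal cleaning sketch: duplicates, missing values, IQR outliers.
df = df.drop_duplicates()                          # remove duplicate rows

df["age"] = df["age"].fillna(df["age"].median())   # impute a numeric column
df = df.dropna(subset=["income"])                  # or drop rows missing a key value

q1, q3 = df["income"].quantile([0.25, 0.75])       # 1.5*IQR rule: points far
iqr = q3 - q1                                      # outside the interquartile
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # range count as outliers
df = df[df["income"].between(low, high)]           # keep only the in-range rows
```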
So cleansing the data set is, in fact, probably the most important thing data scientists need to do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithms, and so on is really a small portion of the actual machine learning workflow; the vast majority of the time goes into cleaning and organizing your data.

Then you have something called feature engineering, where you preprocess the feature variables of your original data set prior to using them to train the model, through addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you need to transform categorical data into numeric data. In the earlier slides, I showed you taking your original data set, pumping it into an algorithm, and a couple of hours later getting a machine learning model; you didn't do anything to the feature variables in your data set before feeding it to the algorithm. But that's generally not what happens in real life. In real life, you're going to take all the original feature variables from your data set and transform them in some way. You can see here the columns of data from my original data set, and before I actually feed these data points into my algorithm to train and get my model, I will transform them. This transformation of the feature variable values is what we call feature engineering.
And there are many, many techniques for feature engineering: one-hot encoding, scaling, log transformation, discretization, date extraction, boolean logic, and so on.
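Here is a minimal sketch of three of those techniques, using pandas, scikit-learn, and NumPy; the column names ('type', 'rotational_speed', 'torque') are illustrative, loosely echoing the data set we'll use later:

```python
# Minimal feature engineering sketch: one-hot encoding, scaling, log transform.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.get_dummies(df, columns=["type"])     # one-hot encode a categorical column

scaler = StandardScaler()                     # rescale to mean 0, std dev 1
df[["rotational_speed"]] = scaler.fit_transform(df[["rotational_speed"]])

df["torque_log"] = np.log1p(df["torque"])     # log transform a skewed column
```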
Okay, then finally we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set and the other is called the test data set. The primary purpose of this is that when we feed and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you're not going to have just one possible model; there is a vast range of algorithms you can use to create a model. Fundamentally you have a wide range of choices, like a wide range of cars: if you want to buy a car, you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a BMW, many different cars are available to you. It's the same with a machine learning model: there is a vast variety of algorithms you can choose from in order to create a model. And once you create a model from a given algorithm, you need to ask how accurate the model created from this algorithm is, because different algorithms are going to create different models with different rates of accuracy. So the primary purpose of the test data set is to evaluate the accuracy of the model, to see whether the model I've created using this algorithm is adequate for use in a real-life production use case. So that's what it's all about: I take my original data set, I break it into my feature variable columns and my target variable column, and then I further break it into a training data set and a test data set. The training data set is used to train and create the machine learning model, and once the model is created, I use the test data set to evaluate its accuracy.

And then finally we can see the different parts or aspects that go into a successful model: EDA is about 10%, data cleansing about 20%, feature engineering about 25%, selecting a specific algorithm about 10%, training the model from that algorithm about 15%, and finally evaluating the models and deciding which is the best one with the highest accuracy rate, that's about 20%.
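Here is a minimal sketch of that split-train-evaluate loop with scikit-learn, assuming X holds the feature columns and y the target column; the 80/20 split and the choice of a random forest are just common defaults, not the only option:

```python
# Minimal sketch: split the data, train on one part, score on the held-out part.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)               # 80% train, 20% test

model = RandomForestClassifier().fit(X_train, y_train)  # train on the training set
acc = accuracy_score(y_test, model.predict(X_test))     # evaluate on unseen data
print(f"test accuracy: {acc:.2%}")
```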
So let's take a look at the dataset. This dataset has a total of 10,000 samples, and these are the feature variables: the type, the product ID, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear; and then this is the target variable. The target variable is what we are interested in: it is what we use to train the machine learning model, and it is also what we want to predict. The feature variables describe, or provide information about, a particular machine on the production line, on the assembly line: the product ID, the type, the air temperature, the process temperature, the rotational speed, the torque, the tool wear. So let's say you've got an IoT sensor system that is capturing all this data about a product or machine on your production or assembly line, and you have also captured, for each specific sample, whether that sample experienced a failure or not. A target value of zero indicates there was no failure, and we can see that the vast majority of data points in this dataset are no-failure. And here we can see an example of a failure: a failure is marked as a one, positive, and no failure is marked as a zero, negative. Here we have one type of failure, called a power failure, and if you scroll down the dataset you will see there are other kinds of failures too, like a tool wear failure, an overstrain failure, another power failure, and so on. If you scroll down through all 10,000 data points, or if you are familiar with using Excel to filter the values in a column, you can see that in this so-called target variable column the vast majority of values are zero, meaning no failure, while some rows have a value of one; and the rows with a value of one carry the different types of failure, like I said just now: power failure, tool wear failure, et cetera.
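You can get the same picture without scrolling by counting the label values; the column names 'Target' and 'Failure Type' are my assumption about how the Kaggle CSV is labelled:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")  # filename as before

# Column names are an assumption about the Kaggle CSV.
print(df["Target"].value_counts())        # 0 = no failure, 1 = failure
print(df["Failure Type"].value_counts())  # breakdown by failure category
```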
So we are going to go through the entire machine learning workflow with this dataset. To see an example of that, we go to the Code section here. If I click on the Code section, right down here we see what are called dataset notebooks. These are Jupyter notebooks; Jupyter is a Python application which allows you to create a Python machine learning program that builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it. So here we have a whole bunch of Jupyter notebooks available, and you can select any one of them; all of these notebooks essentially process the data from this particular dataset. On this Code page I have selected a specific notebook that I am going to run through, to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python programming language. The particular notebook I am going to use is this one here, and you can also get the URL for that particular notebook from here. Okay, so let's quickly do a revision of what we are trying to do: we are trying to build a machine learning classification model. We said there are two primary areas of supervised learning: one is regression, which is used to predict a numerical target variable, and the second is classification, which is what we are doing here, predicting a categorical target variable. In this particular example we actually have two ways we can classify: either a binary classification or a multiclass classification. For binary classification, we are only going to classify the product or machine as either it failed or it did not fail. So if we go back to the dataset I showed you just now and look at the target variable column, there are only two possible values:
either zero or one; zero means there was no failure, one means there was a failure. So this is an example of binary classification: only two possible outcomes, zero or one, did not fail or failed. And then, for the same dataset, we can extend this and make it a multiclass classification problem. If we want to drill down further, we can say that not only is there a failure, there are actually different types of failure. So we have one category, or class, that is basically no failure, and then we have a category for each of the different types of failure: you can have a power failure, a tool wear failure, an overstrain failure, et cetera. So you have multiple classes of failure in addition to the overall majority class of no failure, and that makes it a multiclass classification problem. With this dataset we are going to see how to frame it both as a binary classification problem and as a multiclass classification problem. Okay, so let's look at the workflow, and let's say we have already got the data; this is the dataset we have. Assume we have somehow managed to collect this dataset from some IoT sensors that are monitoring real-time data in our production environment: on the assembly line, on the production line, we have sensors reading data, and that gives us everything in this CSV file. So we have already retrieved the data; now we go on to the cleaning and exploration part of the machine learning life cycle. Let's look at the data cleaning part first. In data cleaning we are interested in checking for missing values, and there are several things we can do with them: we can remove the rows with missing values, or we can put in replacement values, which could for example be the average of all the values in that particular column, et cetera.
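In pandas, those two missing-value strategies are one-liners; a small sketch (this particular Kaggle dataset may well have no missing values at all, so treat it as illustrative):

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

# How many missing values does each column have?
print(df.isnull().sum())

# Option 1: drop every row that contains a missing value.
df_dropped = df.dropna()

# Option 2: fill numeric gaps with the column average instead.
df_filled = df.fillna(df.mean(numeric_only=True))
```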
We also try to identify outliers in our dataset, and there are a variety of ways to deal with those as well. This is called data cleansing, and it is a really important part of your machine learning workflow; that's where we are now, doing cleansing, and then we will follow up with exploration. So let's look at the actual code that does the cleansing. Here we are right at the start of the machine learning life cycle, in a Jupyter notebook, and here we have a brief description of the problem statement: this dataset reflects real-life predictive maintenance as encountered in industry, with measurements from real equipment, and the feature descriptions are taken directly from the data source. So here we have a description of the six key features in our dataset: the type, which is the quality of the product, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. Those are the six feature variables, and then there are the two target variables. Just now I showed you one target variable which only has two possible values, zero or one, meaning no failure or failure; that is this column here. And then we also have this other column, which is the failure type, and as I already demonstrated, there are several categories, or types, of failure, which is what we call multiclass classification. So we can either build a binary classification model for this problem domain or a multiclass classification model, and this Jupyter notebook is going to demonstrate both approaches.
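In code, the two framings simply pick different label columns; a hedged sketch, with the column names and label strings assumed as before:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

# Binary target: 0 = no failure, 1 = failure.
y_binary = df["Target"]

# Multiclass target: 'No Failure', 'Power Failure', 'Tool Wear Failure', ...
y_multi = df["Failure Type"]

# The binary label can also be derived from the multiclass one.
y_binary_derived = (df["Failure Type"] != "No Failure").astype(int)
```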
So, first step: we write the Python code that imports all the libraries we need to use. This is basically Python code importing the relevant machine learning libraries for our domain use case. Then we load in our dataset, we describe it to get some quick insights, and we take a look at all the feature variables and so on. What we are doing now is just a quick overview of the dataset: all this Python code lets us, the data scientists, get a quick overview of our data, such as how many rows there are, how many columns, what the data types of the columns are, what the names of the columns are, et cetera. Then we zoom in on the target variables: we look at how many counts there are of each target value, how many different types of failure there are, and then we check whether there are any inconsistencies between the Target and the Failure Type columns. When you do all this checking, you are going to discover that there are some discrepancies in the dataset. Using some specific Python code to do the checking, you find there are errors here: there are nine values that are classified as a failure in the target variable but as no failure in the failure-type variable, which means there is a discrepancy in those data points. These are all the rows with discrepancies: the target variable says one, and we already know that a target value of one is supposed to mean a failure, so we expect to see a failure classification; but some rows actually say there is no failure even though the target is one. This is a classic example of an error that can very well occur in a dataset.
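That consistency check is a simple boolean filter; a hedged sketch, again assuming the column names and label strings of the Kaggle CSV:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

# Rows labelled as a failure overall but 'No Failure' in the type column.
inconsistent = (df["Target"] == 1) & (df["Failure Type"] == "No Failure")
print(inconsistent.sum(), "inconsistent rows found")

# Drop them, since the two labels contradict each other.
df = df[~inconsistent].reset_index(drop=True)
```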
So now the question is: what do you do with these errors in your dataset? Here the data scientist decides it would make sense to remove those instances, and so they write some code to remove those instances, those rows or data points, from the overall dataset. In the same way we can check for other issues, and we find another issue with the dataset, another warning, so again we can remove those rows. In total we remove 27 instances, or rows, from the overall dataset. The dataset has 10,000 rows, and we are removing 27, which is only 0.27% of the entire dataset, and these were the reasons for removing them. If you are only removing 0.27% of the entire dataset, that is no big deal; but you did need to remove them, because those 27 data points with errors could really affect the training of your machine learning model. So this is data cleansing: we are cleaning out data that is incorrect or erroneous in the original dataset. Then we go on to the next part, which is called EDA, exploratory data analysis. EDA is where we explore our data: we want to get a visual overview of the data as a whole, and also look at its statistical properties, the statistical distribution of the data in the various columns, the correlation between the feature variables, and the correlation between the feature variables and the target variable. All of this is called EDA, and in a machine learning workflow EDA is typically done through visualization. So let's go back and take a look. For example, here we are looking at correlation: we plot the values of the various feature variables against each other and look for potential correlations, patterns, and so on.
All the different shapes that you see in this pair plot have different statistical meanings, and the data scientist has to visually inspect the pair plot and make interpretations of the different patterns they see. These are some of the insights that can be deduced from the patterns: for example, the torque and rotational speed are highly correlated, the process temperature and air temperature are highly correlated, and failures occur at extreme values of some features, et cetera. Then you can plot other kinds of charts, such as a violin chart, to get further insights; for example, for torque and rotational speed you can see again that most failures are triggered at values much lower or much higher than the mean of the non-failing cases. All these visualizations are there, and a trained data scientist can look at them, inspect them, and make insightful deductions from them. Then there is the percentage of failures; the correlation heat map between all the different feature variables and the target variable; the percentage of product types; and the percentage of failures with respect to product type, which we can also visualize. Certain product types have a higher ratio of failures compared to other product types; for example, type M tends to fail more than type H products. So we can create a vast variety of visualizations in the EDA stage, and the idea of all this visualization is to give us some preliminary insight into our dataset that helps us model it more correctly.
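A minimal sketch of the kind of EDA plots being described here, using Matplotlib and Seaborn, with the column names assumed as before:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("predictive_maintenance.csv")

# Pair plot of the numeric features, coloured by failure/no-failure;
# a random sample keeps the plot responsive.
sns.pairplot(df.sample(500, random_state=42), hue="Target")
plt.show()

# Correlation heat map over the numeric columns only.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```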
Then we can plot the distributions, to see whether a variable follows a normal distribution or some other kind of distribution, and we can use box plots to see whether there are any outliers in the dataset. From the box plots we can see that the rotational speed, for example, has outliers, and we already saw that outliers are a problem you may need to tackle; dealing with outliers is part of data cleansing. So we may have to check where the potential outliers are, and we can analyze them from the box plot. But then we might say: well, they are outliers, but maybe they are not really horrible outliers, so we can tolerate them; or maybe we want to remove them. We can look at the mean and maximum values with respect to product type, and how many points sit beyond them, and so on. The insight here is that about 4.87% of the instances are outliers, and 4.87% is not really that much; the outliers are not horrible, so we just leave them in the dataset. For a different dataset, the data scientist could come to a different conclusion, and would then do whatever they deem appropriate to cleanse the dataset.
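One common way to quantify those box-plot outliers is the 1.5 x IQR whisker rule; a small sketch, where the column name is an assumption:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

col = "Rotational speed [rpm]"  # column name is an assumption
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1

# The same 1.5 * IQR whisker rule that box plots use.
outliers = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
print(f"{outliers.mean():.2%} of rows are outliers on {col}")
```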
Now that we have done all the EDA, the next thing we do is what is called feature engineering: before we feed the data in for training, we transform the original feature variables into some other form. These are some examples of feature engineering transformations; you don't have to use all of them. Here we do an ordinal encoding, and the dataset is scaled using min-max scaling. And then finally we come to the modelling, so we have to split our dataset into a training dataset and a test dataset. Coming back to this again: before you train your model, you take your original dataset, which by now is a feature-engineered dataset, and break it into two or more subsets: the training dataset, used to fit and train the machine learning model, and the test dataset, used to evaluate the accuracy of the model. So we have the training dataset and the test dataset, and we also need to sample: from the original dataset, some points are sampled into the training dataset and some into the test dataset. There are many ways to do this sampling; one way is stratified sampling, where we ensure the same proportion of data from each stratum, or class: you want each class to appear in the training and test datasets in the same proportion as in the original dataset, which is very useful for dealing with what is called an imbalanced dataset. And here we do have an imbalanced dataset, in the sense that the vast majority of data points have the value zero in their target variable column, and only an extremely small minority have the value one. A situation where the vast majority of values in the target variable column are from one class and a tiny minority are from another class is called an imbalanced dataset, and for an imbalanced dataset we typically use a specific technique for the train-test split called stratified sampling, and that is exactly what is happening here:
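Putting those steps together, here is a hedged sketch of the ordinal encoding, the min-max scaling, and a stratified split; the column names and the L/M/H quality codes are assumptions about the CSV:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("predictive_maintenance.csv")

# Ordinal-encode the product quality: low < medium < high (assumed codes).
df["Type"] = df["Type"].map({"L": 0, "M": 1, "H": 2})

feature_cols = ["Type", "Air temperature [K]", "Process temperature [K]",
                "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# Scale every feature into [0, 1]. (In practice, fit the scaler on the
# training split only, to avoid leaking test-set statistics.)
X = MinMaxScaler().fit_transform(df[feature_cols])
y = df["Target"]

# stratify=y keeps the failure/no-failure ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("train failure rate:", y_train.mean())
print("test failure rate :", y_test.mean())
```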
we are doing a train-test split here, and it is a stratified split. And now we actually develop the models: we have got the train-test split, and here is where we train the models. Now, for classification there is a whole range of possibilities; there are many, many different algorithms we can use to create a classification model. These are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, ensembles. All these different algorithms will create different kinds of models, which will result in different accuracy measures, and it is the goal of the data scientist to find the best model, the one that gives the best accuracy when trained on the given dataset. So let's head back to our machine learning workflow. Here, basically, I am creating a whole bunch of models: one is a random forest, one is balanced bagging, one is a boosting classifier, one is an ensemble classifier. I am going to fit, or train, a model with each of these algorithms, and then I am going to evaluate them, to see how good each of these models is; here you can see the evaluation data, and this is the confusion matrix, which is another way of evaluating. So now we come to the key part, which is: how do I distinguish between all these models? I have all these different models, built with different algorithms and trained on the same dataset; how do I tell them apart? For that we have a whole set of common evaluation metrics for classification, which tell us how good a model is in terms of its accuracy.
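Training that whole bunch of models can be as simple as a loop; a sketch using a few scikit-learn classifiers (balanced bagging lives in the separate imbalanced-learn package, so it is left out here):

```python
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test, y_test come from the stratified split above.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "bagging": BaggingClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```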
In terms of accuracy we actually have many different measures. You might think accuracy is just accuracy, that a model is either accurate or it is not, but it is not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. The confusion matrix tells us how many true positives there are (the value is positive and the prediction is positive), how many false positives (the value is negative but the model predicts positive), how many false negatives (the model predicts negative but the value is actually positive), and how many true negatives (the model predicts negative and the true value is also negative). This is called a confusion matrix, and it is one way we assess, or evaluate, the performance of a classification model. This one is for binary classification; we can also have a multiclass confusion matrix. Then we can measure things like accuracy: the true positives plus the true negatives, which is the total number of correct predictions made by the model, divided by the total number of data points in the dataset. And then there are other kinds of measures, such as recall (true positives divided by true positives plus false negatives) and the F1 score (the harmonic mean of precision and recall), and then there is something called the ROC curve. Without going too deep into what each of these entails, these are essentially all different KPIs: just as employees in a company have KPIs that measure how efficient or effective they are, the KPIs for your machine learning models are the ROC curve, the F1 score, recall, accuracy,
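All of those KPIs are one-liners in scikit-learn; a short sketch, reusing one of the fitted models from above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

clf = models["random forest"]            # any fitted model from the loop above
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))

# ROC AUC is computed from the predicted probability of the positive class.
proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC  :", roc_auc_score(y_test, proba))
```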
and the confusion matrix. So fundamentally, after I have built my four different models, I am going to check and evaluate them using all those different metrics: the F1 score, the precision score, the recall score, and so on. For this model I can check the ROC score, the F1 score, the precision score, and the recall score; then for the next model the same set of scores; and so on. For every single model I have created with my training dataset, I have a full set of evaluation metrics I can use to judge how good that model is. Same thing here: I have a confusion matrix for each, so I can use that as well to compare these four different models. Then I summarize it all, and from the summary we can see the top two models, which, as a data scientist, I am now going to focus on. These two models are the bagging classifier and the random forest classifier: they have the highest F1 scores and the highest ROC curve scores, so we can say they are the top two models in terms of accuracy, using the F1 evaluation metric and the ROC AUC evaluation metric. These results are summarized here, and then we try different sampling techniques. Just now I talked about different kinds of sampling techniques; the idea is to get a feel for the different distributions of the data in different areas of your dataset, to make sure your evaluation of accuracy is statistically sound. So we can do what is called oversampling and undersampling, which is very useful when you are working with an imbalanced dataset, and this is an example of doing that.
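Oversampling and undersampling are usually done with the imbalanced-learn package; a minimal sketch, assuming it is installed (pip install imbalanced-learn):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority (failure) class with synthetic points...
X_over, y_over = SMOTE(random_state=42).fit_resample(X_train, y_train)

# ...or undersample the majority (no-failure) class instead.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

print("original:", Counter(y_train))
print("SMOTE   :", Counter(y_over))
print("under   :", Counter(y_under))
```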
Then we again check the results for all these different techniques, using the F1 score and the AUC score, the two key measures of accuracy here, and compare the scores of the different approaches. We can see that overall these models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC score, but its F1 score was too low. In the data scientist's opinion, the random forest with this particular sampling technique has an equilibrium between the F1 and AUC scores. So the takeaway is that the macro F1 score improves dramatically using the sampling techniques, so these models might be better compared to the balanced ones. Based on all this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons. So we keep refining the evaluation work: we train the models one more time, again doing a train-test split for each particular approach, and then we print out what is called a classification report. This is basically a summary of all those metrics I talked about just now; remember there were several evaluation metrics, the confusion matrix, accuracy, precision, recall, the AUC score. With the classification report I get a summary of all of that, and I can see all the values for this particular model, bagging with Tomek links; then I can do the same for another model, the random forest with borderline SMOTE, and then for another, the balanced bagging. So again there is a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us.
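The classification report itself is a single call; a small sketch, reusing the test labels and predictions from the metrics snippet above:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1 and support, in one table.
print(classification_report(y_test, y_pred,
                            target_names=["no failure", "failure"]))
```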
Then again we have a confusion matrix: we generate one for the bagging with Tomek links undersampling, one for the random forest with borderline SMOTE oversampling, and one for balanced bagging by itself, and again we compare these three models using the confusion matrix as the evaluation metric and come to some conclusions. Then we move on and look at another kind of evaluation metric, the ROC score. This is one of the other evaluation metrics I talked about: it is a curve, and you look at the area underneath it, which is called the AUC, the area under the curve. The AUC score gives us some idea about the threshold we should use for classification, and we can examine this for the bagging classifier, the random forest classifier, and the balanced bagging classifier. Finally, we can check the classification report of each particular model. So we keep doing this over and over again, evaluating the accuracy metrics for all these different models at different classification thresholds, and as we keep drilling in, we get more and more understanding of which of these models gives the best performance for our dataset. So finally we come to this conclusion: this particular model is not able to push the recall on failures beyond 95.8%; on the other hand, balanced bagging with a decision threshold of 0.6 is able to achieve a better recall, and so on. So, after having done all of these evaluations, this is the conclusion.
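Changing the decision threshold away from the default 0.5 is just a comparison against the predicted probabilities; a hedged sketch, continuing from the earlier snippets:

```python
from sklearn.metrics import recall_score, roc_curve

proba = clf.predict_proba(X_test)[:, 1]  # fitted classifier from above

# The full ROC curve: one (false-positive rate, recall) point per threshold.
fpr, tpr, thresholds = roc_curve(y_test, proba)

# Re-label the predictions with a custom 0.6 decision threshold.
y_pred_06 = (proba >= 0.6).astype(int)
print("recall at threshold 0.6:", recall_score(y_test, y_pred_06))
```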
So right now we, or rather the data scientist, have gone through all the steps of the machine learning life cycle: the cleaning, the exploration, the preparation and transformation, the feature engineering; we have developed and trained multiple models, and we have evaluated all of them. At this stage we, as the data scientist, have pretty much completed our job: we have come to some very useful conclusions which we can now share with our colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment; that decision is made based on the recommendations coming from the data scientist at the end of this phase. And at the end of this phase, the data scientist comes up with conclusions like these: if the engineering team is looking for the highest failure detection rate possible, they should go with this particular model; if they want a balance between precision and recall, they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold; but if they do not care so much about predicting every failure and want the highest precision possible, they should opt for the bagging Tomek links classifier with a somewhat higher decision threshold. This is the key thing the data scientist delivers, the key takeaway, the end result of the entire machine learning life cycle. The data scientist tells the engineering team: you decide which is more important for you, point A, point B, or point C. The engineering team will then discuss among themselves and say: hey, you know
what, what we want is to get the highest failure detection possible, because any kind of failure of that machine, or of the product on the assembly line, is really going to hurt us big time. So what we are looking for is the model that gives us the highest failure detection rate; we do not care so much about precision, but we want to be sure that if there is a failure, we are going to catch it. That is what they want, and so the data scientist says: then you go for the balanced bagging model. The data scientist saves this model, and once it has been saved, you can go right ahead and deploy it to production. And if we want to continue, we can take this modelling problem further. Just now I modelled this problem as a binary classification, either zero or one, fail or not fail; but we can also model it as a multiclass classification problem, because, as I said earlier, the failure-type column actually contains multiple kinds of failure: you may have a power failure, a tool wear failure, an overstrain failure. So we can model the problem slightly differently, as a multiclass classification problem, and then go through the same entire process we went through just now: we create different models and test them, but now the confusion matrix is for a multiclass classification. So we check them out, again trying different algorithms or models and again doing the train-test split; we have, for example, a balanced random forest and a random forest with a grid search, and we train the models using what is called hyperparameter tuning.
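Hyperparameter tuning with a grid search might look like this sketch; the parameter grid here is illustrative, not taken from the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid: a few forest sizes and depths to search over.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 suits multiclass targets
    cv=5,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("best CV macro F1:", search.best_score_)
```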
Then you get the scores: the same evaluation scores again, which you check and compare between the models, and you generate a confusion matrix, this time a multiclass confusion matrix, and then you come to the final conclusion. So if you are interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist. The data scientist says: I am going to pick this particular model, the balanced bagging classifier, and these are all the reasons given as the rationale for selecting it. Once that is done, you save the model, and that's it; it is all done now. The machine learning model can now be put live, running on a server, and it is ready to work, which means it is ready to generate predictions; that is the main job of the machine learning model. You have picked the best machine learning model, with the best evaluation metrics for whatever accuracy goal you are trying to achieve, and now you run it on a server: all the real-time data coming from your sensors is pumped into the machine learning model, the model pumps out a whole bunch of predictions, and you use those predictions in real time to make real-time, real-world decisions. You might say: I am predicting that this machine is going to fail on Thursday at 5:00 p.m., so you had better get your service folks in to service it on Thursday at 2:00 p.m.,
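Saving the chosen model and reloading it on a server is typically done with joblib; a minimal hedged sketch, reusing the grid-search result from the previous snippet:

```python
import joblib

# Persist the winning model to disk (the filename is illustrative)...
joblib.dump(search.best_estimator_, "failure_model.joblib")

# ...then, on the server, reload it and score incoming sensor readings.
model = joblib.load("failure_model.joblib")
predictions = model.predict(X_test)  # X_test stands in for live sensor data
```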
So you can, you know, make decisions on when you want to do your maintenance, and make the best decisions to optimize the cost of maintenance, et cetera, et cetera.

And then, based on the results that come out of the predictions, so the predictions may be good, the predictions may be lousy, the predictions may be average, we are constantly monitoring how good or how useful the predictions generated by this real-time model running on the server are. And based on our monitoring, we will then take some new data and repeat this entire life cycle again; there's a small sketch of this monitoring idea after the library roundup below. So this is basically an iterative workflow, where the data scientist is constantly getting in all these new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. All right, so that's it; that is basically your machine learning workflow in a nutshell.

Okay, so for this particular approach we have used a bunch of data science libraries from Python. We have used pandas, which is the most basic data science library; it provides all the tools to work with raw data. We have used NumPy, which is a high-performance library for implementing complex array and matrix operations. We have used Matplotlib and Seaborn, which are used during the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data. And we have used scikit-learn, which is the machine learning library that implements all your core machine learning algorithms. We have not used deep learning libraries here, because this is not a deep learning problem, but if you are working on a deep learning problem like image classification, image recognition, object detection, natural language processing, or text classification, then you're going to use these libraries from Python: TensorFlow and also PyTorch.
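As promised above, here is a minimal sketch of that monitoring idea. The needs_retraining helper and its threshold are hypothetical, not part of the demo: once the actual outcomes are known, compare them with what the live model predicted, and trigger another pass through the workflow when the failure-detection rate drops.

```python
# A minimal sketch (hypothetical helper and threshold) of monitoring the
# live model and deciding when to kick off another iteration of the workflow.
from sklearn.metrics import recall_score

def needs_retraining(y_actual, y_predicted, min_recall=0.90):
    """True when the live model's failure-detection rate (recall) has dropped
    below the agreed threshold, signalling a retraining cycle."""
    return recall_score(y_actual, y_predicted) < min_recall

# Toy example: the model missed one of the three real failures (recall = 2/3).
if needs_retraining(y_actual=[1, 0, 1, 1, 0], y_predicted=[1, 0, 0, 1, 0]):
    print("Recall below threshold - collect new data and retrain.")
```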
And then lastly, that whole data science project that you saw just now was actually developed in something called a Jupyter notebook. So all this Python code, along with all the observations from the data scientist for this entire data science project, was run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects.

Okay, so that brings me to the end of this entire presentation. I hope that you found it useful and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. All right, thank you all so much for watching.