Hello everyone, my name is Victor, and I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation I'd like to talk about a specific industry use case of AI and machine learning: predictive maintenance. I'll be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each of them. I'm going to start with a general overview of AI and machine learning, then discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we'll come to the meat of this presentation, which is a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance problem. All right, so without any further ado, let's jump into it.

Let's start with a quick overview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we're really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention, and the primary purpose of these algorithms is to optimize performance in a specific task. The main task we want to optimize is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine learning is about making predictions about the future, or what we call predictive analytics. There are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of these algorithms and their use cases in various areas of industry, and we can see that different algorithms are suited to different use cases.

Deep learning is an advanced form of machine learning based on something called an artificial neural network, or ANN for short, which simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools you've probably heard of today. I'm sure you've heard of ChatGPT, unless you've been living in a cave for the past two years; ChatGPT is an example of what we call a large language model, and it's based on deep learning. All the modern computer vision applications, where a program can classify, detect, or recognize images on its own, also use this particular form of machine learning. So here's an example of an artificial neural network: I have an image of a bird that's fed into the ANN, and the output is a classification of that image into one of three potential categories.
If the ANN has been trained properly, when we feed in this image, it should correctly classify it as a bird. This is an image classification problem, a classic use case for an artificial neural network in the field of computer vision. And just like machine learning in general, there is a variety of deep learning algorithms under the categories of supervised learning and unsupervised learning.

So here's how we can categorize all this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a subspecialization of machine learning using a particular architecture called the artificial neural network. Generative AI tools like ChatGPT, Google Gemini, and Microsoft Copilot are basically large language models, which are a further subcategory within deep learning. There are many applications of machine learning in industry right now, whichever industry you're involved in. I'm going to guess the vast majority of you watching this video come from the manufacturing industry, where some of the standard use cases for machine learning and deep learning are predicting potential problems (sometimes called predictive maintenance, where you want to predict when a problem is going to happen and address it before it does), monitoring systems, automating your manufacturing assembly or production line, smart scheduling, and detecting anomalies on your production line.

Okay, so let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: predictive maintenance is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. It uses advanced data models, analytics, and machine learning to reliably assess when failures are more likely to occur, including which components on your production or assembly line are more likely to be affected.

So where does predictive maintenance fit into the overall scheme of things? Consider the standard way factories tended to handle maintenance, say, 10 or 20 years ago. You'd probably start off with the most basic mode, which is reactive maintenance: you just wait until your machine breaks down, and then you repair it. It's the simplest approach, but if you've worked on a production line for any period of time, you know reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline; then you have a backlog of orders and you run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate, maybe once a month or once every two weeks.
This is better, but the problem is that sometimes you're doing more maintenance than is really necessary, and it still doesn't totally prevent failures that occur outside of your planned maintenance windows. So it's a bit of an improvement, but not that much better. The last two categories are where we bring in AI and machine learning. With machine learning, we use sensors to do real-time monitoring, and using that data we build a machine learning model that helps us predict, with a reasonable level of accuracy, when the next failure is going to happen on a specific machine or component on your assembly or production line, ideally down to the specific day, hour, or even minute.

These are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. Of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data you're collecting.

Okay, so let's take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them you can read about real-life uses of machine learning in predictive maintenance. The IBM website walks through five use cases: waste management, manufacturing, building services, renewable energy, and mining. You can click through and follow up on them if you want to read more. And this other website is a pretty good one; I'd really encourage you to look through it if you're interested in predictive maintenance. It presents an industry survey of predictive maintenance, in which a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, is essential for the manufacturing industry, and will gain additional strength in the future. The survey was done quite some time ago, but we can see that the vast majority of key industry players in the manufacturing sector consider predictive maintenance a very important activity they want to incorporate into their workflow. We can also see the kind of ROI expected on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. And best of all, if you really want concrete examples, there are all these companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, Ecoplant.
So you can jump over and take a look at some of these use cases. Let me open one up, for example, Mondi. Mondi used MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. You can study how they used it: the challenges and problems they were facing, the solution they built with MathWorks Consulting, and the data they collected in an Oracle database. Using MATLAB, they were able to create a deep learning model to solve this particular issue in their domain. So if you're interested, I strongly encourage you to read up on all these real-life customer stories showcasing use cases for predictive maintenance.

Okay, that's it for real-life use cases. In this next topic I'm going to talk about machine learning basics, what's actually involved in machine learning, as a very quick, high-level conceptual overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category, supervised learning, because the particular use case I'm discussing here, predictive maintenance, is basically a form of supervised learning.

So how does supervised learning work? In supervised learning, you create a machine learning model by providing what is called a labelled dataset as input to a machine learning algorithm. This dataset contains a set of independent or feature variables, and one dependent or target variable, which we also call the label. The idea is that the independent (feature) variables are the attributes or properties of your data that influence the dependent (target) variable. This process is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's look at a diagram that illustrates it more clearly.

Let's say you have a dataset here, an Excel spreadsheet with a bunch of columns and rows. The rows represent what we call observations, samples, or data points. Assume this dataset was gathered by a marketing manager at a retail mall, and it holds information about the customers who purchase products at the mall: their gender, their age, their income, and their number of children. All this information about the customers, we call the independent or feature variables. And alongside that, we've also recorded how much each customer spends.
These spending numbers are the target or dependent variable. So a single row, one data point, contains the values for all the feature variables and one value for the label or target variable. The primary purpose of the machine learning model is to create a mapping from your feature variables to your target variable: a mathematical function that maps the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. This training process is also called fitting the model, and the target variable or label, this column here, is critical for providing the context to do the fitting or training. Once you have a trained, fitted model, you can use it to make accurate predictions of target values for new feature values that the model has never encountered, and this, as I said earlier, is called predictive analytics.

So let's see what's actually happening here. You take your training data, this entire dataset of maybe a thousand or ten thousand rows, you feed it into your machine learning algorithm, and a couple of hours later the algorithm produces a model. The model is essentially a function that maps your feature variables, these four columns, to your target variable, this one single column. Once you have the model, you can put in a new data point representing a customer you've never seen before. Say you already have information on 10,000 customers and how much each of them spent at the mall, and now a totally new customer comes in: male, age 50, income 18, nine children. You feed that into your model and it makes a prediction: based on everything it was trained on, it predicts that this customer is going to spend 25 ringgit at the mall. And that's it, right there: the final output of your machine learning model is a prediction about something it has never seen before. That is the core of machine learning: predictive analytics, making predictions about the future based on a historical dataset.

Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label.
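To make this concrete, here's a minimal sketch of the mall-customer example in Python with scikit-learn. The data values and column names are hypothetical, just echoing the slide, and the random forest regressor is one possible choice of algorithm, not necessarily what any particular tool uses:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Historical (labelled) data: four feature variables plus the "spend" label
data = pd.DataFrame({
    "gender":   [1, 0, 1, 0, 1, 0],          # 1 = male, 0 = female (encoded)
    "age":      [25, 34, 41, 29, 58, 63],
    "income":   [30, 45, 60, 38, 20, 75],
    "children": [0, 2, 3, 1, 4, 2],
    "spend":    [12, 40, 55, 28, 15, 80],
})

X = data[["gender", "age", "income", "children"]]   # independent variables
y = data["spend"]                                   # dependent variable (label)

# "Fitting the model": learn the mapping from features to target
model = RandomForestRegressor(random_state=42).fit(X, y)

# A brand-new customer: male, age 50, income 18, nine children
new_customer = pd.DataFrame([{"gender": 1, "age": 50, "income": 18, "children": 9}])
print(model.predict(new_customer))   # the model's predicted spend
```

In real life you'd have thousands of rows rather than six, but the shape of the workflow is the same: fit on historical data, then predict for unseen data points.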
For classification you can have either binary or multiclass. Binary means just two classes: true or false, zero or one. Is your machine going to fail or not? Is the customer going to make a purchase or not? Two possible outcomes; we call this binary classification. Multiclass is when there are more than two classes or types of values. So, for example, here's a classification problem: you have a dataset with information about your customers, the gender, age, and salary of each customer, plus a record of whether the customer made a purchase or not. You can use this dataset to train a classification model, and the model can then make a prediction about a new customer: zero, meaning the customer won't make a purchase, or one, meaning they will.

And this is regression: say you want to predict the wind speed, and you have historical data for four other feature variables, the temperature, pressure, relative humidity, and wind direction, recorded over the past 10 or 15 days. You train your machine learning model on this dataset, and the target variable column, the label, is a number. That makes this a regression model: you put in a new data point, a new set of values for temperature, pressure, relative humidity, and wind direction, and your model predicts the wind speed for that data point. So that's a regression model.
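For comparison with the regression sketch above, here's the purchase example as a minimal binary classification sketch. Again, the data and column names are hypothetical, and logistic regression is just one of many possible classification algorithms:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical data: did each customer make a purchase (1) or not (0)?
data = pd.DataFrame({
    "gender":    [1, 0, 1, 0, 1, 0],
    "age":       [22, 35, 47, 52, 28, 61],
    "salary":    [25, 60, 80, 40, 30, 90],
    "purchased": [0, 1, 1, 0, 0, 1],    # categorical target: two classes
})

X, y = data[["gender", "age", "salary"]], data["purchased"]
clf = LogisticRegression().fit(X, y)

# Predict the class (0 or 1) for a brand-new customer
print(clf.predict(pd.DataFrame([{"gender": 1, "age": 40, "salary": 55}])))
```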
All right, so in this next topic I'm going to talk about the workflow that's involved in machine learning. In the previous slides I talked about developing the model, but that's just one part of the whole thing; in real life, when you use machine learning, there's an end-to-end workflow. The first thing, of course, is you need to get your data, then you need to clean your data, and then you need to explore it to see what's going on in your dataset. Real-life datasets are not trivial: hundreds of rows, thousands of rows, sometimes millions or billions of data points, especially if you're using IoT sensors to collect data in real time. So you have these very large datasets, you need to clean and explore them, and then prepare them into the right format so you can feed them into the training process to create your machine learning model. Then you check how good the model is: how accurate are the predictions coming out of it? That's validating or evaluating your model. If you determine that the model's accuracy is adequate for your domain use case requirements, say the accuracy required for your use case is 85% and your model can deliver that, then it's good enough and you deploy it for real-world use. The model gets deployed on a server, data is captured from various sources and pumped into the model, the model generates predictions, and those predictions are used to make decisions on the factory floor in real time, or in whatever your scenario is. Then you constantly monitor and update the model, you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell.

Here's another view of the same thing in a slightly different format. Again, you have data collection and preparation. This version says more about the different kinds of algorithms available for creating a model, which I'll cover in more detail when we look at the real-world example of an end-to-end workflow for the predictive maintenance use case. Once you've chosen an appropriate algorithm and trained your model, you select the best trained model among the candidates: you'll probably develop multiple models from multiple algorithms, evaluate them all, and then say, after evaluating and testing, I've chosen the best model and I'm going to deploy it for real-life production use. Real-life sensor data is pumped into the model, the model generates predictions, the predicted data is used immediately in real time for real-life decision making, and then you monitor the results. If somebody using the predictions finds they're lousy, the monitoring system captures that; if they're fantastic, that's captured too, and it all feeds back into the next cycle of the machine learning pipeline.

Okay, so that's the overall view; now for the key phases of the workflow. One important phase is EDA, exploratory data analysis. In this phase you do a lot of work, primarily just to understand your dataset. Like I said, real-life datasets tend to be very complex and have various statistical properties, and statistics is a very important component of machine learning. EDA helps you get an overview of your dataset and of any problems in it: missing data, the statistical properties and distribution of your data, the statistical correlation of variables, and so on. Then we have data cleaning, or data cleansing. In this phase you primarily do things like removing duplicate records or rows, making sure your data points or samples have appropriate IDs, and, most importantly, making sure there aren't too many missing values in your dataset. What I mean by missing values is something like this: for some reason, some cells or locations in your dataset simply have no value.
And if you have a lot of these missing values, you've got a poor-quality dataset, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your dataset and how to handle them. Another important part of data cleansing is figuring out the outliers in your dataset: data points that are very far from the general trend of the data. There are several ways to detect outliers and several ways to handle them, and similarly there are several ways to handle missing values. Handling missing values and handling outliers are the two key jobs of data cleansing, and there are many, many techniques for both, so a data scientist needs to be acquainted with all of this.

All right, why do I need to do data cleansing? Here's the key point: if you have a very poor-quality dataset, lots of outliers and errors, lots of missing values, then even with a fantastic algorithm and a fantastic model, the predictions your model gives will be absolutely rubbish. It's like putting water into the tank of a Mercedes-Benz. The Mercedes is a great car, but if you fill it with water, it just dies; it can't run on water. On the other hand, take a Myvi, a cheap and basic little car, fill it with good high-octane petrol, and it'll fly along at 100 miles an hour and completely destroy the Mercedes-Benz in performance. The point is, it doesn't really matter how fantastic your model is; if your data is lousy quality, your predictions are also going to be rubbish. Cleansing the dataset is probably the most important thing data scientists do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithms: that's really a small portion of the actual machine learning workflow. The vast majority of the time goes into cleaning and organizing your data.
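As a small taste of what data cleansing looks like in practice, here's a hedged sketch of typical checks with pandas. The filename and column name are placeholders, and the strategies shown (dropping rows, filling with the column mean, a simple standard-deviation outlier rule) are just a few of the many options a data scientist might choose between:

```python
import pandas as pd

df = pd.read_csv("your_dataset.csv")   # placeholder filename

# How many missing values does each column have?
print(df.isnull().sum())

# Option 1: drop rows that contain missing values
df_dropped = df.dropna()

# Option 2: fill missing numeric values with the column mean
df_filled = df.fillna(df.mean(numeric_only=True))

# Remove duplicate rows
df = df.drop_duplicates()

# A simple outlier check: flag points more than 3 standard deviations
# from the mean of a numeric column (column name is a placeholder)
col = df["some_numeric_column"]
outliers = df[(col - col.mean()).abs() > 3 * col.std()]
print(len(outliers), "potential outliers")
```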
Then you have something called feature engineering, where you preprocess the feature variables of your original dataset before using them to train the model, through addition, deletion, combination, or transformation of those variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, to transform categorical data into numeric data. Now, in the earlier slides I showed you taking the dataset exactly as it is, pumping it into the algorithm, and getting a model a couple of hours later, without doing anything to the feature variables first. But that's not what generally happens in real life. In real life, you take all the original feature variables from your dataset and transform them in some way. So here you can see the columns of data from the original dataset, and before those data points actually go into the algorithm for training, they get transformed. The transformation of these feature variable values is what we call feature engineering, and there are many, many techniques for it: one-hot encoding, scaling, log transformation, discretization, date extraction, boolean logic, and so on.

Then, finally, we do something called a train-test split, where we take the original dataset and break it into two parts: a training dataset and a test dataset. The training dataset is what we use to feed and train the machine learning model, and the test dataset is what we use to evaluate the accuracy of the model. This is a key part of your machine learning life cycle, because you're not going to have just one possible model; there's a vast range of algorithms you can use. Fundamentally you have a wide range of choices, like buying a car: you can buy a Perodua Myvi, a Honda, a Mercedes-Benz, an Audi, a BMW, many different cars are available to you. Same thing with machine learning: there's a vast variety of algorithms to choose from, and different algorithms will create different models with different rates of accuracy. So once you create a model from a given algorithm, you need to ask: how accurate is this model? Is it adequate for a real-life production use case? That's the primary purpose of the test dataset.

So: I take my original dataset, separate the feature variable columns from the target variable column, and then further break it into a training dataset and a test dataset. The training dataset is used to create the machine learning model, and once the model is created, the test dataset is used to evaluate its accuracy. And finally, here's roughly how the effort breaks down across the parts of a successful model: EDA about 10%, data cleansing about 20%, feature engineering about 25%, selecting a specific algorithm about 10%, training the model about 15%, and evaluating the models and deciding which has the highest accuracy, about 20%.
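Before we jump into the demo, here's a compact sketch tying feature engineering and the train-test split together with scikit-learn. The filename and column names are hypothetical, and one-hot encoding is just one of the techniques listed above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("your_dataset.csv")    # placeholder filename

# Feature engineering example: one-hot encode a categorical column
# (pd.get_dummies creates one 0/1 column per category)
df = pd.get_dummies(df, columns=["gender"])

# Separate the feature columns from the target column
X = df.drop(columns=["spend"])   # feature variables
y = df["spend"]                  # target variable

# Hold out 20% of the rows as the test dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```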
All right, we've reached the most interesting part of this presentation, which is the demonstration of an end-to-end machine learning workflow on a real-life dataset for the predictive maintenance use case. For this I've used a dataset from Kaggle. For those of you who aren't aware of it, Kaggle is the world's largest open community for data science and AI; it hosts a large collection of datasets from all areas of industry and human endeavor, along with a large collection of models developed on those datasets. Here we have a dataset for our particular use case, predictive maintenance, and here's some information about it; in case you don't know how to get there, this is the URL to click on. Once you're on the page for the dataset, you can see all the information about it, and you can download it in CSV format.

Okay, let's take a look at the dataset. It has a total of 10,000 samples. These are the feature variables: the type, the product ID, the air temperature, process temperature, rotational speed, torque, and tool wear. And this is the target variable, which is what we're interested in: it's what we use to train the machine learning model, and it's what we want to predict. The feature variables describe a particular machine on the production or assembly line. Say you have an IoT sensor system capturing all this data about a product or machine on your line; for each sample, you've also captured whether that sample experienced a failure or not. A target value of zero indicates no failure, and we can see that the vast majority of data points in this dataset are no-failure. Here's an example of a failure: a failure is marked as one (positive), and no failure as zero (negative). This particular one is a power failure, and if you scroll down the dataset you'll see other kinds of failures too: a tool wear failure, an overstrain failure, another power failure, and so on. If you scroll through all 10,000 data points, or if you're familiar with using Excel to filter values in a column, you'll see that in the target variable column the vast majority of values are zero, meaning no failure, and some of the rows have a value of one, with different types of failure recorded for them. We're going to go through the entire machine learning workflow with this dataset, and to see an example of that, we're going to go to the code section here.
Right down here we have what's called a dataset notebook. This is basically a Jupyter notebook; Jupyter is a Python application that lets you create a Python machine learning program that builds your machine learning model, evaluates its accuracy, and generates predictions from it. There's a whole bunch of Jupyter notebooks available here and you can select any one of them; all of these notebooks process the data from this particular dataset. On this code page I've selected a specific notebook that I'll run through to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python ecosystem, and you can get the URL for that notebook here.

Okay, let's quickly recap what we're trying to do: we're building a machine learning classification model. We said there are two primary areas of supervised learning: regression, which is used to predict a numerical target variable, and classification, which is what we're doing here, predicting a categorical target variable. In this particular example, there are actually two ways we can frame the classification: binary or multiclass. For binary classification, we only classify the product or machine as either failed or not failed. If you go back to the dataset I showed you, the target variable column has only two possible values: zero, meaning no failure, and one, meaning failure. Two possible outcomes, so that's binary classification. But we can also extend the same dataset into a multiclass classification problem. If we want to drill down further, we don't just say there was a failure; we distinguish the different types of failure. So we have one class that is no failure, and then a class for each failure type: a power failure, a tool wear failure, an overstrain failure, and so on. Multiple classes of failure in addition to the overall majority class of no failure: that's a multiclass classification problem. With this dataset, we're going to see how to treat it both ways.

Okay, so let's follow the workflow. We've already got the data: let's assume we've somehow managed to collect this dataset from IoT sensors doing real-time monitoring in our production environment, with sensors on the assembly or production line reading everything we have in this CSV file. Data retrieved; now we move on to the cleaning and exploration part of the machine learning life cycle.
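Before diving into the notebook's cleaning code: if you want a first look at the raw CSV yourself, something like this works. The filename is a placeholder for whatever you downloaded from Kaggle, and the column names are as they appear in that CSV (adjust if yours differ):

```python
import pandas as pd

# Load the predictive maintenance dataset downloaded from Kaggle
df = pd.read_csv("predictive_maintenance.csv")   # placeholder filename

print(df.shape)                           # expect 10,000 rows
print(df.dtypes)                          # column names and data types
print(df["Target"].value_counts())        # mostly 0 (no failure), few 1s
print(df["Failure Type"].value_counts())  # breakdown of the failure types
```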
All right, so let's look at the data cleaning part first. Here we're interested in checking for missing values, and maybe removing the rows with missing values. There are several things we can do about missing values: remove the rows entirely, or put in replacement values, such as the average of all the values in that particular column, and so on. We can also try to identify outliers in the dataset, and again there's a variety of ways to deal with those. This is data cleansing, a really important part of your machine learning workflow. That's where we are now; we'll follow up with exploration.

So let's look at the actual notebook code that does the cleansing. We're right at the start of the machine learning life cycle in this Jupyter notebook, which opens with a brief description of the problem statement: this dataset reflects real-life predictive maintenance data encountered in industry, with measurements from real equipment, and the feature descriptions are taken directly from the data source. There are six key features: the type (the quality of the product), the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. And there are two target variables. The first is the target column I showed you just now, which has only two possible values, zero or one, no failure or failure. The second is the failure type column, which, as I demonstrated, records several categories of failure; that's what we'd use for multiclass classification. So we can either build a binary classification model for this problem domain or a multiclass classification model, and this Jupyter notebook demonstrates both approaches.

First step: we write the Python code that imports all the libraries we need, the relevant machine learning libraries for our domain use case. Then we load in the dataset, describe it, and get some quick insights into it. We look at all the feature variables and get a quick overview: how many rows there are, how many columns, the data types of the columns, the names of the columns, and so on. All this Python code just gives us, the data scientists, a quick overview of the dataset. Then we zoom in on the target variables: how many counts there are of each target value, how many different types of failure there are. And then we want to check whether there are any inconsistencies between the target and the failure type columns.
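Here's a sketch of what that consistency check might look like; the notebook's actual code may differ, and the column names are as in the Kaggle CSV:

```python
# Rows flagged as a failure in "Target" but labelled "No Failure"
# in "Failure Type" contradict each other and are worth inspecting.
inconsistent = df[(df["Target"] == 1) & (df["Failure Type"] == "No Failure")]
print(len(inconsistent), "inconsistent rows found")
print(inconsistent.head())
```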
Okay, so when you do this checking, you discover there are indeed some discrepancies in the dataset. The code finds nine values that are classified as failure in the target variable but as no-failure in the failure type variable. That's a discrepancy in the data, because a target value of one is supposed to mean there is a failure, so we expect to see a failure classification, yet some rows say no failure although the target is one. This is a classic example of an error that can very well occur in a dataset, and the question is what to do with these errors. Here the data scientist says it makes sense to remove those instances, and writes some code to drop those rows or data points from the overall dataset. Checking for other issues turns up another warning, and those rows can be removed too. In total, 27 instances get removed from a dataset of 10,000 rows, which is only 0.27% of the entire dataset, and these were the reasons for removing them. If you're removing just 0.27% of the dataset, that's no big deal, but it needed to be done, because those 27 data points with errors could really affect the training of your machine learning model. So that's data cleansing in action: we've cleaned out data that was incorrect or erroneous in the original dataset.

Then we go on to the next part, which is EDA, where we explore the data. We want to get a visual overview of the data as a whole and look at its statistical properties: the statistical distribution of the data in the various columns, the correlation between the feature variables, and the correlation between the feature variables and the target variable. All of this is EDA, and EDA in a machine learning workflow is typically done through visualization. For example, here we look at correlation: we plot the values of the feature variables against each other in a pair plot and look for potential correlations and patterns. The different shapes you see in the pair plot have different statistical meanings, so the data scientist visually inspects it and interprets the patterns. Some of the insights deduced from it: torque and rotational speed are highly correlated, process temperature and air temperature are also highly correlated, and failures tend to occur at extreme values of some features.
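Here's a sketch of how you might produce these EDA plots with seaborn; the exact plots in the notebook may differ, and the column names are as in the Kaggle CSV:

```python
import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# Pair plot: every numeric feature plotted against every other,
# colored by failure (1) vs no failure (0)
sns.pairplot(df, vars=num_cols, hue="Target")
plt.show()

# Correlation heat map of the numeric features
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```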
Another useful chart is the violin plot. For torque and rotational speed, for example, it shows that most failures are triggered at values much lower or much higher than the mean of the non-failing points. All these visualizations are there for a trained data scientist to look at, inspect, and draw insightful deductions from. There's the percentage of failures, the correlation heat map between the feature variables and the target variable, the product types, and the percentage of failures with respect to product type: certain product types have a higher ratio of failure than others; M products tend to fail more than H products, for instance. So you can create a wide variety of visualizations in the EDA stage, and the point of all of them is to give us some preliminary insight into the dataset that helps us model it more correctly.

We can also plot the distributions of the variables, to see whether each is a normal distribution or some other kind, and draw box plots to check for outliers. From the box plots we can see that rotational speed and torque have outliers. We already saw that outliers are a problem you may need to tackle as part of data cleansing, so we check where the potential outliers are and analyze them from the box plots, together with the mean and maximum values with respect to product type. The insight here: about 4.8% of the instances are outliers, which isn't really that much, and they're not horrible outliers, so we can tolerate them and just leave them in the dataset. For a different dataset, the data scientist could come to a different conclusion and would then do whatever is deemed appropriate to cleanse it.

Okay, so now that we've done all the EDA, the next thing is feature engineering. We take our original feature variables and transform them into some other form before feeding them in for training. These are some examples of feature engineering transformations; you don't have to use all of them. In this notebook we do ordinal encoding of the type column, and we scale the dataset using MinMax scaling. And then, finally, we come to the modeling.
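Here's a sketch of those two feature engineering steps with scikit-learn. The encoding order L < M < H is my assumption about the product quality ranking, and the column names are as in the CSV:

```python
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

# Ordinal-encode the product quality type: L -> 0, M -> 1, H -> 2
df[["Type"]] = OrdinalEncoder(categories=[["L", "M", "H"]]).fit_transform(df[["Type"]])

num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# MinMax scaling squeezes every numeric feature into the [0, 1] range
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```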
So we have to split our dataset into a training dataset and a test dataset. Coming back to what we said before: before you train your model, you take your original (now feature-engineered) dataset and break it into two or more subsets. One is the training dataset, used to feed and train the machine learning model; the second is the test dataset, used to evaluate the accuracy of the model. We also need to decide how to sample: which points from the original dataset go into the training dataset, and which go into the test dataset. There are many ways to do sampling. One is stratified sampling, where we ensure that each stratum or class appears in the training and test datasets in the same proportion as in the original dataset, which is very useful for dealing with what's called an imbalanced dataset. And this is exactly one of those: the vast majority of data points in this dataset have a target value of zero, and only an extremely small minority have a target value of one. When the vast majority of target values come from one class and a tiny minority from another, we call the dataset imbalanced, and for an imbalanced dataset we typically use stratified sampling for the train-test split. That's exactly what happens here: a stratified train-test split.

And then we actually develop the models. For classification there's a whole bunch of possibilities; there are many different algorithms we can use to create a classification model, and some of the more common ones are logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, and ensembles. All these different algorithms create different kinds of models with different accuracy measures, and it's the goal of the data scientist to find the model that gives the best accuracy for training on the given dataset. So heading back to our workflow: here we create a whole bunch of models, a random forest, a balanced bagging classifier, a boosting classifier, and an ensemble classifier, and we train each of them on the training data. Then we evaluate how good each of these models is; here you can see the evaluation data, including the confusion matrix, which is another way of evaluating.
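Here's a condensed sketch of the stratified split and model training, assuming `df` has already been cleaned and feature-engineered as above. The identifier column names are assumptions; they're dropped because IDs carry no predictive signal:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

X = df.drop(columns=["UDI", "Product ID", "Target", "Failure Type"])
y = df["Target"]

# Stratified split keeps the failure/no-failure ratio equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Two of the candidate algorithms, trained on the same training data
models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "bagging": BaggingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```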
So now we come to the key part: I've got all these different models, built with different algorithms and trained on the same dataset; how do I distinguish between them? For that we have a whole bunch of common evaluation metrics for classification, which tell us how good a model is in terms of its classification accuracy. You might think accuracy is just accuracy, either it's accurate or it's not, but it's not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. The confusion matrix tells us how many true positives there are (the actual value is positive and the prediction is positive), how many false positives (the actual value is negative but the model predicts positive), how many false negatives (the model predicts negative but the actual value is positive), and how many true negatives (the model predicts negative and the actual value is also negative). That's one way to assess or evaluate the performance of a classification model; this one is for binary classification, and there are multiclass confusion matrices too. Then we can also measure accuracy itself: the true positives plus the true negatives, which is the total number of correct predictions made by the model, divided by the total number of data points. There are also other measures, such as recall and the F1 score (these are their formulas), and something called the ROC curve. Without going too much into the detail of each, these are all different KPIs. Just like employees in a company have KPIs that measure how effective a particular employee is, the KPIs for your machine learning models are the ROC curve, F1 score, recall, accuracy, and the confusion matrix.

So after building my four different models, I check and evaluate them using all these metrics. For this model I can check the ROC score, the F1 score, the precision score, and the recall score; then for the next model the same set, and so on. For every single model created from my training dataset, I have a full set of evaluation metrics, plus the confusion matrix, which I can use to compare all four models, and then I summarize everything. From the summary we can see the top two models, which I'm now going to focus on as the data scientist: the bagging classifier and the random forest classifier. They have the highest F1 scores and the highest ROC AUC scores, so we can say these are the top two models in terms of accuracy by those two evaluation metrics.
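Continuing the sketch above, here's how those metrics could be computed with scikit-learn for one of the models:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = models["random_forest"].predict(X_test)
y_prob = models["random_forest"].predict_proba(X_test)[:, 1]  # P(failure)

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, y_pred))
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```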
These results get summarized, and then we try different sampling techniques. The idea of trying different sampling techniques is to get a feel for different distributions of the data in different areas of your dataset, to make sure the accuracy evaluation is statistically sound. In particular, we can do what's called oversampling and undersampling, which is very useful when you're working with an imbalanced dataset. Here's an example of doing that, and then we check the results for each technique: the F1 score and the ROC AUC score, the two key measures of accuracy, for each approach. What we see is that overall the models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC AUC score, but its F1 score was too low; and in the data scientist's opinion, the random forest with this particular sampling technique strikes an equilibrium between the F1 and ROC AUC scores. The takeaway is that the macro F1 score improves dramatically using these sampling techniques, so these models might be better compared to the balanced ones. Based on all this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons.

So we keep refining the evaluation work. We train the models one more time, again doing a train-test split, and then print out what's called a classification report, which is basically a summary of all the metrics I just talked about: the confusion matrix, accuracy, precision, recall, and ROC AUC score. With the classification report I can see all those values for a particular model, bagging with Tomek links undersampling; then for another model, random forest with Borderline-SMOTE oversampling; and then for another, the balanced bagging. Again, a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us. Then we generate confusion matrices, for the bagging with Tomek links undersampling, for the random forest with Borderline-SMOTE oversampling, and for plain balanced bagging, compare the three models on those, and come to some conclusions.
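Those resampling techniques come from the imbalanced-learn library. Here's a hedged sketch of both directions, assuming `X_train` and `y_train` from the earlier split:

```python
# pip install imbalanced-learn
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import TomekLinks

# Oversampling: synthesize extra minority-class (failure) points
# near the class boundary with Borderline-SMOTE
X_over, y_over = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)

# Undersampling: remove majority-class points that form Tomek links
# (ambiguous nearest-neighbor pairs straddling the class boundary)
X_under, y_under = TomekLinks().fit_resample(X_train, y_train)

print("original class counts:    ", y_train.value_counts().to_dict())
print("oversampled class counts: ", y_over.value_counts().to_dict())
print("undersampled class counts:", y_under.value_counts().to_dict())
```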
This one is a curve, and you look at the area underneath it; this is called the AUC ROC, the area under the ROC curve. The AUC score gives us some idea about the threshold we should use for classification, so we can examine this for the bagging classifier, the random forest classifier, and the balanced bagging classifier, and finally check the classification report of each model. So we keep doing this over and over, evaluating the accuracy and evaluation metrics for all these different models at different classification thresholds, and as we keep drilling in, we get a better and better understanding of which model gives the best performance for our dataset.

Finally, we come to a conclusion: this particular model is not able to push the recall on failures beyond 95.18%, while balanced bagging with a decision threshold of 0.6 achieves a better recall, and so on. So after all of these evaluations, this is the conclusion. At this point we have gone through all the steps of the machine learning life cycle: the data scientist has done the cleaning, exploration, preparation, transformation, and feature engineering, developed and trained multiple models, and evaluated all of them, so the validation stage is complete and the data scientist's job is essentially done. We've arrived at some very useful conclusions that we can now share with our colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision will be made based on the recommendations the data scientist delivers at the end of this phase.

So the conclusions are as follows. If the engineering team is looking for the highest failure detection rate possible, they should go with this particular model. If they want a balance between precision and recall, they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold. But if they don't care so much about predicting every single failure and want the highest precision possible, they should opt for the bagging Tomek links classifier with a somewhat higher decision threshold. This is the key deliverable from the data scientist, the end result of the entire machine learning life cycle. The data scientist tells the engineering team: you decide which matters most to you, option A, option B, or option C. The engineering team will then discuss among themselves and say, hey, you know what?
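Here is a small sketch of that threshold-sweeping idea, continuing the hypothetical names (bb, X_test, y_test) from the previous sketches; the actual notebook may do this differently:

```python
# Sketch of decision-threshold tuning: instead of the default 0.5 cutoff,
# sweep the threshold over the predicted failure probabilities and watch
# how precision and recall trade off.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_prob = bb.predict_proba(X_test)[:, 1]  # P(failure) for each test sample
print("ROC AUC:", roc_auc_score(y_test, y_prob))

for threshold in np.arange(0.3, 0.8, 0.1):
    y_pred = (y_prob >= threshold).astype(int)  # classify "failure" above cutoff
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, y_pred, zero_division=0):.3f}  "
          f"recall={recall_score(y_test, y_pred):.3f}")
```

Lowering the threshold catches more failures (higher recall) at the cost of more false alarms (lower precision), which is exactly the trade-off the engineering team has to weigh in the next step.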
What we want is the highest failure detection rate possible, because any failure of that machine, the product, or the assembly line is really going to hurt us badly. So we're looking for the model that gives us the highest failure detection rate; we don't care as much about precision, but we want to be sure that if there's a failure, we catch it. That's what they want, so the data scientist says: go for the balanced bagging model. The data scientist then saves that model, and once it's saved, you can go right ahead and deploy it to production.

Now, if we want, we can take this modeling problem further. Just now, I modeled it as a binary classification problem, meaning the output is either zero or one, fail or not fail. But we can also model it as a multiclass classification problem, because, as I said earlier, the failure type column actually contains multiple kinds of failures: for example, a power failure, a tool wear failure, or an overstrain failure. So we model the problem slightly differently, as a multiclass classification problem, and go through the same end-to-end process again: we create different models and test them out, except that now the confusion matrix is for a multiclass classification problem. Again, we try different algorithms and models and do the train-test split for each of them. For example, we have a balanced random forest and a balanced random forest with grid search, where we train the models using what is called hyperparameter tuning, and then we get the same evaluation scores again. We check the scores, compare them, generate a confusion matrix, in this case a multiclass confusion matrix, and come to a final conclusion. So if you are interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist, who will say: I'm going to pick this particular model, the balanced bagging classifier, and here is the rationale for selecting it. Once that's done, you save the model, and that's it.

Now the machine learning model can go live and run on the server, where it's ready to do its main job, which is to generate predictions. You have picked the best model, with the best evaluation metrics for whatever accuracy goal you're trying to achieve. You run it on a server, take all the real-time data coming in from your sensors, pump it into the model, and the model pumps out predictions, which we use in real time for real-world decision making.
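The presentation doesn't show the exact persistence code, but a common pattern for saving and reloading a scikit-learn model is joblib, so here is a sketch of what that save-and-deploy step might look like; the file name, decision threshold, and sensor reading below are made up for illustration:

```python
# Sketch of the save-and-deploy step, assuming the common joblib pattern.
# 'bb' is the chosen balanced bagging model from the earlier sketches.
import joblib
import numpy as np

joblib.dump(bb, "balanced_bagging_model.joblib")  # done by the data scientist

# Later, on the production server:
model = joblib.load("balanced_bagging_model.joblib")

# Stand-in for one incoming row of live sensor features (20 features,
# matching the synthetic training data above).
incoming = np.random.default_rng(0).normal(size=(1, 20))

prob_failure = model.predict_proba(incoming)[0, 1]
if prob_failure >= 0.6:  # the decision threshold chosen during evaluation
    print(f"Failure risk {prob_failure:.2f} -- schedule maintenance now")
```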
You're going to say: I'm predicting that this machine is going to fail on Thursday at 5:00 p.m., so you'd better get your service folks in to service it by Thursday at 2:00 p.m., or whatever the case may be. So you can decide when you want to do your maintenance, and make the best decisions to optimize the cost of maintenance. And then, based on the results coming out of the model, the predictions may be good, lousy, or average, so we constantly monitor how useful the predictions generated by this real-time model running on the server actually are. Based on that monitoring, we take in new data and repeat this entire life cycle again. So this workflow is iterative: the data scientist is constantly taking in new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. And that's it; that is your machine learning workflow in a nutshell.

For this particular project, we used a number of Python data science libraries. We used Pandas, the most fundamental data science library, which provides all the tools for working with raw data. We used NumPy, a high-performance library for complex array and matrix operations. We used Matplotlib and Seaborn, which are used for the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data. And we used scikit-learn, the machine learning library that implements all the core machine learning algorithms. We did not use the deep learning libraries here, because this is not a deep learning problem, but if you are working on a deep learning problem such as image classification, image recognition, object detection, or natural language processing and text classification, then you would use TensorFlow or PyTorch. And lastly, the entire data science project you just saw, all the Python code along with the data scientist's observations, was developed and run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects.

That brings me to the end of this presentation. I hope you found it useful, and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. Thank you all so much for watching.