Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI and machine learning, which is predictive maintenance. I will be covering these topics, and feel free to jump forward to the specific part in the video where I talk about each of them. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance problem. All right, so without any further ado, let's jump into it. Let's start off with a quick overview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention. The primary purpose of these algorithms is to optimize performance in a specific task, and the task we usually want to optimize for is making accurate predictions about future outcomes based on the analysis of historical data from the past. So essentially, machine learning is about making predictions about the future, or what we call predictive analytics.
And there are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of these algorithms and their use cases in various areas of industry; different algorithms are suited to different use cases. Deep learning is an advanced form of machine learning that's based on something called an artificial neural network, or ANN for short, which essentially simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools you have probably heard of today. I'm sure you have heard of ChatGPT, unless you've been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and that's based on this technology called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, use this particular form of machine learning called deep learning. So this is an example of an artificial neural network. Here I have an image of a bird that's fed into this artificial neural network, and the output is a classification of this image into one of these three potential categories.
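To make the idea concrete, here is a minimal sketch of a neural network classifier using scikit-learn's small `MLPClassifier`. This is not from the presentation's slides: real image classification uses much deeper networks and frameworks like TensorFlow or PyTorch, and the data here is synthetic, standing in for image features purely for illustration.

```python
# Toy illustration of an ANN learning to map inputs to one of 3 classes.
# Synthetic data stands in for real image features (an assumption for
# illustration only; real vision models are far larger).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# 150 made-up samples with 8 features each, spread over 3 classes.
X, y = make_classification(
    n_samples=150, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# A small network with a single hidden layer of 16 neurons.
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
ann.fit(X, y)

# The trained network assigns one of the three classes to an input.
predicted_class = int(ann.predict(X[:1])[0])
```

The same fit-then-predict shape carries over to the much larger networks used for actual image classification.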
So in this case, if the ANN has been trained properly, when we feed in this image, the ANN should correctly classify it as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision. And just like in machine learning generally, there is a variety of algorithms available for deep learning, under the categories of supervised learning and unsupervised learning. So this is how we can categorize all of this: you can think of AI as the general area of smart systems; machine learning is basically applied AI; deep learning is a subspecialization of machine learning using a particular architecture called an artificial neural network; and generative AI, so ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI, they are basically large language models, a further subcategory within the area of deep learning. There are many applications of machine learning in industry right now, so pick whichever industry you're involved in and these are the specific areas of application. I'm going to guess the vast majority of you watching this video are coming from the manufacturing industry, and in manufacturing some of the standard use cases for machine learning and deep learning are predicting potential problems (sometimes called predictive maintenance, where you want to predict when a problem is going to happen and address it before it happens), monitoring systems, automating your manufacturing assembly or production line, smart scheduling, and detecting anomalies on your production line. So let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: it is an equipment maintenance strategy
that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. This uses advanced data models, analytics, and machine learning, whereby we can reliably assess when failures are more likely to occur, including which components are more likely to be affected on your production or assembly line. So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that factories and their production or assembly lines tended to handle maintenance issues, say, ten or twenty years ago. What you would probably start off with is the most basic mode, which is reactive maintenance: you just wait until your machine breaks down, and then you repair it. The simplest approach, but of course, if you've worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline; then you're going to have a backlog of orders and run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem then is that sometimes you're doing too much maintenance that isn't really necessary, and it still doesn't totally prevent failures of the machine that occur outside of your planned maintenance. So, a bit of improvement, but not that much better. These last two categories are where we bring in AI and machine learning. With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us to predict, with a
reasonable level of accuracy, when the next failure is going to happen on your assembly or production line, on a specific component or specific machine. You want to be able to predict, to a high level of accuracy, maybe down to the specific day, even the specific hour or minute, when you expect a particular product or machine to fail. So these are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time spent on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data you're getting. So, we're going to take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them you'll be able to look at some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases: waste management, manufacturing, building services, renewable energy, and also mining. You can click on these links and follow up with them if you want to read more. And this other website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance. It tells you about an industry survey of predictive maintenance, where we can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, that predictive maintenance is essential for the manufacturing industry, and
will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: we can see that the vast majority of key industry players in the manufacturing sector consider predictive maintenance to be a very important activity that they want to incorporate into their workflow. And we can see here the kind of ROI that's expected on an investment in predictive maintenance: 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, 30% reduction in maintenance cost. And best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, and others. You can jump over here and take a look at some of these use cases. Let me try and open one up, for example Mondi. You can see Mondi has used MATLAB, the software from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning, and you can study how they have used it and how it works: what their challenge was, the problems they were facing, and the solution they built with MathWorks Consulting, using data they collected in an Oracle database. Using MATLAB from MathWorks, they were able to create a deep learning model to solve this particular issue for their domain. So if you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases for predictive maintenance. In this next topic, I'm going to talk about machine
learning basics: what is actually involved in machine learning. I'm going to give a very quick, high-level conceptual overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category, which is called supervised learning. The particular use case I'll be discussing here, predictive maintenance, is basically a form of supervised learning. So how does supervised learning work? In supervised learning, you create a machine learning model by providing what is called a labeled data set as input to a machine learning program or algorithm. This data set contains what are called independent or feature variables (a set of variables), and one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's jump into a diagram that illustrates this more clearly. Let's say you have a data set here, an Excel spreadsheet, and this spreadsheet has a bunch of columns and a bunch of rows. These rows represent what we call observations, or samples, or data points, in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall, so they've got all this information about the customers who purchase products at this mall. Some
of the information they've got about the customers: their gender, their age, their income, and their number of children. All this information about the customers, we call the independent or feature variables. And alongside all this information, we also record how much each customer spends. These numbers here, we call the target variable, or the dependent variable. So a single row, one single data point, contains all the values for the feature variables and one single value for the label, or target variable. The primary purpose of the machine learning model is to create a mapping from your feature variables to your target variable: there's going to be a mathematical function that maps the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. This whole training process, we call fitting the model, and the target variable or label, this column of values here, is critical for providing the context for fitting, or training, the model. Once you've got a trained and fitted model, you can then use it to make an accurate prediction of target values corresponding to new feature values that the model has yet to encounter. And this, as I've said earlier, is called predictive analytics. So let's see what's actually happening here. You take your training data, this whole data set consisting of a thousand, or ten thousand, rows of data, and you feed the entire thing into your machine learning algorithm, and a couple of hours later your machine learning
algorithm comes out with a model, and the model is essentially a function that maps your feature variables (these four columns) to your target variable (this one single column). Once you have the model, you can put in a new data point. The new data point represents a new customer that you have never seen before. Let's say you've already got information about 10,000 customers that have visited this mall, and how much each of them spent. Now a totally new customer comes into the mall, someone who has never been here before, and what we know about this customer is: male, age 50, income 18, nine children. When you take this data and pump it into your model, your model is going to make a prediction. It's going to say: hey, based on everything I've been trained on and the model I've developed, I predict that a male customer of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want: that is the final output of your machine learning model. It makes a prediction about something it has never seen before. That is the core of machine learning and predictive analytics: making predictions about the future based on a historical data set. Now, there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable, or class label. For classification, you can have either binary or multiclass. Binary means just two classes, for example true or false, zero or one.
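Before going further, the fit-and-predict flow of the mall example can be sketched in a few lines of Python. The column names and all the numbers below are invented for illustration (the presentation doesn't provide the actual spreadsheet), and linear regression is just one of many algorithms that could play the role of "the model".

```python
# Sketch of the supervised-learning flow: train on labeled historical
# data, then predict the target for a customer never seen before.
# Data values are hypothetical stand-ins for the mall spreadsheet.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "Gender":   [0, 1, 0, 1, 0, 1],        # 0 = female, 1 = male (pre-encoded)
    "Age":      [23, 50, 31, 44, 60, 35],
    "Income":   [40, 18, 70, 55, 30, 90],
    "Children": [0, 9, 1, 2, 3, 0],
    "Spending": [120, 25, 200, 150, 60, 310],  # the label / target variable
})

X = data[["Gender", "Age", "Income", "Children"]]  # feature variables
y = data["Spending"]                               # target variable

model = LinearRegression().fit(X, y)  # "fitting" / training the model

# A brand-new data point: male, age 50, income 18, nine children.
new_customer = pd.DataFrame([[1, 50, 18, 9]], columns=X.columns)
prediction = float(model.predict(new_customer)[0])
```

The number that comes back plays the role of the "25 ringgit" prediction in the narration; with six made-up rows it is of course not a meaningful forecast.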
For example: is your machine going to fail, or is it not going to fail? Just two classes, two possible outcomes. Or: is the customer going to make a purchase, or not? We call this binary classification. Then there's multiclass, when there are more than two classes or types of values. So for example, here is a classification problem: you have a data set with information about your customers, the gender of the customer, the age of the customer, the salary of the customer, and you also have a record of whether the customer made a purchase or not. You can take this data set to train a classification model, and the classification model can then make a prediction about a new customer: it will predict zero, which means the customer didn't make a purchase, or one, which means the customer made a purchase. And this here is regression. Let's say you want to predict the wind speed, and you've got historical data on four other independent or feature variables: you have recorded the temperature, the pressure, the relative humidity, and the wind direction for the past 10 or 15 days. Now you train your machine learning model using this data set, and the target variable column, the label, is a number, so this is a regression model. Now you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will predict the wind speed for that new data point. So that's a regression model. All right, so in this next topic I'm going to talk about the workflow involved in machine learning. In the previous slides I talked about developing the model, but that's just one part of the entire workflow.
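Before moving on to the workflow, the binary-classification example just described can be sketched concretely. As before, the data is invented (real data sets are far larger), and a decision tree is just one possible choice of classifier; the point is that the output is a class label, 0 or 1, rather than a number on a continuous scale.

```python
# Minimal binary-classification sketch of the "will the customer
# purchase?" example. All values are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Gender":    [0, 1, 0, 1, 0, 1, 1, 0],
    "Age":       [22, 45, 33, 51, 27, 38, 60, 41],
    "Salary":    [25, 80, 45, 95, 30, 60, 85, 50],
    "Purchased": [0, 1, 0, 1, 0, 1, 1, 0],   # target: 1 = made a purchase
})

clf = DecisionTreeClassifier(random_state=0)
clf.fit(df[["Gender", "Age", "Salary"]], df["Purchased"])

# Predict for a new customer the model has never seen.
new_customer = pd.DataFrame([[1, 48, 85]], columns=["Gender", "Age", "Salary"])
label = int(clf.predict(new_customer)[0])   # 0 = no purchase, 1 = purchase
```

Swapping the classifier for a regressor and the 0/1 column for a numeric one gives the wind-speed regression setup described above.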
In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is that you need to get your data, then you need to clean your data, and then you need to explore your data: you need to see what's going on in your data set. Real-life data sets are not trivial; they are hundreds, thousands, sometimes millions or billions of rows. We're talking about millions or billions of data points, especially if you're using IoT sensors to get data in real time. So you've got these super-large data sets, and you need to clean them and explore them, and then prepare them into the right format so you can put them into the training process to create your machine learning model. Then, subsequently, you check how good the model is: how accurate is it in terms of its ability to generate predictions for the future? That's validating, or evaluating, your model. Then, if you determine that your model is of adequate accuracy to meet your domain use case requirements (say the required accuracy for your use case is 85%, and your model can give an 85% accuracy rate, so it's good enough), you deploy it into a real-world use case. Here the machine learning model gets deployed on a server, data is captured from various sources and pumped into the model, the model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other scenario. Then you constantly monitor and update the model: you get new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell. Here's another example of the same thing, maybe in a slightly
different format. Again, you have your data collection and preparation. Here we talk more about the different kinds of algorithms available to create a model, and I'll discuss this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. Once you have chosen the appropriate algorithm, you train your model, and you select the best of the trained models: you are probably going to develop multiple models from multiple algorithms, evaluate them all, and then say, hey, after I've evaluated and tested everything and chosen the best model, I'm going to deploy it. This is for real-life production use: real-life sensor data is pumped into my model, my model generates predictions, the predicted data is used immediately, in real time, for real-life decision making, and then I monitor the results. Somebody is using the predictions from my model; if the predictions are lousy, the monitoring system captures that, and if the predictions are fantastic, that's also captured by the monitoring system, and that gets fed back into the next cycle of my machine learning pipeline. So that's the overall view, and here are the key phases of the workflow. One of the important phases is called EDA, exploratory data analysis, and in this phase you do a lot of work primarily just to understand your data set. Like I said, real-life data sets tend to be very complex, and they have various statistical properties; statistics is a very important component of machine learning. EDA helps you get an overview of your data set and of any problems in it, like missing data, the statistical properties of your data set, the
distribution of your data set, the statistical correlation of variables in your data set, et cetera. Then we have data cleaning, or sometimes data cleansing, and in this phase you primarily want to do things like removing duplicate records or rows in your table, making sure your data points or samples have appropriate IDs, and, most importantly, making sure there are not too many missing values in your data set. What I mean by missing values is this: you have a data set, and for some reason there are some cells or locations in your data set where values are missing. If you have a lot of these missing values, you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set, and how to handle them. Another important thing in data cleansing is figuring out the outliers in your data set. Outliers are data points that are very far from the general trend of data points in your data set. There are several ways to detect outliers and several ways to handle them, and similarly there are several ways to handle missing values. Handling missing values and handling outliers are really the two key concerns of data cleansing, and there are many, many techniques for this, so a data scientist needs to be acquainted with all of them. Why do I need to do data cleansing? Well, here is the key point.
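A tiny sketch of those two cleaning tasks, with invented numbers: filling a missing value with the median is one of several strategies (dropping the row is another), and the IQR rule is likewise just one of several ways to detect outliers.

```python
# Data cleaning sketch: handle a missing value, then flag outliers.
# The readings are hypothetical sensor values for illustration.
import pandas as pd

readings = pd.Series([10.1, 12.3, 11.0, 13.2, 200.0, 12.8, None])

# Missing values: fill with the median of the observed values.
filled = readings.fillna(readings.median())

# Outliers: flag anything beyond 1.5 * IQR outside the quartiles.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
outliers = filled[(filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)]
print(outliers)  # the 200.0 reading sits far from the general trend
```

On a real data set you would apply the same logic per column, and decide per variable whether to fill, drop, cap, or keep the flagged values.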
If you have a very poor-quality data set, meaning you've got a lot of outliers that are errors, or a lot of missing values, then even though you've got a fantastic algorithm and a fantastic model, the predictions your model gives will be absolute rubbish. It's like taking water and putting it into the tank of a Mercedes-Benz: a Mercedes-Benz is a great car, but if you put water into it, it will just die; a car can't run on water. On the other hand, a Myvi is just an ordinary little car, but if you put good high-octane petrol into it, it will go at a hundred miles an hour and completely destroy that Mercedes-Benz in terms of performance. So it doesn't really matter which model you're using: you can be using the most fantastic model, the Mercedes-Benz of machine learning, but if your data is lousy quality, your predictions are also going to be rubbish. Cleansing the data set is in fact probably the most important thing data scientists do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithm and so on is really a small portion of the actual machine learning workflow; the vast majority of the time goes on cleaning and organizing your data. Then you have something called feature engineering, where you pre-process the feature variables of your original data set prior to using them to train the model, through addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you need to transform categorical data into numeric data. Now, in the earlier slides I showed you that you take your original data set, pump it into an algorithm, and a couple of hours later you get a machine learning model, without doing anything to the feature variables in your data set
before you pump it into the machine learning algorithm. What I showed you earlier was taking the data set exactly as it is and pumping it into the algorithm, and a couple of hours later you get the model. But that's not what generally happens in real life. In real life, you take all the original feature variables from your data set and transform them in some way. You can see here these are the columns of data from my original data set, and before I actually feed these data points into my algorithm to train and get my model, I will transform them. This transformation of the feature variable values is what we call feature engineering, and there are many, many techniques for it: one-hot encoding, scaling, log transformation, discretization, date extraction, Boolean logic, et cetera. Then, finally, we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set, and the other is called the test data set. The primary purpose of this is that when we fit and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you are not going to have just one possible model; there is a vast range of algorithms you can use to create a model, so fundamentally you have a wide range of choices. It's like buying a car: you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a BMW; there are many, many different cars available to you if you want to buy a car. Same thing with a machine learning model: there is a vast variety of algorithms you can choose from.
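Two of the feature-engineering techniques just listed can be sketched quickly. The column names below echo the style of the demo data set (a categorical "Type" and a numeric "Torque"), but the values are invented; one-hot encoding and min-max scaling are shown, with the other techniques following the same transform-before-training pattern.

```python
# Feature engineering sketch: one-hot encode a categorical column and
# min-max scale a numeric one. Values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "Type":   ["L", "M", "H", "L"],       # categorical quality type
    "Torque": [40.2, 55.1, 38.9, 61.3],   # numeric feature
})

# One-hot encoding: "Type" becomes 0/1 indicator columns, which is
# required by models that only accept numeric inputs.
encoded = pd.get_dummies(df, columns=["Type"])

# Min-max scaling: squeeze "Torque" into the range [0, 1].
t = encoded["Torque"]
encoded["Torque"] = (t - t.min()) / (t.max() - t.min())
```

After transformations like these, the encoded frame, not the raw one, is what gets fed to the training algorithm.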
Once you create a model from a given algorithm, you need to ask: how accurate is this model I've created from this algorithm? Different algorithms are going to create different models with different rates of accuracy, and the primary purpose of the test data set is to evaluate the accuracy of a model, to see whether the model created using this algorithm is adequate for use in a real-life production use case. So that's what it's all about. This is my original data set: I break it into my feature variable columns and my target variable column, and then I further break it into a training data set and a test data set. The training data set is used to train, or create, the machine learning model, and once the model is created, I use the test data set to evaluate its accuracy. And then, finally, we can see the different parts that go into a successful model: EDA, about 10%; data cleansing, about 20%; feature engineering, about 25%; selecting a specific algorithm, about 10%; training the model from that algorithm, about 15%; and finally evaluating the models and deciding which is the best one, with the highest accuracy rate, about 20%. All right, so we have reached the most interesting part of this presentation, which is the demonstration of an end-to-end machine learning workflow on a real-life data set for the use case of predictive maintenance. For this use case I've used a data set from Kaggle. For those of you who are not aware, Kaggle is the world's largest open-source community for data science and AI. They have a large collection of data sets from all areas of industry and human endeavor, and they also have a large collection of models that have been developed using these data sets.
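The train-test split and evaluation step described a moment ago looks like this in code. Synthetic data stands in for a real labeled data set here (the demo itself uses the Kaggle data), and a decision tree stands in for whichever algorithm you are evaluating.

```python
# Sketch of train-test split and accuracy evaluation. Synthetic data
# is a stand-in for a real labeled data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=0)

# Hold out 20% of the rows as the test set; train only on the other 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate on data the model never saw during training.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Repeating this with several algorithms and comparing the test accuracies is exactly the "evaluate them all and choose the best model" step from the workflow.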
So here we have a data set for our particular use case, predictive maintenance. This is some information about the data set, and in case you don't know how to get there, this is the URL to click on. Once you're at the page for the data set, you can see all the information about it, and you can download the data set in CSV format. So let's take a look at the data set. It has a total of 10,000 samples, and these are the feature variables: the type, the product ID, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. And this is the target variable. The target variable is what we are interested in: what we use to train the machine learning model, and also what we want to predict. The feature variables describe, or provide information about, a particular machine on the production or assembly line: the product ID, the type, the air temperature, the process temperature, the rotational speed, the torque, the tool wear. So let's say you've got an IoT sensor system that's capturing all this data about a product or machine on your production or assembly line, and you've also captured, for each specific sample, whether that sample experienced a failure or not. A target value of zero indicates no failure, and we can see that the vast majority of data points in this data set are no-failure cases. And here we can see an example of a failure: a failure is marked as a one (positive) and no failure is marked as a zero (negative). Here we have one type of failure, called a power failure, and if you scroll down the data set you'll see there are also other kinds of failures, like a tool wear failure, an overstrain failure here, for example, and another power
failure again, and so on. So if you scroll down through these 10,000 data points, or if you're familiar with using Excel to filter values in a column, you can see that in this particular column, the so-called target variable column, the vast majority of values are zero, which means no failure. Some of the rows, or data points, have a value of one, and for those rows with a value of one you'll see different types of failure: like I said just now, power failure, tool wear failure, etc. So we're going to go through the entire machine learning workflow with this data set. To see an example of that, let's go to the Code section here. If I click on the Code section, right down here we have notebooks for this data set. These are Jupyter notebooks; Jupyter is a Python application that allows you to create a Python machine learning program that builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it. So here we have a whole bunch of Jupyter notebooks available, and you can select any one of them; all these notebooks process the data from this particular data set. On this Code page I've actually selected a specific notebook that I'm going to run through to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python programming language. The particular notebook I'm going to use is this one here, and you can also get the URL for that notebook from here. So let's quickly revise what we're trying to do: we're trying to build a machine learning classification model. We said there are two primary kinds of supervised learning. One is
regression, which is used to predict a numerical target variable; the second kind of supervised learning is classification, which is what we're doing here: we're trying to predict a categorical target variable. In this particular example we actually have two ways we can classify: either a binary classification or a multiclass classification. For binary classification, we only classify the product or machine as either it failed or it did not fail. So if we go back to the data set I showed you just now and look at this target variable column, there are only two possible values: either zero or one. Zero means there's no failure, one means there's a failure. So this is an example of binary classification: only two possible outcomes, zero or one, didn't fail or failed. And then, for the same data set, we can extend it and make it a multiclass classification problem. If we want to drill down further, we can say that not only is there a failure, there are actually different types of failure. So we have one class that is basically no failure, and then we have a class for each of the different types of failure: you can have a power failure, you could have a tool wear failure, you could have, let's go down here, an overstrain failure, and so on. So you can have multiple classes of failure in addition to the overall majority class of no failure, and that would be a multiclass classification problem. With this data set we're going to see how to frame it both as a binary classification problem and as a multiclass classification problem. Okay, so let's look at the workflow. Let's say we've already got the data; right now we do have this data set. So let's assume we've somehow managed to get it from some IoT sensors that are monitoring real-time data in our production environment on
the assembly line. We've got sensors reading data that give us everything we have in this CSV file. Okay, so we've already got the data, we've retrieved the data; now we move on to the cleaning and exploration part of the machine learning life cycle. Let's look at data cleaning first. In data cleaning we're interested in checking for missing values, and maybe removing the rows with missing values. The kinds of things we can do about missing values: we can remove the rows that have them, or we can put in some replacement values, which could be the average of all the values in that particular column, etc. We also try to identify outliers in our data set, and there are a variety of ways to deal with those as well. So this is called data cleansing, which is a really important part of your machine learning workflow. That's where we are now: we're doing cleansing, and then we'll follow up with exploration. So let's look at the actual code that does the cleansing. Here we are right at the start of the machine learning life cycle, in the Jupyter notebook, and here we have a brief description of the problem statement: this data set reflects real-life predictive maintenance as encountered in industry, with measurements from real equipment; the feature descriptions are taken directly from the data source. So here we have a description of the six key features in our data set: the type, which is the quality of the product, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. These are the six feature variables, and then there are the two target variables. As I showed you just now, there's one target variable which only has two possible values, either zero or one, meaning failure or no failure; that would be this column here. So let me go all the way back up to here. So
this column here, as we already saw, only has two possible values, either zero or one. And then we also have this other column here, which is the failure type. As I already demonstrated just now, we have several categories, or types, of failure, and that's what we use for multiclass classification. So we can either build a binary classification model for this problem domain, or we can build a multiclass classification model, and this Jupyter notebook is going to demonstrate both approaches. First step: we write the Python code that imports all the libraries we need. So this is basically Python code importing the relevant machine learning libraries related to our domain use case. Then we load in our data set, we describe it, we get some quick insights into it, and we take a look at all the feature variables and so on. What we're doing now is just a quick overview of the data set: all this Python code allows us, the data scientists, to get a quick overview of our data, like how many rows there are, how many columns there are, what the data types of the columns are, what the names of the columns are, etc. Then we zoom in on the target variables: we look at how many counts there are of each target value, how many different types of failure there are, and so on. Then we want to check whether there are any inconsistencies between the Target and the Failure Type columns. When you do all this checking, you're going to discover there are some discrepancies in your data set. So using specific Python code to do the checking, you're going to say: hey, you know what, there are some errors here. There are nine values that are classified as failure in the Target
variable but as no failure in the Failure Type variable, which means there's a discrepancy in those data points. These are all the rows with discrepancies: the target variable says one, and we already know that a target value of one is supposed to mean there's a failure, so we expect to see a failure classification; but some rows actually say there's no failure although the target is one. This is a classic example of an error that can very well occur in a data set. So now the question is: what do you do with these errors? Here the data scientist says, I think it would make sense to remove those instances, and so they write some code to remove those instances, those rows or data points, from the overall data set. In the same way we can check for other issues; we find there's another issue with our data set, another warning, so again we can remove those rows. In total you remove 27 instances, or rows, from your overall data set. Your data set has 10,000 rows, and you're removing 27, which is only 0.27% of the entire data set, and these were the reasons why you removed them. So if you're just removing 0.27% of the entire data set, no big deal, right? Still okay. But you needed to remove them, because these 27 erroneous data points could really affect the training of your machine learning model. So we need to do our data cleansing: we're now cleansing data that is incorrect or erroneous in the original data set. Then we go on to the next part, which is called EDA, exploratory data analysis. EDA is where we explore our data: we want to get a visual overview of the data as a whole, and also take a look at the statistical properties of the data, the statistical distribution of the data in all the various
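The consistency check and clean-up just described can be sketched like this. The column names and rows are illustrative assumptions, not the notebook's actual code:

```python
import pandas as pd

# Mock rows reproducing the kind of discrepancy described above:
# Target == 1 (failure) should always come with a concrete failure type.
df = pd.DataFrame({
    "Target":       [0, 1, 1, 0, 1],
    "Failure Type": ["No Failure", "Power Failure", "No Failure",
                     "No Failure", "Tool Wear Failure"],
})

# Rows flagged as a failure but labelled "No Failure" are contradictory.
bad = (df["Target"] == 1) & (df["Failure Type"] == "No Failure")
print(f"Removing {bad.sum()} inconsistent row(s), "
      f"{bad.sum() / len(df):.2%} of this mock data set")

# Drop the contradictory rows, as the notebook's author chose to do.
df_clean = df[~bad].reset_index(drop=True)
```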
columns, the correlation between the feature variables in the different columns, and also between the feature variables and the target variable. All of this is called EDA, and EDA in a machine learning workflow is typically done through visualization. So let's go back and take a look. For example, here we're looking at correlation: we plot the values of all the various feature variables against each other and look for potential correlations and patterns. All the different shapes you see here in this pair plot have different statistical meanings, so the data scientist has to visually inspect the pair plot and interpret the different patterns they see. These are some of the insights that can be deduced from these patterns: for example, the torque and rotational speed are highly correlated, the process temperature and air temperature are highly correlated, and failures occur at extreme values of some features, etc. Then you can plot certain other kinds of charts, like this one called a violin plot, to get new insights: for example, regarding torque and rotational speed, we can see again that most failures are triggered at values much lower or much higher than the mean of the non-failing cases. So all these visualizations are there, and a trained data scientist can look at them, inspect them, and make insightful deductions from them. We have the percentage of failures; the correlation heat map between all the different feature variables and the target variable; and the product types, the percentage of each product type, and the percentage of failures with respect to product type, which we can visualize as well. So certain product types have a higher ratio of failure compared to others; for example, type M products tend to fail more than type H products. So we can create a vast variety of visualizations in the EDA stage. So you
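As a small numeric illustration of the correlation part of EDA (synthetic numbers, not the real sensor data; the negative speed-torque relationship is built in on purpose to mimic the pattern the pair plot reveals):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for two sensor readings: torque is constructed to
# be negatively correlated with rotational speed, mimicking the
# relationship visible in the notebook's pair plot.
speed = rng.normal(1500, 100, 500)
torque = 60 - 0.02 * speed + rng.normal(0, 1, 500)
df = pd.DataFrame({"Rotational speed [rpm]": speed, "Torque [Nm]": torque})

# The correlation matrix is the numeric counterpart of the heat map;
# with seaborn installed you could render it via sns.heatmap(df.corr())
# or explore every pairwise relationship with sns.pairplot(df).
corr = df.corr()
print(corr.round(2))
```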
can see here, again, that the idea of this visualization is to give us some preliminary insight into our data set that helps us model it more correctly. We get some more insights from all this visualization. Then we can plot the distributions, to see whether a feature follows a normal distribution or some other kind of distribution, and we can use box plots to see whether there are any outliers in the data set. From the box plots we can see that some features, like rotational speed, have outliers. We already saw that outliers are a problem you may need to tackle; dealing with them is part of data cleansing. So we check where the potential outliers are; we can analyze them from the box plot. But then we might say, well, they are outliers, but maybe they're not really horrible outliers, so we can tolerate them; or maybe we want to remove them. We can look at the mean and maximum values with respect to product type, and see how many of them are extreme. And so the insight here is: well, 4.87% of the instances are outliers; maybe that's not really that much, the outliers are not horrible, so we just leave them in the data set. Now, for a different data set, the data scientist could come to a different conclusion, and then they would do whatever they deem appropriate to cleanse the data set. Okay, so now that we've done all the EDA, the next thing we're going to do is what is called feature engineering. We're going to transform our original feature variables (these are our original feature variables right here) into some other form before we feed them for training into our
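The box-plot outlier rule mentioned here can be sketched numerically with the usual 1.5 × IQR fences (the values below are invented for illustration):

```python
import pandas as pd

# The box-plot rule: a point is an outlier if it lies more than
# 1.5 * IQR outside the inter-quartile range.  Values are invented;
# the 2800 rpm reading is planted as an obvious outlier.
speed = pd.Series([1400, 1450, 1480, 1500, 1520, 1550, 1580, 2800])

q1, q3 = speed.quantile(0.25), speed.quantile(0.75)
iqr = q3 - q1
outliers = speed[(speed < q1 - 1.5 * iqr) | (speed > q3 + 1.5 * iqr)]

print(f"{len(outliers)} outlier(s), "
      f"{len(outliers) / len(speed):.2%} of the column")
```

Depending on what fraction of the column this flags, the data scientist then decides, as described above, whether to tolerate or remove the outliers.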
machine learning algorithm. So here's an example of an original data set, and these are some examples of what we call feature engineering: you don't have to use all of them, but these are some of the transformations you can apply to turn the original values of your feature variables into transformed values. We're going to do pretty much that here: we apply ordinal encoding, and we scale the data, in this case with min-max scaling. And then finally we come to the modeling, where we have to split our data set into a training data set and a test data set. So, coming back to this again: we said that before you train your model, you take your original data set, which is now a feature-engineered data set, and break it into two or more subsets. One is the training data set, which we use to fit, or train, the machine learning model; the second is the test data set, used to evaluate the accuracy of the model. So we've got the training data set and the test data set, and we also need to sample: from the original data set we sample some points that go into the training data set and some points that go into the test data set. There are many ways to do sampling; one way is stratified sampling, where we ensure that each stratum, or class, has the same proportion in the training and test data sets as in the original data set. This is very useful for dealing with what is called an imbalanced data set, which is what we have here: the vast majority of data points in the data set have the value of zero for their target variable column, and only an extremely small minority of the data points in your
data set will actually have the value of one for their target variable column. So a situation where, in your class or target variable column, the vast majority of values are from one class and only a tiny minority are from another class, is called an imbalanced data set. And for an imbalanced data set we typically use a specific technique for the train-test split, called stratified sampling, and that's exactly what's happening here: we're doing a train-test split with a stratified split. And then we actually develop the models; here is where we train them. In terms of classification there are a whole bunch of possibilities: many different algorithms can be used to create a classification model. These are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, and other ensemble methods. All these different algorithms will create different kinds of models, which will result in different accuracy measures. So it's the goal of the data scientist to find the model that gives the best accuracy when trained on the given data set. Let's head back to our machine learning workflow. Here what I'm doing is creating a whole bunch of models: one is a random forest, one is balanced bagging, one is a boosting classifier, one is an ensemble classifier. Using each of these algorithms I fit, or train, a model, and then I evaluate them: I evaluate how good each of these models is, and here you can see the evaluation data. And this is the confusion matrix, which is another way of evaluating. So now we come to the key part, which
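Training and comparing several candidate classifiers on one split might look like this. Note this is a sketch on synthetic data using plain scikit-learn estimators; the notebook additionally uses a balanced bagging classifier, which comes from the separate imbalanced-learn library:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# A synthetic imbalanced binary problem standing in for the real data
# (~90% "no failure", ~10% "failure").
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest":       RandomForestClassifier(random_state=0),
    "bagging":             BaggingClassifier(random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(random_state=0),
}

# Fit every candidate on the same training split and score it the same
# way, so the comparison between algorithms is apples to apples.
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: F1 = {f1_score(y_te, model.predict(X_te)):.3f}")
```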
is: how do I distinguish between all these models? I've got all these different models, built with different algorithms, all trained on the same data set; how do I tell them apart? For that we have a whole bunch of common evaluation metrics for classification. These evaluation metrics tell us how good a model is in terms of its accuracy in classification, and there are actually many different measures. You might think, well, accuracy is just accuracy, right? Either it's accurate or it's not. But actually it's not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. For example, the confusion matrix tells us how many true positives there are, meaning the value is positive and the prediction is positive; how many false positives, meaning the value is negative but the model predicts positive; how many false negatives, meaning the model predicts negative but the value is actually positive; and how many true negatives, meaning the model predicts negative and the true value is also negative. So this is called a confusion matrix, and it's one way we assess or evaluate the performance of a classification model. This one is for binary classification; we can also have a multiclass confusion matrix. Then we can also measure things like accuracy: accuracy is the true positives plus the true negatives, which is the total number of correct predictions made by the model, divided by the total number of data points in your data set. And then you have other kinds of measures, such as recall (this is the formula for recall) and the F1 score (this is its formula), and then there's something called the ROC curve. So without going too much into the
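These definitions can be written out directly from the confusion matrix (true labels and predictions invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hand-made true labels and predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

# scikit-learn lays the binary confusion matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# The formulas behind the scores listed above:
accuracy  = (tp + tn) / (tp + tn + fp + fn)  # all correct / all points
precision = tp / (tp + fp)                   # how trustworthy a "failure" call is
recall    = tp / (tp + fn)                   # how many real failures were caught
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)       # 0.8 0.75 0.75 0.75
```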
detail of what each of these entails: essentially these are all different KPIs. Just like in a company, where certain employees have certain KPIs that measure how efficient or effective they are, the KPIs for your machine learning models are the ROC curve, the F1 score, recall, accuracy, and the confusion matrix. So fundamentally, after I've built my four different models, I check and evaluate them using all those different metrics, for example the F1 score, the precision score, and the recall score. For this model I can check the ROC AUC score, the F1 score, the precision score, and the recall score; then for the next model the same set of scores; and so on. For every single model I've created using my training data set, I have a set of evaluation metrics I can use to judge how good that model is. Same thing here: I've got a confusion matrix for each, which I can again use to compare all four models. And then I summarize it all up here. We can see from this summary that the top two models, which I'm now going to focus on as the data scientist, are the bagging classifier and the random forest classifier: they have the highest F1 scores and the highest ROC AUC scores. So we can say these are the top two models in terms of accuracy, using the F1 evaluation metric and the ROC AUC evaluation metric. These results are summarized here, and then we try different sampling techniques. Just now I talked about different kinds of sampling techniques, and the idea of using them is to get a different
feel for different distributions of the data in different areas of your data set, so you can make sure your evaluation of accuracy is statistically sound. So we can do what is called oversampling and undersampling, which is very useful when you're working with an imbalanced data set. Here's an example of doing that, and then we check the results for all these different techniques using the F1 score and the AUC score, the two key measures of accuracy, and compare the scores across the different approaches. So we can see: overall the models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC AUC score, but its F1 score was too low. In the data scientist's opinion, the random forest with this particular sampling technique has a good equilibrium between the F1 and AUC scores. So takeaway one is that the macro F1 score improves dramatically using these sampling techniques, so these models might be better compared to the balanced ones. Based on all this evaluation, the data scientist says they're going to continue working with these two models, plus the balanced bagging one, and make further comparisons. So we keep refining our evaluation work: we train the models one more time, doing a train-test split again, for each particular approach and model, and then we print out what is called a classification report. This is basically a summary of all those metrics I talked about just now: remember, there were several evaluation metrics, the confusion matrix, the accuracy, the precision, the recall, the AUC score. With the classification report I get a summary of all of that, and I can see all the values here for
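Here is a sketch of the simplest form of oversampling, random resampling with replacement. The notebook uses the more sophisticated SMOTE / Borderline-SMOTE and Tomek-links techniques from the imbalanced-learn library, but the balancing idea is the same:

```python
import pandas as pd
from sklearn.utils import resample

# An imbalanced toy frame: 8 "no failure" rows vs 2 "failure" rows.
df = pd.DataFrame({"Torque": range(10),
                   "Target": [0] * 8 + [1] * 2})

majority = df[df["Target"] == 0]
minority = df[df["Target"] == 1]

# Naive random oversampling: draw minority rows with replacement
# until the two classes are the same size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["Target"].value_counts())  # both classes now have 8 rows
```

Undersampling is the mirror image: shrink the majority class down to the minority's size instead.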
this particular model, bagging with Tomek links; then I can do the same for another model, the random forest with Borderline-SMOTE; and then for another model, the balanced bagging. So again there's a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us. Then again we have a confusion matrix: we generate one for the bagging with Tomek-links undersampling, for the random forest with Borderline-SMOTE oversampling, and for the balanced bagging by itself, and we compare these three models using the confusion matrix to come to some conclusions. Having looked at all that data, we move on to another evaluation metric, which is the ROC AUC score. This is one of the other evaluation metrics I mentioned: it's a curve, and you look at the area underneath it, which is why it's called ROC AUC, the area under the ROC curve. The area under the curve gives us some idea about the threshold we're going to use for classification. We can examine this for the bagging classifier, for the random forest classifier, and for the balanced bagging classifier. Then, finally, we can check the classification report of this particular model. So we keep doing this over and over again: evaluating the accuracy metrics, the evaluation metrics, for all these different models, at different thresholds for classification. And as we keep drilling into these, we get more and more understanding of all these different models and which one gives the best performance for our data set. So finally we come to this conclusion: this particular model is not able to push the recall on the failure class beyond 95.8%; on the other
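The threshold tuning described here works by cutting the model's predicted probabilities at a value you choose instead of the default 0.5 (sketched below on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real problem.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # P(failure) for each test sample

# ROC AUC summarises ranking quality across all possible thresholds.
auc = roc_auc_score(y_te, proba)
print("ROC AUC:", round(auc, 3))

# Lowering the decision threshold below 0.5 trades precision for
# recall: more samples get called "failure", so fewer real failures
# are missed.  This is exactly the tuning discussed above.
recalls = {}
for threshold in (0.5, 0.4, 0.3):
    preds = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
    print(threshold, "recall:", round(recalls[threshold], 3))
```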
hand, balanced bagging with a decision threshold of 0.6 is able to achieve a better recall, etc. So finally, after having done all of these evaluations, this is the conclusion. Right now we have gone through all the steps of the machine learning life cycle, which means the data scientist has done the cleaning, exploration, preparation, transformation, and feature engineering; developed and trained multiple models; and evaluated all of them. So we've reached the stage where we, as the data scientist, have pretty much completed our job: we've come to some very useful conclusions that we can now share with our colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision will be made based on the recommendations coming from the data scientist at the end of this phase. So at the end of this phase, the data scientist comes up with these conclusions: if the engineering team is looking for the highest failure detection rate possible, then they should go with this particular model. If they want a balance between precision and recall, then they should choose between the bagging model with a 0.4 decision threshold or the random forest model with a 0.5 threshold. But if they don't care so much about predicting every failure and they want the highest precision possible, then they should opt for the bagging with Tomek-links classifier with a somewhat higher decision threshold. And this is the key thing the data scientist delivers; this is the key takeaway, the end
result of the entire machine learning life cycle. Now the data scientist tells the engineering team: all right, you guys, which is more important for you, point A, point B, or point C? Make your decision. The engineering team will then discuss among themselves and say: hey, you know what, what we want is the highest failure detection rate possible, because any kind of failure of that machine or product on the assembly line is really going to hurt us badly. So what we're looking for is the model that gives us the highest failure detection rate; we don't care so much about precision, but we want to make sure that if there's a failure, we catch it. That's what they want, and so the data scientist says: go for the balanced bagging model. Then the data scientist saves that model, and once it's saved, you can go right ahead and deploy it to production. And if we want, we can take this modeling problem further. Just now I modeled it as a binary classification problem, meaning it's either zero or one, either fail or not fail. But we can also model it as a multiclass classification problem, because, as I said earlier, the failure type column actually contains multiple kinds of failure: for example, you may have a power failure, a tool wear failure, or an overstrain failure. So now we can model the problem slightly differently, as a multiclass classification problem, and then we go through the same process we went through just now: we create different models and test them out, but now the confusion matrix is for a multiclass classification problem. So we're going to check them out; we're going to again try different
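Trying different algorithms for the multiclass problem typically involves hyperparameter tuning; a cross-validated grid search can be sketched like this (synthetic three-class data standing in for the failure types):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small synthetic multiclass problem standing in for the
# failure-type labels (3 classes instead of the real failure types).
X, y = make_classification(n_samples=300, n_informative=4, n_classes=3,
                           random_state=0)

# Grid search tries every combination of these hyperparameter values
# with cross-validation and keeps the best-scoring one.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1_macro",  # macro F1 treats every failure class equally
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```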
algorithms or models, again train and test on our data set, do the train-test split, and build these different models. So we have, for example, a balanced random forest and a random forest with grid search; you train the models using what is called hyperparameter tuning, and then you get the scores. You get the same evaluation scores again, you check them, compare between them, and generate a confusion matrix, this time a multiclass confusion matrix, and then you come to the final conclusion. So if you're interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist. The data scientist will say: you know what, I'm going to pick this particular model, the balanced bagging classifier, and these are all the reasons the data scientist gives as a rationale for selecting it. Once that's done, you save the model, and that's it; that's all done. So now the machine learning model can go live: you run it on the server, and it's ready to work, which means it's ready to generate predictions. That's the main job of the machine learning model: you've picked the best model, with the best evaluation metrics for whatever accuracy goal you're trying to achieve, and now you run it on a server. All this real-time data comes in from your sensors, you pump it into your machine learning model, the model pumps out a whole bunch of predictions, and you use those predictions in real time to make real-time, real-world decisions. You might say: okay, I'm predicting that this machine is going to fail on Thursday at 5:00 p.m., so you'd better get your service folks in to service it on Thursday at 2:00 p.m.,
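Saving the chosen model and serving predictions on fresh sensor readings can be sketched with joblib, which ships alongside scikit-learn (the file name and use of a temp directory are my own choices for illustration):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the chosen, fully trained model.
X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Persist the chosen model to disk ...
path = os.path.join(tempfile.gettempdir(), "failure_model.joblib")
joblib.dump(model, path)

# ... and, on the production server, load it back and serve
# predictions on incoming sensor readings as they stream in.
served = joblib.load(path)
print(served.predict(X[:5]))
```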
or whenever. So you can make decisions about when to do your maintenance, and make the best decisions to optimize the cost of maintenance, etc. And then, based on the results coming out of the predictions, which may be good, lousy, or average, we are constantly monitoring how good or how useful the predictions generated by this real-time model running on the server are. Based on our monitoring, we then take some new data and repeat this entire life cycle again. So this workflow is iterative: the data scientist is constantly taking in new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. And that's it; that is basically your machine learning workflow in a nutshell. For this particular project we used a bunch of data science libraries from Python. We used pandas, the most basic data science library, which provides all the tools to work with raw data. We used NumPy, a high-performance library for implementing complex array and matrix operations. We used Matplotlib and Seaborn, which are used in the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data. We used scikit-learn, the machine learning library that provides the implementations of all the core machine learning algorithms. We did not use deep learning libraries here, because this is not a deep learning problem; but if you're working on a deep learning problem, like image classification, image recognition, object detection, or natural language processing and text classification, then you'd use the deep learning libraries for Python, which are TensorFlow and PyTorch. And then, lastly, that whole data science project that you saw just now:
this entire data science project was actually developed in something called a Jupyter notebook. All the Python code, along with all the observations from the data scientist, for this entire project was run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects. Okay, so that brings me to the end of this presentation. I hope you found it useful, and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. All right, thank you all so much for watching.