Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI and machine learning, which is predictive maintenance. I will be covering these topics, and feel free to jump forward to the specific part in the video where I talk about each of them. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance problem. All right, so without any further ado, let's jump into it. Let's start off with a quick overview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention. The primary purpose of these algorithms is to optimize performance in a specific task, and the task we usually want to optimize for is making accurate predictions about future outcomes based on the analysis of historical data from the past. So essentially, machine learning is about making predictions about the future, or what we call predictive analytics.
And there are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of these algorithms and their use cases in various areas of industry; different algorithms are suited to different use cases. Deep learning is an advanced form of machine learning that's based on something called an artificial neural network, or ANN for short, which essentially simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools you have probably heard of today. I'm sure you have heard of ChatGPT, unless you've been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and that's based on this technology called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, use this particular form of machine learning called deep learning. So this is an example of an artificial neural network. Here I have an image of a bird that's fed into this artificial neural network, and the output is a classification of this image into one of these three potential categories.
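To make the idea concrete, here is a minimal sketch of a neural network classifier using scikit-learn's small `MLPClassifier`. This is not from the presentation's slides: real image classification uses much deeper networks and frameworks like TensorFlow or PyTorch, and the data here is synthetic, standing in for image features purely for illustration.

```python
# Toy illustration of an ANN learning to map inputs to one of 3 classes.
# Synthetic data stands in for real image features (an assumption for
# illustration only; real vision models are far larger).
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# 150 made-up samples with 8 features each, spread over 3 classes.
X, y = make_classification(
    n_samples=150, n_features=8, n_informative=5,
    n_classes=3, random_state=0,
)

# A small network with a single hidden layer of 16 neurons.
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
ann.fit(X, y)

# The trained network assigns one of the three classes to an input.
predicted_class = int(ann.predict(X[:1])[0])
```

The same fit-then-predict shape carries over to the much larger networks used for actual image classification.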
So in this case, if the ANN has been trained properly, when we feed in this image, the ANN should correctly classify it as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision. And just like in machine learning generally, there is a variety of algorithms available for deep learning, under the categories of supervised learning and unsupervised learning. So this is how we can categorize all of this: you can think of AI as the general area of smart systems; machine learning is basically applied AI; deep learning is a subspecialization of machine learning using a particular architecture called an artificial neural network; and generative AI, so ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI, they are basically large language models, a further subcategory within the area of deep learning. There are many applications of machine learning in industry right now, so pick whichever industry you're involved in and these are the specific areas of application. I'm going to guess the vast majority of you watching this video are coming from the manufacturing industry, and in manufacturing some of the standard use cases for machine learning and deep learning are predicting potential problems (sometimes called predictive maintenance, where you want to predict when a problem is going to happen and address it before it happens), monitoring systems, automating your manufacturing assembly or production line, smart scheduling, and detecting anomalies on your production line. So let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: it is an equipment maintenance strategy
that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. This uses advanced data models, analytics, and machine learning, whereby we can reliably assess when failures are more likely to occur, including which components are more likely to be affected on your production or assembly line. So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that factories and their production or assembly lines tended to handle maintenance issues, say, ten or twenty years ago. What you would probably start off with is the most basic mode, which is reactive maintenance: you just wait until your machine breaks down, and then you repair it. The simplest approach, but of course, if you've worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline; then you're going to have a backlog of orders and run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem then is that sometimes you're doing too much maintenance that isn't really necessary, and it still doesn't totally prevent failures of the machine that occur outside of your planned maintenance. So, a bit of improvement, but not that much better. These last two categories are where we bring in AI and machine learning. With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us to predict, with a
reasonable level of accuracy, when the next failure is going to happen on your assembly or production line, on a specific component or specific machine. You want to be able to predict, to a high level of accuracy, maybe down to the specific day, even the specific hour or minute, when you expect a particular product or machine to fail. So these are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time spent on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data you're getting. So, we're going to take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them you'll be able to look at some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases: waste management, manufacturing, building services, renewable energy, and also mining. You can click on these links and follow up with them if you want to read more. And this other website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance. It tells you about an industry survey of predictive maintenance, where we can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, that predictive maintenance is essential for the manufacturing industry, and
will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: we can see that the vast majority of key industry players in the manufacturing sector consider predictive maintenance to be a very important activity that they want to incorporate into their workflow. And we can see here the kind of ROI that's expected on an investment in predictive maintenance: 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, 30% reduction in maintenance cost. And best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, and others. You can jump over here and take a look at some of these use cases. Let me try and open one up, for example Mondi. You can see Mondi has used MATLAB, the software from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning, and you can study how they have used it and how it works: what their challenge was, the problems they were facing, and the solution they built with MathWorks Consulting, using data they collected in an Oracle database. Using MATLAB from MathWorks, they were able to create a deep learning model to solve this particular issue for their domain. So if you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases for predictive maintenance. In this next topic, I'm going to talk about machine
learning basics: what is actually involved in machine learning. I'm going to give a very quick, high-level conceptual overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category, which is called supervised learning. The particular use case I'll be discussing here, predictive maintenance, is basically a form of supervised learning. So how does supervised learning work? In supervised learning, you create a machine learning model by providing what is called a labeled data set as input to a machine learning program or algorithm. This data set contains what are called independent or feature variables (a set of variables), and one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's jump into a diagram that illustrates this more clearly. Let's say you have a data set here, an Excel spreadsheet, and this spreadsheet has a bunch of columns and a bunch of rows. These rows represent what we call observations, or samples, or data points, in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall, so they've got all this information about the customers who purchase products at this mall. Some
of the information they've got about the customers: their gender, their age, their income, and their number of children. All this information about the customers, we call the independent or feature variables. And alongside all this information, we also record how much each customer spends. These numbers here, we call the target variable, or the dependent variable. So a single row, one single data point, contains all the values for the feature variables and one single value for the label, or target variable. The primary purpose of the machine learning model is to create a mapping from your feature variables to your target variable: there's going to be a mathematical function that maps the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. This whole training process, we call fitting the model, and the target variable or label, this column of values here, is critical for providing the context for fitting, or training, the model. Once you've got a trained and fitted model, you can then use it to make an accurate prediction of target values corresponding to new feature values that the model has yet to encounter. And this, as I've said earlier, is called predictive analytics. So let's see what's actually happening here. You take your training data, this whole data set consisting of a thousand, or ten thousand, rows of data, and you feed the entire thing into your machine learning algorithm, and a couple of hours later your machine learning
algorithm comes out with a model, and the model is essentially a function that maps your feature variables (these four columns) to your target variable (this one single column). Once you have the model, you can put in a new data point. The new data point represents a new customer that you have never seen before. Let's say you've already got information about 10,000 customers that have visited this mall, and how much each of them spent. Now a totally new customer comes into the mall, someone who has never been here before, and what we know about this customer is: male, age 50, income 18, nine children. When you take this data and pump it into your model, your model is going to make a prediction. It's going to say: hey, based on everything I've been trained on and the model I've developed, I predict that a male customer of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want: that is the final output of your machine learning model. It makes a prediction about something it has never seen before. That is the core of machine learning and predictive analytics: making predictions about the future based on a historical data set. Now, there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable, or class label. For classification, you can have either binary or multiclass. Binary means just two classes, for example true or false, zero or one.
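Before going further, the fit-and-predict flow of the mall example can be sketched in a few lines of Python. The column names and all the numbers below are invented for illustration (the presentation doesn't provide the actual spreadsheet), and linear regression is just one of many algorithms that could play the role of "the model".

```python
# Sketch of the supervised-learning flow: train on labeled historical
# data, then predict the target for a customer never seen before.
# Data values are hypothetical stand-ins for the mall spreadsheet.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "Gender":   [0, 1, 0, 1, 0, 1],        # 0 = female, 1 = male (pre-encoded)
    "Age":      [23, 50, 31, 44, 60, 35],
    "Income":   [40, 18, 70, 55, 30, 90],
    "Children": [0, 9, 1, 2, 3, 0],
    "Spending": [120, 25, 200, 150, 60, 310],  # the label / target variable
})

X = data[["Gender", "Age", "Income", "Children"]]  # feature variables
y = data["Spending"]                               # target variable

model = LinearRegression().fit(X, y)  # "fitting" / training the model

# A brand-new data point: male, age 50, income 18, nine children.
new_customer = pd.DataFrame([[1, 50, 18, 9]], columns=X.columns)
prediction = float(model.predict(new_customer)[0])
```

The number that comes back plays the role of the "25 ringgit" prediction in the narration; with six made-up rows it is of course not a meaningful forecast.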
For example: is your machine going to fail, or is it not going to fail? Just two classes, two possible outcomes. Or: is the customer going to make a purchase, or not? We call this binary classification. Then there's multiclass, when there are more than two classes or types of values. So for example, here is a classification problem: you have a data set with information about your customers, the gender of the customer, the age of the customer, the salary of the customer, and you also have a record of whether the customer made a purchase or not. You can take this data set to train a classification model, and the classification model can then make a prediction about a new customer: it will predict zero, which means the customer didn't make a purchase, or one, which means the customer made a purchase. And this here is regression. Let's say you want to predict the wind speed, and you've got historical data on four other independent or feature variables: you have recorded the temperature, the pressure, the relative humidity, and the wind direction for the past 10 or 15 days. Now you train your machine learning model using this data set, and the target variable column, the label, is a number, so this is a regression model. Now you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will predict the wind speed for that new data point. So that's a regression model. All right, so in this next topic I'm going to talk about the workflow involved in machine learning. In the previous slides I talked about developing the model, but that's just one part of the entire workflow.
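Before moving on to the workflow, the binary-classification example just described can be sketched concretely. As before, the data is invented (real data sets are far larger), and a decision tree is just one possible choice of classifier; the point is that the output is a class label, 0 or 1, rather than a number on a continuous scale.

```python
# Minimal binary-classification sketch of the "will the customer
# purchase?" example. All values are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "Gender":    [0, 1, 0, 1, 0, 1, 1, 0],
    "Age":       [22, 45, 33, 51, 27, 38, 60, 41],
    "Salary":    [25, 80, 45, 95, 30, 60, 85, 50],
    "Purchased": [0, 1, 0, 1, 0, 1, 1, 0],   # target: 1 = made a purchase
})

clf = DecisionTreeClassifier(random_state=0)
clf.fit(df[["Gender", "Age", "Salary"]], df["Purchased"])

# Predict for a new customer the model has never seen.
new_customer = pd.DataFrame([[1, 48, 85]], columns=["Gender", "Age", "Salary"])
label = int(clf.predict(new_customer)[0])   # 0 = no purchase, 1 = purchase
```

Swapping the classifier for a regressor and the 0/1 column for a numeric one gives the wind-speed regression setup described above.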
In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is that you need to get your data, then you need to clean your data, and then you need to explore your data: you need to see what's going on in your data set. Real-life data sets are not trivial; they are hundreds, thousands, sometimes millions or billions of rows. We're talking about millions or billions of data points, especially if you're using IoT sensors to get data in real time. So you've got these super-large data sets, and you need to clean them and explore them, and then prepare them into the right format so you can put them into the training process to create your machine learning model. Then, subsequently, you check how good the model is: how accurate is it in terms of its ability to generate predictions for the future? That's validating, or evaluating, your model. Then, if you determine that your model is of adequate accuracy to meet your domain use case requirements (say the required accuracy for your use case is 85%, and your model can give an 85% accuracy rate, so it's good enough), you deploy it into a real-world use case. Here the machine learning model gets deployed on a server, data is captured from various sources and pumped into the model, the model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other scenario. Then you constantly monitor and update the model: you get new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell. Here's another example of the same thing, maybe in a slightly
different format. Again, you have your data collection and preparation. Here we talk more about the different kinds of algorithms available to create a model, and I'll discuss this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. Once you have chosen the appropriate algorithm, you train your model, and you select the best of the trained models: you are probably going to develop multiple models from multiple algorithms, evaluate them all, and then say, hey, after I've evaluated and tested everything and chosen the best model, I'm going to deploy it. This is for real-life production use: real-life sensor data is pumped into my model, my model generates predictions, the predicted data is used immediately, in real time, for real-life decision making, and then I monitor the results. Somebody is using the predictions from my model; if the predictions are lousy, the monitoring system captures that, and if the predictions are fantastic, that's also captured by the monitoring system, and that gets fed back into the next cycle of my machine learning pipeline. So that's the overall view, and here are the key phases of the workflow. One of the important phases is called EDA, exploratory data analysis, and in this phase you do a lot of work primarily just to understand your data set. Like I said, real-life data sets tend to be very complex, and they have various statistical properties; statistics is a very important component of machine learning. EDA helps you get an overview of your data set and of any problems in it, like missing data, the statistical properties of your data set, the
distribution of your data set, the statistical correlation of variables in your data set, et cetera. Then we have data cleaning, or sometimes data cleansing, and in this phase you primarily want to do things like removing duplicate records or rows in your table, making sure your data points or samples have appropriate IDs, and, most importantly, making sure there are not too many missing values in your data set. What I mean by missing values is this: you have a data set, and for some reason there are some cells or locations in your data set where values are missing. If you have a lot of these missing values, you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set, and how to handle them. Another important thing in data cleansing is figuring out the outliers in your data set. Outliers are data points that are very far from the general trend of data points in your data set. There are several ways to detect outliers and several ways to handle them, and similarly there are several ways to handle missing values. Handling missing values and handling outliers are really the two key concerns of data cleansing, and there are many, many techniques for this, so a data scientist needs to be acquainted with all of them. Why do I need to do data cleansing? Well, here is the key point.
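A tiny sketch of those two cleaning tasks, with invented numbers: filling a missing value with the median is one of several strategies (dropping the row is another), and the IQR rule is likewise just one of several ways to detect outliers.

```python
# Data cleaning sketch: handle a missing value, then flag outliers.
# The readings are hypothetical sensor values for illustration.
import pandas as pd

readings = pd.Series([10.1, 12.3, 11.0, 13.2, 200.0, 12.8, None])

# Missing values: fill with the median of the observed values.
filled = readings.fillna(readings.median())

# Outliers: flag anything beyond 1.5 * IQR outside the quartiles.
q1, q3 = filled.quantile(0.25), filled.quantile(0.75)
iqr = q3 - q1
outliers = filled[(filled < q1 - 1.5 * iqr) | (filled > q3 + 1.5 * iqr)]
print(outliers)  # the 200.0 reading sits far from the general trend
```

On a real data set you would apply the same logic per column, and decide per variable whether to fill, drop, cap, or keep the flagged values.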
If you have a very poor-quality data set, meaning you've got a lot of outliers that are errors, or a lot of missing values, then even though you've got a fantastic algorithm and a fantastic model, the predictions your model gives will be absolute rubbish. It's like taking water and putting it into the tank of a Mercedes-Benz: a Mercedes-Benz is a great car, but if you put water into it, it will just die; a car can't run on water. On the other hand, a Myvi is just an ordinary little car, but if you put good high-octane petrol into it, it will go at a hundred miles an hour and completely destroy that Mercedes-Benz in terms of performance. So it doesn't really matter which model you're using: you can be using the most fantastic model, the Mercedes-Benz of machine learning, but if your data is lousy quality, your predictions are also going to be rubbish. Cleansing the data set is in fact probably the most important thing data scientists do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithm and so on is really a small portion of the actual machine learning workflow; the vast majority of the time goes on cleaning and organizing your data. Then you have something called feature engineering, where you pre-process the feature variables of your original data set prior to using them to train the model, through addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you need to transform categorical data into numeric data. Now, in the earlier slides I showed you that you take your original data set, pump it into an algorithm, and a couple of hours later you get a machine learning model, without doing anything to the feature variables in your data set
before you pump it into the machine learning algorithm. What I showed you earlier was taking the data set exactly as it is and pumping it into the algorithm, and a couple of hours later you get the model. But that's not what generally happens in real life. In real life, you take all the original feature variables from your data set and transform them in some way. You can see here these are the columns of data from my original data set, and before I actually feed these data points into my algorithm to train and get my model, I will transform them. This transformation of the feature variable values is what we call feature engineering, and there are many, many techniques for it: one-hot encoding, scaling, log transformation, discretization, date extraction, Boolean logic, et cetera. Then, finally, we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set, and the other is called the test data set. The primary purpose of this is that when we fit and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you are not going to have just one possible model; there is a vast range of algorithms you can use to create a model, so fundamentally you have a wide range of choices. It's like buying a car: you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a BMW; there are many, many different cars available to you if you want to buy a car. Same thing with a machine learning model: there is a vast variety of algorithms you can choose from.
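Two of the feature-engineering techniques just listed can be sketched quickly. The column names below echo the style of the demo data set (a categorical "Type" and a numeric "Torque"), but the values are invented; one-hot encoding and min-max scaling are shown, with the other techniques following the same transform-before-training pattern.

```python
# Feature engineering sketch: one-hot encode a categorical column and
# min-max scale a numeric one. Values are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "Type":   ["L", "M", "H", "L"],       # categorical quality type
    "Torque": [40.2, 55.1, 38.9, 61.3],   # numeric feature
})

# One-hot encoding: "Type" becomes 0/1 indicator columns, which is
# required by models that only accept numeric inputs.
encoded = pd.get_dummies(df, columns=["Type"])

# Min-max scaling: squeeze "Torque" into the range [0, 1].
t = encoded["Torque"]
encoded["Torque"] = (t - t.min()) / (t.max() - t.min())
```

After transformations like these, the encoded frame, not the raw one, is what gets fed to the training algorithm.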
Once you create a model from a given algorithm, you need to ask: how accurate is this model I've created from this algorithm? Different algorithms are going to create different models with different rates of accuracy, and the primary purpose of the test data set is to evaluate the accuracy of a model, to see whether the model created using this algorithm is adequate for use in a real-life production use case. So that's what it's all about. This is my original data set: I break it into my feature variable columns and my target variable column, and then I further break it into a training data set and a test data set. The training data set is used to train, or create, the machine learning model, and once the model is created, I use the test data set to evaluate its accuracy. And then, finally, we can see the different parts that go into a successful model: EDA, about 10%; data cleansing, about 20%; feature engineering, about 25%; selecting a specific algorithm, about 10%; training the model from that algorithm, about 15%; and finally evaluating the models and deciding which is the best one, with the highest accuracy rate, about 20%. All right, so we have reached the most interesting part of this presentation, which is the demonstration of an end-to-end machine learning workflow on a real-life data set for the use case of predictive maintenance. For this use case I've used a data set from Kaggle. For those of you who are not aware, Kaggle is the world's largest open-source community for data science and AI. They have a large collection of data sets from all areas of industry and human endeavor, and they also have a large collection of models that have been developed using these data sets.
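The train-test split and evaluation step described a moment ago looks like this in code. Synthetic data stands in for a real labeled data set here (the demo itself uses the Kaggle data), and a decision tree stands in for whichever algorithm you are evaluating.

```python
# Sketch of train-test split and accuracy evaluation. Synthetic data
# is a stand-in for a real labeled data set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=0)

# Hold out 20% of the rows as the test set; train only on the other 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Evaluate on data the model never saw during training.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Repeating this with several algorithms and comparing the test accuracies is exactly the "evaluate them all and choose the best model" step from the workflow.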
So here we have a data set for our particular use case, predictive maintenance. This is some information about the data set, and in case you don't know how to get there, this is the URL to click on. Once you're at the page for the data set, you can see all the information about it, and you can download the data set in CSV format. So let's take a look at the data set. It has a total of 10,000 samples, and these are the feature variables: the type, the product ID, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. And this is the target variable. The target variable is what we are interested in: what we use to train the machine learning model, and also what we want to predict. The feature variables describe, or provide information about, a particular machine on the production or assembly line: the product ID, the type, the air temperature, the process temperature, the rotational speed, the torque, the tool wear. So let's say you've got an IoT sensor system that's capturing all this data about a product or machine on your production or assembly line, and you've also captured, for each specific sample, whether that sample experienced a failure or not. A target value of zero indicates no failure, and we can see that the vast majority of data points in this data set are no-failure cases. And here we can see an example of a failure: a failure is marked as a one (positive) and no failure is marked as a zero (negative). Here we have one type of failure, called a power failure, and if you scroll down the data set you'll see there are also other kinds of failures, like a tool wear failure, an overstrain failure here, for example, and another power
failure again, and so on. So if you scroll down through these 10,000 data points, or if you're familiar with using Excel to filter values in a column, you can see that in this particular column, the so-called target variable column, the vast majority of values are zero, which means no failure. Some of the rows, or data points, have a value of one, and for those rows with a value of one you'll see different types of failure: like I said just now, power failure, tool wear failure, etc. So we're going to go through the entire machine learning workflow with this data set. To see an example of that, let's go to the Code section here. If I click on the Code section, right down here we have notebooks for this data set. These are Jupyter notebooks; Jupyter is a Python application that allows you to create a Python machine learning program that builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it. So here we have a whole bunch of Jupyter notebooks available, and you can select any one of them; all these notebooks process the data from this particular data set. On this Code page I've actually selected a specific notebook that I'm going to run through to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python programming language. The particular notebook I'm going to use is this one here, and you can also get the URL for that notebook from here. So let's quickly revise what we're trying to do: we're trying to build a machine learning classification model. We said there are two primary kinds of supervised learning. One is
regression, which is used to predict a numerical target variable; the second kind of supervised learning is classification, which is what we're doing here: we're trying to predict a categorical target variable. In this particular example we actually have two ways we can classify: either a binary classification or a multiclass classification. For binary classification, we only classify the product or machine as either it failed or it did not fail. So if we go back to the data set I showed you just now and look at this target variable column, there are only two possible values: either zero or one. Zero means there's no failure, one means there's a failure. So this is an example of binary classification: only two possible outcomes, zero or one, didn't fail or failed. And then, for the same data set, we can extend it and make it a multiclass classification problem. If we want to drill down further, we can say that not only is there a failure, there are actually different types of failure. So we have one class that is basically no failure, and then we have a class for each of the different types of failure: you can have a power failure, you could have a tool wear failure, you could have, let's go down here, an overstrain failure, and so on. So you can have multiple classes of failure in addition to the overall majority class of no failure, and that would be a multiclass classification problem. With this data set we're going to see how to frame it both as a binary classification problem and as a multiclass classification problem. Okay, so let's look at the workflow. Let's say we've already got the data; right now we do have this data set. So let's assume we've somehow managed to get it from some IoT sensors that are monitoring real-time data in our production environment on
the assembly line. We've got sensors reading data that give us everything we have in this CSV file. Okay, so we've already got the data, we've retrieved the data; now we move on to the cleaning and exploration part of the machine learning life cycle. Let's look at data cleaning first. In data cleaning we're interested in checking for missing values, and maybe removing the rows with missing values. The kinds of things we can do about missing values: we can remove the rows that have them, or we can put in some replacement values, which could be the average of all the values in that particular column, etc. We also try to identify outliers in our data set, and there are a variety of ways to deal with those as well. So this is called data cleansing, which is a really important part of your machine learning workflow. That's where we are now: we're doing cleansing, and then we'll follow up with exploration. So let's look at the actual code that does the cleansing. Here we are right at the start of the machine learning life cycle, in the Jupyter notebook, and here we have a brief description of the problem statement: this data set reflects real-life predictive maintenance as encountered in industry, with measurements from real equipment; the feature descriptions are taken directly from the data source. So here we have a description of the six key features in our data set: the type, which is the quality of the product, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. These are the six feature variables, and then there are the two target variables. As I showed you just now, there's one target variable which only has two possible values, either zero or one, meaning failure or no failure; that would be this column here. So let me go all the way back up to here. So
this column here, as we already saw, only has two possible values, either zero or one. And then we also have this other column here, which is the failure type. As I already demonstrated just now, we have several categories, or types, of failure, and that's what we use for multiclass classification. So we can either build a binary classification model for this problem domain, or we can build a multiclass classification model, and this Jupyter notebook is going to demonstrate both approaches. First step: we write the Python code that imports all the libraries we need. So this is basically Python code importing the relevant machine learning libraries related to our domain use case. Then we load in our data set, we describe it, we get some quick insights into it, and we take a look at all the feature variables and so on. What we're doing now is just a quick overview of the data set: all this Python code allows us, the data scientists, to get a quick overview of our data, like how many rows there are, how many columns there are, what the data types of the columns are, what the names of the columns are, etc. Then we zoom in on the target variables: we look at how many counts there are of each target value, how many different types of failure there are, and so on. Then we want to check whether there are any inconsistencies between the Target and the Failure Type columns. When you do all this checking, you're going to discover there are some discrepancies in your data set. So using specific Python code to do the checking, you're going to say: hey, you know what, there are some errors here. There are nine values that are classified as failure in the Target
variable but as no failure in the Failure Type variable, which means there's a discrepancy in those data points. These are all the rows with discrepancies: the target variable says one, and we already know that a target value of one is supposed to mean there's a failure, so we expect to see a failure classification; but some rows actually say there's no failure although the target is one. This is a classic example of an error that can very well occur in a data set. So now the question is: what do you do with these errors? Here the data scientist says, I think it would make sense to remove those instances, and so they write some code to remove those instances, those rows or data points, from the overall data set. In the same way we can check for other issues; we find there's another issue with our data set, another warning, so again we can remove those rows. In total you remove 27 instances, or rows, from your overall data set. Your data set has 10,000 rows, and you're removing 27, which is only 0.27% of the entire data set, and these were the reasons why you removed them. So if you're just removing 0.27% of the entire data set, no big deal, right? Still okay. But you needed to remove them, because these 27 erroneous data points could really affect the training of your machine learning model. So we need to do our data cleansing: we're now cleansing data that is incorrect or erroneous in the original data set. Then we go on to the next part, which is called EDA, exploratory data analysis. EDA is where we explore our data: we want to get a visual overview of the data as a whole, and also take a look at the statistical properties of the data, the statistical distribution of the data in all the various
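The consistency check and clean-up just described can be sketched like this. The column names and rows are illustrative assumptions, not the notebook's actual code:

```python
import pandas as pd

# Mock rows reproducing the kind of discrepancy described above:
# Target == 1 (failure) should always come with a concrete failure type.
df = pd.DataFrame({
    "Target":       [0, 1, 1, 0, 1],
    "Failure Type": ["No Failure", "Power Failure", "No Failure",
                     "No Failure", "Tool Wear Failure"],
})

# Rows flagged as a failure but labelled "No Failure" are contradictory.
bad = (df["Target"] == 1) & (df["Failure Type"] == "No Failure")
print(f"Removing {bad.sum()} inconsistent row(s), "
      f"{bad.sum() / len(df):.2%} of this mock data set")

# Drop the contradictory rows, as the notebook's author chose to do.
df_clean = df[~bad].reset_index(drop=True)
```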
columns, the correlation between the feature variables in the different columns, and also between the feature variables and the target variable. All of this is called EDA, and EDA in a machine learning workflow is typically done through visualization. So let's go back and take a look. For example, here we're looking at correlation: we plot the values of all the various feature variables against each other and look for potential correlations and patterns. All the different shapes you see here in this pair plot have different statistical meanings, so the data scientist has to visually inspect the pair plot and interpret the different patterns they see. These are some of the insights that can be deduced from these patterns: for example, the torque and rotational speed are highly correlated, the process temperature and air temperature are highly correlated, and failures occur at extreme values of some features, etc. Then you can plot certain other kinds of charts, like this one called a violin plot, to get new insights: for example, regarding torque and rotational speed, we can see again that most failures are triggered at values much lower or much higher than the mean of the non-failing cases. So all these visualizations are there, and a trained data scientist can look at them, inspect them, and make insightful deductions from them. We have the percentage of failures; the correlation heat map between all the different feature variables and the target variable; and the product types, the percentage of each product type, and the percentage of failures with respect to product type, which we can visualize as well. So certain product types have a higher ratio of failure compared to others; for example, type M products tend to fail more than type H products. So we can create a vast variety of visualizations in the EDA stage. So you
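As a small numeric illustration of the correlation part of EDA (synthetic numbers, not the real sensor data; the negative speed-torque relationship is built in on purpose to mimic the pattern the pair plot reveals):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for two sensor readings: torque is constructed to
# be negatively correlated with rotational speed, mimicking the
# relationship visible in the notebook's pair plot.
speed = rng.normal(1500, 100, 500)
torque = 60 - 0.02 * speed + rng.normal(0, 1, 500)
df = pd.DataFrame({"Rotational speed [rpm]": speed, "Torque [Nm]": torque})

# The correlation matrix is the numeric counterpart of the heat map;
# with seaborn installed you could render it via sns.heatmap(df.corr())
# or explore every pairwise relationship with sns.pairplot(df).
corr = df.corr()
print(corr.round(2))
```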
can see here, again, that the idea of this visualization is to give us some preliminary insight into our data set that helps us model it more correctly. We get some more insights from all this visualization. Then we can plot the distributions, to see whether a feature follows a normal distribution or some other kind of distribution, and we can use box plots to see whether there are any outliers in the data set. From the box plots we can see that some features, like rotational speed, have outliers. We already saw that outliers are a problem you may need to tackle; dealing with them is part of data cleansing. So we check where the potential outliers are; we can analyze them from the box plot. But then we might say, well, they are outliers, but maybe they're not really horrible outliers, so we can tolerate them; or maybe we want to remove them. We can look at the mean and maximum values with respect to product type, and see how many of them are extreme. And so the insight here is: well, 4.87% of the instances are outliers; maybe that's not really that much, the outliers are not horrible, so we just leave them in the data set. Now, for a different data set, the data scientist could come to a different conclusion, and then they would do whatever they deem appropriate to cleanse the data set. Okay, so now that we've done all the EDA, the next thing we're going to do is what is called feature engineering. We're going to transform our original feature variables (these are our original feature variables right here) into some other form before we feed them for training into our
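The box-plot outlier rule mentioned here can be sketched numerically with the usual 1.5 × IQR fences (the values below are invented for illustration):

```python
import pandas as pd

# The box-plot rule: a point is an outlier if it lies more than
# 1.5 * IQR outside the inter-quartile range.  Values are invented;
# the 2800 rpm reading is planted as an obvious outlier.
speed = pd.Series([1400, 1450, 1480, 1500, 1520, 1550, 1580, 2800])

q1, q3 = speed.quantile(0.25), speed.quantile(0.75)
iqr = q3 - q1
outliers = speed[(speed < q1 - 1.5 * iqr) | (speed > q3 + 1.5 * iqr)]

print(f"{len(outliers)} outlier(s), "
      f"{len(outliers) / len(speed):.2%} of the column")
```

Depending on what fraction of the column this flags, the data scientist then decides, as described above, whether to tolerate or remove the outliers.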
machine learning algorithm. So here's an example of an original data set, and these are some examples of what we call feature engineering: you don't have to use all of them, but these are some of the transformations you can apply to turn the original values of your feature variables into transformed values. We're going to do pretty much that here: we apply ordinal encoding, and we scale the data, in this case with min-max scaling. And then finally we come to the modeling, where we have to split our data set into a training data set and a test data set. So, coming back to this again: we said that before you train your model, you take your original data set, which is now a feature-engineered data set, and break it into two or more subsets. One is the training data set, which we use to fit, or train, the machine learning model; the second is the test data set, used to evaluate the accuracy of the model. So we've got the training data set and the test data set, and we also need to sample: from the original data set we sample some points that go into the training data set and some points that go into the test data set. There are many ways to do sampling; one way is stratified sampling, where we ensure that each stratum, or class, has the same proportion in the training and test data sets as in the original data set. This is very useful for dealing with what is called an imbalanced data set, which is what we have here: the vast majority of data points in the data set have the value of zero for their target variable column, and only an extremely small minority of the data points in your
data set will actually have the value of one for their target variable column. So a situation where, in your class or target variable column, the vast majority of values are from one class and only a tiny minority are from another class, is called an imbalanced data set. And for an imbalanced data set we typically use a specific technique for the train-test split, called stratified sampling, and that's exactly what's happening here: we're doing a train-test split with a stratified split. And then we actually develop the models; here is where we train them. In terms of classification there are a whole bunch of possibilities: many different algorithms can be used to create a classification model. These are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, and other ensemble methods. All these different algorithms will create different kinds of models, which will result in different accuracy measures. So it's the goal of the data scientist to find the model that gives the best accuracy when trained on the given data set. Let's head back to our machine learning workflow. Here what I'm doing is creating a whole bunch of models: one is a random forest, one is balanced bagging, one is a boosting classifier, one is an ensemble classifier. Using each of these algorithms I fit, or train, a model, and then I evaluate them: I evaluate how good each of these models is, and here you can see the evaluation data. And this is the confusion matrix, which is another way of evaluating. So now we come to the key part, which
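Training and comparing several candidate classifiers on one split might look like this. Note this is a sketch on synthetic data using plain scikit-learn estimators; the notebook additionally uses a balanced bagging classifier, which comes from the separate imbalanced-learn library:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# A synthetic imbalanced binary problem standing in for the real data
# (~90% "no failure", ~10% "failure").
X, y = make_classification(n_samples=1000, n_features=6, weights=[0.9],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest":       RandomForestClassifier(random_state=0),
    "bagging":             BaggingClassifier(random_state=0),
    "boosting (AdaBoost)": AdaBoostClassifier(random_state=0),
}

# Fit every candidate on the same training split and score it the same
# way, so the comparison between algorithms is apples to apples.
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: F1 = {f1_score(y_te, model.predict(X_te)):.3f}")
```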
is: how do I distinguish between all these models? I've got all these different models, built with different algorithms, all trained on the same data set; how do I tell them apart? For that we have a whole bunch of common evaluation metrics for classification. These evaluation metrics tell us how good a model is in terms of its accuracy in classification, and there are actually many different measures. You might think, well, accuracy is just accuracy, right? Either it's accurate or it's not. But actually it's not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. For example, the confusion matrix tells us how many true positives there are, meaning the value is positive and the prediction is positive; how many false positives, meaning the value is negative but the model predicts positive; how many false negatives, meaning the model predicts negative but the value is actually positive; and how many true negatives, meaning the model predicts negative and the true value is also negative. So this is called a confusion matrix, and it's one way we assess or evaluate the performance of a classification model. This one is for binary classification; we can also have a multiclass confusion matrix. Then we can also measure things like accuracy: accuracy is the true positives plus the true negatives, which is the total number of correct predictions made by the model, divided by the total number of data points in your data set. And then you have other kinds of measures, such as recall (this is the formula for recall) and the F1 score (this is its formula), and then there's something called the ROC curve. So without going too much into the
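These definitions can be written out directly from the confusion matrix (true labels and predictions invented for illustration):

```python
from sklearn.metrics import confusion_matrix

# Hand-made true labels and predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 0, 0]

# scikit-learn lays the binary confusion matrix out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# The formulas behind the scores listed above:
accuracy  = (tp + tn) / (tp + tn + fp + fn)  # all correct / all points
precision = tp / (tp + fp)                   # how trustworthy a "failure" call is
recall    = tp / (tp + fn)                   # how many real failures were caught
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)       # 0.8 0.75 0.75 0.75
```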
detail of what each of these entails: essentially these are all different KPIs. Just like in a company, where certain employees have certain KPIs that measure how efficient or effective they are, the KPIs for your machine learning models are the ROC curve, the F1 score, recall, accuracy, and the confusion matrix. So fundamentally, after I've built my four different models, I check and evaluate them using all those different metrics, for example the F1 score, the precision score, and the recall score. For this model I can check the ROC AUC score, the F1 score, the precision score, and the recall score; then for the next model the same set of scores; and so on. For every single model I've created using my training data set, I have a set of evaluation metrics I can use to judge how good that model is. Same thing here: I've got a confusion matrix for each, which I can again use to compare all four models. And then I summarize it all up here. We can see from this summary that the top two models, which I'm now going to focus on as the data scientist, are the bagging classifier and the random forest classifier: they have the highest F1 scores and the highest ROC AUC scores. So we can say these are the top two models in terms of accuracy, using the F1 evaluation metric and the ROC AUC evaluation metric. These results are summarized here, and then we try different sampling techniques. Just now I talked about different kinds of sampling techniques, and the idea of using them is to get a different
feel for different distributions of the data in different areas of your data set, so you can make sure your evaluation of accuracy is statistically sound. So we can do what is called oversampling and undersampling, which is very useful when you're working with an imbalanced data set. Here's an example of doing that, and then we check the results for all these different techniques using the F1 score and the AUC score, the two key measures of accuracy, and compare the scores across the different approaches. So we can see: overall the models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC AUC score, but its F1 score was too low. In the data scientist's opinion, the random forest with this particular sampling technique has a good equilibrium between the F1 and AUC scores. So takeaway one is that the macro F1 score improves dramatically using these sampling techniques, so these models might be better compared to the balanced ones. Based on all this evaluation, the data scientist says they're going to continue working with these two models, plus the balanced bagging one, and make further comparisons. So we keep refining our evaluation work: we train the models one more time, doing a train-test split again, for each particular approach and model, and then we print out what is called a classification report. This is basically a summary of all those metrics I talked about just now: remember, there were several evaluation metrics, the confusion matrix, the accuracy, the precision, the recall, the AUC score. With the classification report I get a summary of all of that, and I can see all the values here for
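Here is a sketch of the simplest form of oversampling, random resampling with replacement. The notebook uses the more sophisticated SMOTE / Borderline-SMOTE and Tomek-links techniques from the imbalanced-learn library, but the balancing idea is the same:

```python
import pandas as pd
from sklearn.utils import resample

# An imbalanced toy frame: 8 "no failure" rows vs 2 "failure" rows.
df = pd.DataFrame({"Torque": range(10),
                   "Target": [0] * 8 + [1] * 2})

majority = df[df["Target"] == 0]
minority = df[df["Target"] == 1]

# Naive random oversampling: draw minority rows with replacement
# until the two classes are the same size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])

print(balanced["Target"].value_counts())  # both classes now have 8 rows
```

Undersampling is the mirror image: shrink the majority class down to the minority's size instead.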
this particular model, bagging with Tomek links; then I can do the same for another model, the random forest with Borderline-SMOTE; and then for another model, the balanced bagging. So again there's a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us. Then again we have a confusion matrix: we generate one for the bagging with Tomek-links undersampling, for the random forest with Borderline-SMOTE oversampling, and for the balanced bagging by itself, and we compare these three models using the confusion matrix to come to some conclusions. Having looked at all that data, we move on to another evaluation metric, which is the ROC AUC score. This is one of the other evaluation metrics I mentioned: it's a curve, and you look at the area underneath it, which is why it's called ROC AUC, the area under the ROC curve. The area under the curve gives us some idea about the threshold we're going to use for classification. We can examine this for the bagging classifier, for the random forest classifier, and for the balanced bagging classifier. Then, finally, we can check the classification report of this particular model. So we keep doing this over and over again: evaluating the accuracy metrics, the evaluation metrics, for all these different models, at different thresholds for classification. And as we keep drilling into these, we get more and more understanding of all these different models and which one gives the best performance for our data set. So finally we come to this conclusion: this particular model is not able to push the recall on the failure class beyond 95.8%; on the other
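The threshold tuning described here works by cutting the model's predicted probabilities at a value you choose instead of the default 0.5 (sketched below on synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real problem.
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # P(failure) for each test sample

# ROC AUC summarises ranking quality across all possible thresholds.
auc = roc_auc_score(y_te, proba)
print("ROC AUC:", round(auc, 3))

# Lowering the decision threshold below 0.5 trades precision for
# recall: more samples get called "failure", so fewer real failures
# are missed.  This is exactly the tuning discussed above.
recalls = {}
for threshold in (0.5, 0.4, 0.3):
    preds = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, preds)
    print(threshold, "recall:", round(recalls[threshold], 3))
```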
hand, balanced bagging with a decision threshold of 0.6 is able to achieve a better recall, etc. So finally, after having done all of these evaluations, this is the conclusion. Right now we have gone through all the steps of the machine learning life cycle, which means the data scientist has done the cleaning, exploration, preparation, transformation, and feature engineering; developed and trained multiple models; and evaluated all of them. So we've reached the stage where we, as the data scientist, have pretty much completed our job: we've come to some very useful conclusions that we can now share with our colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision will be made based on the recommendations coming from the data scientist at the end of this phase. So at the end of this phase, the data scientist comes up with these conclusions: if the engineering team is looking for the highest failure detection rate possible, then they should go with this particular model. If they want a balance between precision and recall, then they should choose between the bagging model with a 0.4 decision threshold or the random forest model with a 0.5 threshold. But if they don't care so much about predicting every failure and they want the highest precision possible, then they should opt for the bagging with Tomek-links classifier with a somewhat higher decision threshold. And this is the key thing the data scientist delivers; this is the key takeaway, the end
result of the entire machine learning life cycle. Now the data scientist tells the engineering team: all right, you guys, which is more important for you, point A, point B, or point C? Make your decision. The engineering team will then discuss among themselves and say: hey, you know what, what we want is the highest failure detection rate possible, because any kind of failure of that machine or product on the assembly line is really going to hurt us badly. So what we're looking for is the model that gives us the highest failure detection rate; we don't care so much about precision, but we want to make sure that if there's a failure, we catch it. That's what they want, and so the data scientist says: go for the balanced bagging model. Then the data scientist saves that model, and once it's saved, you can go right ahead and deploy it to production. And if we want, we can take this modeling problem further. Just now I modeled it as a binary classification problem, meaning it's either zero or one, either fail or not fail. But we can also model it as a multiclass classification problem, because, as I said earlier, the failure type column actually contains multiple kinds of failure: for example, you may have a power failure, a tool wear failure, or an overstrain failure. So now we can model the problem slightly differently, as a multiclass classification problem, and then we go through the same process we went through just now: we create different models and test them out, but now the confusion matrix is for a multiclass classification problem. So we're going to check them out; we're going to again try different
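Trying different algorithms for the multiclass problem typically involves hyperparameter tuning; a cross-validated grid search can be sketched like this (synthetic three-class data standing in for the failure types):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# A small synthetic multiclass problem standing in for the
# failure-type labels (3 classes instead of the real failure types).
X, y = make_classification(n_samples=300, n_informative=4, n_classes=3,
                           random_state=0)

# Grid search tries every combination of these hyperparameter values
# with cross-validation and keeps the best-scoring one.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="f1_macro",  # macro F1 treats every failure class equally
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```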
algorithms or models, again train and test on our data set, do the train-test split, and build these different models. So we have, for example, a balanced random forest and a random forest with grid search; you train the models using what is called hyperparameter tuning, and then you get the scores. You get the same evaluation scores again, you check them, compare between them, and generate a confusion matrix, this time a multiclass confusion matrix, and then you come to the final conclusion. So if you're interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist. The data scientist will say: you know what, I'm going to pick this particular model, the balanced bagging classifier, and these are all the reasons the data scientist gives as a rationale for selecting it. Once that's done, you save the model, and that's it; that's all done. So now the machine learning model can go live: you run it on the server, and it's ready to work, which means it's ready to generate predictions. That's the main job of the machine learning model: you've picked the best model, with the best evaluation metrics for whatever accuracy goal you're trying to achieve, and now you run it on a server. All this real-time data comes in from your sensors, you pump it into your machine learning model, the model pumps out a whole bunch of predictions, and you use those predictions in real time to make real-time, real-world decisions. You might say: okay, I'm predicting that this machine is going to fail on Thursday at 5:00 p.m., so you'd better get your service folks in to service it on Thursday at 2:00 p.m.,
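Saving the chosen model and serving predictions on fresh sensor readings can be sketched with joblib, which ships alongside scikit-learn (the file name and use of a temp directory are my own choices for illustration):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the chosen, fully trained model.
X, y = make_classification(n_samples=200, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Persist the chosen model to disk ...
path = os.path.join(tempfile.gettempdir(), "failure_model.joblib")
joblib.dump(model, path)

# ... and, on the production server, load it back and serve
# predictions on incoming sensor readings as they stream in.
served = joblib.load(path)
print(served.predict(X[:5]))
```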
or whenever. So you can make decisions about when to do your maintenance, and make the best decisions to optimize the cost of maintenance, etc. And then, based on the results coming out of the predictions, which may be good, lousy, or average, we are constantly monitoring how good or how useful the predictions generated by this real-time model running on the server are. Based on our monitoring, we then take some new data and repeat this entire life cycle again. So this workflow is iterative: the data scientist is constantly taking in new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. And that's it; that is basically your machine learning workflow in a nutshell. For this particular project we used a bunch of data science libraries from Python. We used pandas, the most basic data science library, which provides all the tools to work with raw data. We used NumPy, a high-performance library for implementing complex array and matrix operations. We used Matplotlib and Seaborn, which are used in the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data. We used scikit-learn, the machine learning library that provides the implementations of all the core machine learning algorithms. We did not use deep learning libraries here, because this is not a deep learning problem; but if you're working on a deep learning problem, like image classification, image recognition, object detection, or natural language processing and text classification, then you'd use the deep learning libraries for Python, which are TensorFlow and PyTorch. And then, lastly, that whole data science project that you saw just now:
this entire data science project was actually developed in something called a Jupyter notebook. All the Python code, along with all the observations from the data scientist, for this entire project was run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects. Okay, so that brings me to the end of this presentation. I hope you found it useful, and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. All right, thank you all so much for watching.