Hello everyone, my name is Victor. I'm
your friendly neighborhood data
scientist from DreamCatcher. So in this
presentation, I would like to talk about
a specific industry use case of AI or
machine learning which is predictive
maintenance. So I will be covering these
topics and feel free to jump forward to
the specific part in the video where I
talk about all these topics. So I'm going
to start off with a general preview of
AI and machine learning. Then, I'll
discuss the use case which is predictive
maintenance. I'll talk about the basics
of machine learning, the workflow of
machine learning, and then we will come
to the meat of this presentation which
is essentially a demonstration of the
machine learning workflow from end to
end on a real life predictive
maintenance domain problem. All right, so
without any further ado, let's jump into
it. So let's start off with a quick
preview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, about how we use AI in our daily work, we are really going to be talking about machine
learning. So machine learning is the
design and application of software
algorithms that are capable of learning
on their own without any explicit human
intervention. And the primary purpose of these algorithms is to optimize performance in a specific task. The main task we want to optimize is making accurate predictions about future outcomes based on the analysis of historical data. So essentially, machine
learning is about making predictions
about the future or what we call
predictive analytics.
And there are many different
kinds of algorithms that are available in
machine learning under the three primary
categories of supervised learning,
unsupervised learning, and reinforcement
learning. And here we can see some of the different kinds of algorithms and their use cases in various areas of industry. We have various domain use cases for all these different kinds of algorithms, and we can see that different algorithms are suited to different use cases.
Deep learning is an advanced form
of machine learning that's based on
something called an artificial neural
network or ANN for short, and this
essentially simulates the structure of
the human brain whereby neurons
interconnect and work together to
process and learn new information. So DL
is the foundational technology for most
of the popular AI tools that you
probably have heard of today. So I'm sure
you have heard of ChatGPT if you haven't
been living in a cave for the past 2
years. And yeah, so ChatGPT is an example
of what we call a large language model
and that's based on this technology
called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, also use this particular form of machine learning
called deep learning. So here is an example of an artificial neural network. Here I have an image of a bird that's fed into the network, and the output from the network is a classification of this image into one of three potential categories. In this case, if the ANN has been trained properly, when we feed in this image, the ANN should correctly classify it as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision. And just like in the case of machine learning, there are a variety of algorithms available for deep learning, under the categories of supervised learning and also unsupervised learning. All right, so this is how we can categorize all of this: you can think of AI as the general area of smart systems, machine learning as basically applied AI, and deep learning as a subspecialization of machine learning using a particular architecture called an artificial neural
network. And generative AI, meaning ChatGPT, Google Gemini, Microsoft Copilot: all these examples of generative AI are basically large language models, and they are a further subcategory within the area of deep learning. And there are many applications of machine learning in industry right now, so pick whichever industry you are involved in, and these are the specific areas of application. So I'm probably
going to guess that the vast majority of you watching this video are coming from the manufacturing industry, and in the manufacturing industry some of the standard use cases for machine learning and deep learning are predicting potential problems (sometimes called predictive maintenance, where you want to predict when a problem is going to happen and address it before it happens), monitoring systems, automating your manufacturing assembly line or production line, smart scheduling, and detecting anomalies on your production
line. Okay, so let's talk about the use case here, which is predictive maintenance. So what is predictive maintenance? Well, here's the long definition: predictive maintenance is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. It uses advanced data models, analytics, and machine learning, whereby we can reliably assess when failures are more likely to occur, including which components are more likely to be affected on your production or assembly
line. So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that production and assembly lines in factories tended to handle maintenance issues, say, 10 or 20 years ago. What you would probably start off with is the most basic mode, which is reactive maintenance: you just wait until your machine breaks down, and then you repair it. The simplest approach, but of course, if you have worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline; then you're going to have a backlog of orders and you're going to run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem then is that sometimes you're doing too much maintenance that isn't really necessary, and it still doesn't totally prevent a failure of the machine that occurs outside of your planned maintenance. So, a bit of an improvement, but not that much better.
And these last two categories are where we bring in AI and machine learning. With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us predict, with a reasonable level of accuracy, when the next failure is going to happen on your assembly or production line, for a specific component or a specific machine. You want to predict, to a high level of accuracy, maybe to the specific day, even the specific hour or minute, when you expect that particular product or machine to fail. All
right, so these are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption of productivity, optimizes the time spent on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data that you're getting. Okay, so we're
going to take a look at some real-life use cases. These are a bunch of links here, so if you navigate to them you'll be able to get a look at some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases, and you can click on these links and follow up with them if you want to read more: waste management, manufacturing, building services, renewable energy, and also mining. So these are all use cases; if you want to know more about them, you can read up on them from that website. And this next website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance. Here it tells you about an industry survey of predictive maintenance, and we can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, that predictive maintenance is essential for the manufacturing industry, and that it will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: the vast majority of key industry players in the manufacturing sector consider predictive maintenance to be a very important activity that they want to incorporate into their workflow.
And we can see here the kind of ROI that we can expect on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. And best of all, if you really want to take a look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, and so on. You can jump over here and take a look at some of these use cases. Let me try to open one up, for example Mondi. You can see Mondi has used this particular piece of software called MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning, and you can study how they have used it and how it works: what their challenge was, the problems they were facing, and the solution they built using MathWorks Consulting, with the data they collected in an Oracle database. Using MathWorks tools from MATLAB, they were able to create a deep learning model to solve this particular issue for their domain. So if you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for
real-life use cases for predictive maintenance. Now in this topic I'm going to talk about machine learning basics: what is actually involved in machine learning. I'm going to give a very quick, high-level conceptual overview of machine learning. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category of machine learning, which is called supervised learning. The particular use case that I'm going to be discussing, predictive maintenance, is basically a form of supervised learning.
So how does supervised learning work? Well, in supervised learning you're going to create a machine learning model by providing what is called a labeled data set as input to a machine learning program or algorithm. This data set is going to contain a set of what are called independent or feature variables, and there will be one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process that I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's jump into a diagram that maybe illustrates
this more clearly. So let's say you have a data set here, an Excel spreadsheet, and this spreadsheet has a bunch of columns and a bunch of rows. These rows represent what we call observations, or samples, or data points, in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall, so they've got all this information about the customers who purchase products at this mall. Some of the information they've gathered about the customers is their gender, their age, their income, and their number of children. All this information about the customers is what we call the independent or feature variables. And alongside all this information about each customer, we also record how much the customer spends. This information, these numbers here, is what we call the target variable or the dependent variable. So one single row, one single data point, contains all the values for the feature variables and one single value for the label or target variable. And the primary purpose of the machine learning model is to create a mapping from all your feature variables to your target variable: somehow there's going to be a mathematical function that maps the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. Okay, so this whole thing,
this training process, is what we call fitting the model. And the target variable or label, this column here, these values, is critical for providing the context for the fitting or training of the model. Once you've got a trained and fitted model, you can then use the model to make an accurate prediction of target values corresponding to new feature values that the model has yet to encounter, yet to see. And this, as I've already said earlier, is called predictive analytics. Okay, so let's see what's
actually happening here. You take your training data, this whole data set here consisting of a thousand rows of data, ten thousand rows of data, and you feed this entire data set into your machine learning algorithm. A couple of hours later, your machine learning algorithm comes out with a model, and the model is essentially a function that maps all your feature variables, these four columns here, to your target variable, this one single column here. Once you have the model, you can put in a new data point. Basically, the new data point represents data about a new customer, a customer that you have never seen before. Let's say you've already got information about 10,000 customers that have visited this mall, and how much each of those 10,000 customers spent when they were at this mall. Now you have a totally new customer that comes into the mall, a customer who has never come into this mall before, and what we know about this customer is that he is male, his age is 50, his income is 18, and he has nine children. When you take this data and pump it into your model, your model is going to make a prediction. It's going to say: based on everything I have been trained on and the model I've developed, I am going to predict that a male customer of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want, right here: that is the final output of your machine learning model. It's going to make a prediction about something that it has never seen before. That is essentially the core of machine learning: predictive analytics, making predictions about the future based on a historical data set.
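To make this concrete, here is a minimal sketch of the idea in Python. This is a hypothetical, simplified version of the mall example: it uses only one feature variable (income, with made-up numbers) and fits the mapping with ordinary least squares. A real project would use all four features and a library such as scikit-learn.

```python
# Minimal "supervised learning" sketch: fit a function mapping one
# feature (income) to the target (spending) using least squares.
# The numbers below are invented for illustration.
incomes  = [10, 12, 15, 18, 20, 25]   # feature variable
spending = [14, 17, 22, 26, 29, 37]   # target variable (label)

n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(spending) / n

# Slope and intercept of the best-fit line y = a*x + b
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, spending)) \
    / sum((x - mean_x) ** 2 for x in incomes)
b = mean_y - a * mean_x

def predict(income):
    """The trained 'model': a function from feature to target."""
    return a * income + b

# Predictive analytics: a brand-new customer the model has never seen.
print(round(predict(18), 1))   # predicted spending, about 26.2
```

The "model" here really is just a fitted mathematical function, which is exactly the point made above.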
Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label. For classification, you can have either binary or multiclass. Binary would be just true or false, zero or one: is your machine going to fail, or is it not going to fail? Just two classes, two possible outcomes. Or: is the customer going to make a purchase, or not? We call this binary classification. And then we have multiclass, where there are more than two classes or types of values. For example, this here would be a classification problem: you have a data set with information about your customers, the gender of the customer, the age of the customer, the salary of the customer, and you also have a record of whether each customer made a purchase or not. You can take this data set to train a classification model, and the classification model can then make a prediction about a new customer: it's going to predict zero, which means the customer won't make a purchase, or one, which means the customer will make a purchase. And this is regression: let's say you want to predict the wind speed, and you've got historical data for four other independent or feature variables, so you have recorded the temperature, the pressure, the relative humidity, and the wind direction for the past 10 days, 15 days, whatever. Now you're going to train your machine learning model using this data set, and the target variable column, the label, is basically a number. With a numeric target this is a regression problem, so you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will then predict the wind speed for that new data point. So that's a regression model.
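And here is a minimal, hypothetical sketch of binary classification in Python: a one-feature "decision stump" that learns a salary threshold separating buyers from non-buyers. The data is invented for illustration; a real project would use all the features and a library such as scikit-learn.

```python
# Toy labeled data: (salary, purchased?) where 1 = made a purchase.
data = [(20, 0), (25, 0), (31, 0), (45, 1), (52, 1), (60, 1)]

def train_stump(samples):
    """Pick the salary threshold that classifies the training data best."""
    best_thr, best_acc = None, -1.0
    for thr, _ in samples:
        correct = sum((salary >= thr) == bool(label)
                      for salary, label in samples)
        acc = correct / len(samples)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr

threshold = train_stump(data)

def predict(salary):
    # 1 = will make a purchase, 0 = won't
    return 1 if salary >= threshold else 0

print(threshold, predict(30), predict(58))   # 45 0 1
```

The output for a new customer is exactly the zero-or-one class label described above.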
All right, so in this topic I'm going to talk about the workflow that's involved in machine learning. In the previous slides I talked about developing the model, but that's just one part of the entire workflow. In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is you need to get your data, then you need to clean your data, and then you need to explore your data: you need to see what's going on in your data set. Real-life data sets are not trivial; they have hundreds of rows, thousands of rows, sometimes millions or billions of rows. We're talking about millions or billions of data points, especially if you're using IoT sensors to get data in real time. So you've got all these super-large data sets; you need to clean them and explore them, and then you need to prepare them into the right format so that you can put them into the training process to create your machine learning model. Then, subsequently, you check how good the model is: how accurate is the model in terms of its ability to generate predictions for the future? That's validating or evaluating your model. Then, if you determine that your model is of adequate accuracy to meet your domain use case requirements (say the required accuracy for your domain is 85%, and your model can give an 85% accuracy rate, so you decide it's good enough), you deploy it into a real-world use case. Here the machine learning model gets deployed on a server, data is captured from other sources and pumped into the machine learning model, the machine learning model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other scenario. And then you constantly monitor and update the model: you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow. In a
nutshell, here's another view of the same thing, maybe in a slightly different format. Again, you have your data collection and preparation; here we say more about the different kinds of algorithms that are available to create a model, and I'll discuss this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. Once you have chosen the appropriate algorithm, you train your model, and then you select the appropriate trained model from among the multiple models, because you are probably going to develop multiple models from multiple algorithms. You're going to evaluate them all, and then you're going to say: after I've evaluated and tested them, I've chosen the best model, and I'm going to deploy that model for real-life production use. Real-life sensor data is going to be pumped into my model, my model is going to generate predictions, the predicted data is going to be used immediately in real time for real-life decision making, and then I'm going to monitor the results. Somebody is using the predictions from my model: if the predictions are lousy, the monitoring system captures that, and if the predictions are fantastic, that is also captured by the monitoring system, and that gets fed back into the next cycle of my machine learning pipeline.
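As a rough sketch, the cycle just described can be laid out as a loop of stages. This is schematic Python with made-up stand-in functions for each stage, not a real deployment.

```python
# Schematic end-to-end ML cycle: every function here is a stand-in
# for a real stage (data collection, cleaning, training, monitoring).
def collect(raw):            return list(raw)
def clean(rows):             return [r for r in rows if r is not None]
def train(rows):             return {"mean": sum(rows) / len(rows)}
def evaluate(model, rows):   # toy metric: fraction within 1.0 of the mean
    return sum(abs(r - model["mean"]) <= 1.0 for r in rows) / len(rows)

def one_cycle(raw_data):
    rows = clean(collect(raw_data))
    model = train(rows)
    score = evaluate(model, rows)
    return model, score        # monitoring feeds the next cycle

model, score = one_cycle([4.0, 5.0, None, 5.5, 4.5])
print(model["mean"], score)
```

In practice each stand-in becomes a substantial piece of engineering, but the loop structure stays the same.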
Okay, so that's the overall view, and here are the key phases of your workflow. One of the important phases is called EDA, exploratory data analysis, and in this phase you're going to do a lot of work primarily just to understand your data set. Like I said, real-life data sets tend to be very complex, and they have various statistical properties; statistics is a very important component of machine learning. So an EDA helps you get an overview of your data set and of any problems in it: any data that's missing, the statistical properties of your data set, the distribution of your data set, the statistical correlation of variables in your data set, and so on.
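In practice you would reach for something like pandas' `describe()` for this, but the idea of an EDA summary can be sketched with the standard library on one invented column:

```python
# Quick-and-dirty EDA on one numeric column (a made-up "temperature"
# reading, with a missing value represented as None).
import statistics

temperature = [24.1, 25.0, None, 23.8, 26.2, 24.9]

present = [t for t in temperature if t is not None]
missing = len(temperature) - len(present)

print("rows:   ", len(temperature))
print("missing:", missing)
print("mean:   ", round(statistics.mean(present), 2))
print("stdev:  ", round(statistics.stdev(present), 2))
print("min/max:", min(present), max(present))
```

Even this tiny summary already surfaces a missing value, which leads directly into the next phase.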
Okay, then we have data cleaning, or sometimes you call it data cleansing. In this phase, what you want to do primarily is things like removing duplicate records or rows in your table, making sure that your data points or samples have appropriate IDs, and, most importantly, making sure there are not too many missing values in your data set. What I mean by missing values is things like this: you've got a data set, and for some reason there are cells or locations in your data set with missing values. If you have a lot of these missing values, then you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set, and how to handle them. Another thing that's important in data cleansing is figuring out the outliers in your data set. Outliers are data points that lie very far from the general trend of the data points in your data set. There are several ways to detect outliers, several ways to handle them, and similarly several ways to handle missing values. Handling missing values and handling outliers are really the two key tasks of data cleansing, and there are many, many techniques for this, so a data scientist needs to be acquainted with all of them. All right, why do I need to
do data cleansing? Well, here is the key point: if you have a very poor-quality data set, which means you've got a lot of outliers that are errors in your data set, or a lot of missing values, then even if you've got a fantastic algorithm and a fantastic model, the predictions that your model gives are going to be absolutely rubbish. It's kind of like taking water and putting it into the tank of a Mercedes-Benz. A Mercedes-Benz is a great car, but if you put water into it, it will just die; your car can't run on water. On the other hand, take a Myvi. A Myvi is just a lousy car, but if you put good high-octane petrol into it, the Myvi will go at a hundred miles an hour and completely destroy the Mercedes-Benz in terms of performance. So it doesn't really matter what model you're using: you can be using the most fantastic model, the Mercedes-Benz of machine learning, but if your data is of lousy quality, your predictions are also going to be rubbish. So cleansing the data set is in fact probably the most important thing that data scientists need to do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithms and so on is really a small portion of the actual machine learning workflow; the vast majority of the time goes into cleaning and organizing your data.
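The two cleansing tasks above, filling missing values and flagging outliers, can be sketched in plain Python using median imputation and the common 1.5 × IQR rule. The readings are invented; real pipelines would typically do this with pandas.

```python
import statistics

readings = [10.2, 9.8, None, 10.5, 58.0, 10.1, None, 9.9]

# 1) Handle missing values: impute with the median of the present values.
present = [r for r in readings if r is not None]
median = statistics.median(present)
filled = [median if r is None else r for r in readings]

# 2) Flag outliers with the 1.5 * IQR (interquartile range) rule.
q1, q2, q3 = statistics.quantiles(present, n=4)  # quartiles
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [r for r in filled if r < low or r > high]

print(median, outliers)   # the 58.0 reading is flagged
```

Whether to drop, cap, or keep a flagged value is then a domain decision, which is why this phase needs human judgment.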
Then you have something called feature engineering, where you pre-process the feature variables of your original data set prior to using them to train the model, either through addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you may need to transform categorical data into numeric data. Now, in the earlier slides I showed you that you take your original data set, pump it into an algorithm, and a couple of hours later you get a machine learning model; you didn't do anything to the feature variables in your data set before feeding it to the machine learning algorithm. So what I showed you earlier was taking the data set exactly as it is and pumping it into the algorithm. But that's not what generally happens in real life. In real life, you're going to take all the original feature variables from your data set and transform them in some way. You can see here the columns of data from my original data set, and before I actually put all these data points into my algorithm to train and get my model, I will transform them. This transformation of the feature variable values is what we call feature engineering, and there are many, many techniques for it: one-hot encoding, scaling, log transformation, discretization, date extraction, Boolean logic, and so on.
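For instance, one-hot encoding (the first technique in that list) turns a categorical column into numeric 0/1 columns. A minimal sketch with invented data; pandas' `get_dummies` does this in one call:

```python
# One-hot encode a categorical "type" column into 0/1 feature columns.
types = ["L", "M", "H", "M", "L"]          # e.g. machine quality grades

categories = sorted(set(types))            # ['H', 'L', 'M']
encoded = [[1 if t == c else 0 for c in categories] for t in types]

print(categories)
print(encoded[0])   # "L" becomes [0, 1, 0]
```

After this transformation, models that only accept numbers can use the categorical information.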
Okay, then finally we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set, and the other is called the test data set. The primary purpose of this is that when we feed and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you're not going to have just one possible model: there's a vast range of algorithms you can use to create a model, so fundamentally you have a wide range of choices. It's like a wide range of cars: if you want to buy a car, you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a Beemer; many different cars are available to you. Same thing with a machine learning model: there's a vast variety of algorithms you can choose from in order to create a model. And once you create a model from a given algorithm, you need to ask: how accurate is this model I've created from this algorithm? Different algorithms are going to create different models with different rates of accuracy, and so the primary purpose of the test data set is to evaluate the accuracy of the model, to see whether the model created using this algorithm is adequate for use in a real-life production use case. So that's what it's all about. So this is my original data set: I break it into my feature variable columns and my target variable column, and then I further break it into a training data set and a test data set. The training data set is used to train, to create, the machine learning model, and once the machine learning model is created, I then use the test data set to evaluate the accuracy of the machine learning model.
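With scikit-learn this split is one call to `train_test_split`; as a dependency-free sketch, here is an 80/20 split of a toy data set of ten rows, shuffled first so the split is random but reproducible:

```python
import random

rows = list(range(10))          # stand-ins for 10 labeled data points

random.Random(42).shuffle(rows) # shuffle reproducibly before splitting
cut = int(len(rows) * 0.8)      # 80% for training, 20% for testing
train, test = rows[:cut], rows[cut:]

print(len(train), len(test))    # 8 2
```

The held-out test rows are never shown to the training step, which is what makes the accuracy estimate honest.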
All right. And then finally, we can see the different parts or aspects that go into a successful model: EDA, about 10%; data cleansing, about 20%; feature engineering, about 25%; selecting a specific algorithm, about 10%; training the model from that algorithm, about 15%; and finally evaluating the models and deciding which is the best one, with the highest accuracy rate, about 20%. All right, so we have reached the
most interesting part of this presentation, which is the demonstration of an end-to-end machine learning workflow on a real-life data set that demonstrates the use case of predictive maintenance. For this use case I've used a data set from Kaggle. For those of you who are not aware of it, Kaggle is the world's largest open-source community for data science and AI. They have a large collection of data sets from all sorts of areas of industry and human endeavor, and they also have a large collection of models that have been developed using these data sets. So here we have a data set for our particular use case, predictive maintenance. This is some information about the data set, and in case you do not know how to get there, this is the URL to click on. Once you are at the page for this data set, you can see all the information about it, and you can download the data set in CSV format.
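Once downloaded, the first step is usually to load the CSV and check the balance of the target column. Here is a sketch with a few invented rows standing in for the real file (the column names mirror the Kaggle data set; with pandas you would use `read_csv` and `value_counts`):

```python
import csv, io
from collections import Counter

# A few made-up rows in the same shape as the Kaggle file; in practice
# you would open the downloaded CSV file instead of this inline string.
sample_csv = """Type,Air temperature,Process temperature,Torque,Tool wear,Target
M,298.1,308.6,42.8,0,0
L,298.2,308.7,46.3,3,0
L,298.3,308.5,49.4,5,1
M,298.5,309.0,39.5,7,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts = Counter(row["Target"] for row in rows)

print(len(rows), dict(counts))   # mostly 0 (no failure)
```

On the real file this count makes the class imbalance described below immediately visible.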
format okay so let's take a look at the
data set. So this data set has a total of 10,000 samples, and these are the feature variables: the type, the product ID, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear — and this is the target variable. The target variable is what we are interested in: what we use to train the machine learning model, and also what we want to predict. The feature variables describe, or provide information about, a particular machine on the production line, on the assembly line. So you might know the product ID, the type, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. Let's say you've got an IoT sensor system that's capturing all this data about a product or a machine on your production or assembly line, and you've also captured, for each specific sample, whether that sample experienced a failure or not. A target value of zero indicates that there's no failure — zero means no failure — and we can see that the vast majority of data points in this data set are no-failure. And here we can see an example of a failure: a failure is marked as one (positive) and no failure is marked as zero (negative).
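To make this concrete, here's a minimal sketch of how you might inspect this data set with pandas. The column names follow the Kaggle data set, but treat the CSV file name as an assumption, and note that a tiny synthetic frame stands in for the real 10,000-row file here:

```python
import pandas as pd

# In practice you would load the downloaded Kaggle CSV, e.g.:
#   df = pd.read_csv("predictive_maintenance.csv")   # file name is an assumption
# Here we build a tiny synthetic stand-in with the same kind of columns,
# just to show the quick-inspection steps.
df = pd.DataFrame({
    "Product ID": ["M14860", "L47181", "H29424", "M14865"],
    "Type": ["M", "L", "H", "M"],
    "Air temperature [K]": [298.1, 298.2, 298.3, 298.5],
    "Process temperature [K]": [308.6, 308.7, 308.5, 309.0],
    "Rotational speed [rpm]": [1551, 1408, 1498, 1433],
    "Torque [Nm]": [42.8, 46.3, 49.4, 39.5],
    "Tool wear [min]": [0, 3, 5, 7],
    "Target": [0, 0, 1, 0],
    "Failure Type": ["No Failure", "No Failure", "Power Failure", "No Failure"],
})

print(df.shape)                           # number of rows and columns
print(df["Target"].value_counts())        # 0 = no failure, 1 = failure
print(df["Failure Type"].value_counts())  # breakdown by failure category
```

On the real file, the `value_counts()` calls are what show you the heavy majority of zeros in the target column.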
All right, so here we have one type of failure, called a power failure, and if you scroll down the data set you'll see there are also other kinds of failures, like a tool wear failure; we have an overstrain failure here, for example, and we have another power failure, and so on. So if you scroll down through these 10,000 data points — or if you're familiar with using Excel to filter values in a column — you can see that in this particular column, the so-called target variable column, the vast majority of values are zero, which means no failure, and some of the rows, or data points, have a value of one. And for those rows with a value of one, for example here, you're going to have different types of failure — like I said just now: power failure, tool wear failure, and so on. So we're going to go through the entire machine learning workflow with this data set, and to see an example of that, we're going to go to the code section here. All right, so if I
click on the code section here, right down here we see what is called a data set notebook. This is basically a Jupyter notebook — Jupyter is a Python application which allows you to create a Python machine learning program that builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it. So here we have a whole bunch of Jupyter notebooks available, and you can select any one of them; all of these notebooks essentially process the data from this particular data set. If I go to this code page, I've actually selected a specific notebook that I'm going to run through, to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python programming language. The particular notebook I'm going to use is this one here, and you can also get the URL for that particular notebook from here. Okay, so let's do a quick
revision again of what we're trying to do here: we're trying to build a machine learning classification model. We said there are two primary areas of supervised learning. One is regression, which is used to predict a numerical target variable; the second kind of supervised learning is classification, which is what we're doing here — we're trying to predict a categorical target variable. In this particular example we actually have two ways we can classify: either binary classification or multiclass classification. For binary classification, we are only going to classify the product or machine as either it failed or it did not fail. If we go back to the data set that I showed you just now and look at this target variable column, there are only two possible values: zero or one. Zero means there's no failure, one means there's a failure. So this is an example of binary classification — only two possible outcomes, zero or one, didn't fail or failed. And then, for the same data set, we can extend it and make it a multiclass classification problem. If we want to drill down further, we can say that not only is there a failure, there are actually different types of failures. So we have one category, or class, that is basically no failure, and then we have a category for each of the different types of failures: you can have a power failure, you could have a tool wear failure, you could have — let's go down here — an overstrain failure, and so on. So you can have multiple classes of failure in addition to the overall majority class of no failure, and that would be a multiclass classification problem. With this data set, we're going to see how to treat it as a binary classification problem and also as a multiclass classification problem. Okay, so let's look at the workflow. Let's say we've
already got the data. Right now we do have the data set — this is the data set that we have — so let's assume we've somehow managed to get this data set from some IoT sensors that are monitoring real-time data in our production environment: on the assembly line, on the production line, we've got sensors reading data, and that gives us all the data we have in this CSV file. So we've already got the data, we've retrieved the data; now we're going to go on to the cleaning and exploration part of the machine learning life cycle. Let's look at the data cleaning part. In data cleaning, we're interested in checking for missing values, and maybe removing the rows with missing values.
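The two standard options just mentioned — dropping the rows, or filling in a replacement such as the column average — look like this in pandas (a toy frame with a deliberately planted gap; whether this particular data set actually has missing values is something the notebook checks):

```python
import numpy as np
import pandas as pd

# A small frame with deliberately missing values, to illustrate the options
df = pd.DataFrame({
    "Rotational speed [rpm]": [1551, np.nan, 1498],
    "Torque [Nm]": [42.8, 46.3, np.nan],
})

print(df.isna().sum())  # count missing values per column

# Option 1: remove the rows that contain missing values
dropped = df.dropna()

# Option 2: replace missing values with the average of that column
filled = df.fillna(df.mean(numeric_only=True))

print(len(dropped))                   # only the complete row survives
print(filled["Torque [Nm]"].iloc[2])  # the gap filled with the column mean
```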
So, the kinds of things we can do with missing values: we can remove the rows with missing values, or we can put in some replacement values — which could be the average of all the values in that particular column, and so on. We also try to identify outliers in our data set, and there are a variety of ways to deal with those too. This is called data cleansing, which is a really important part of your machine learning workflow. That's where we are now — we're doing cleansing — and then we're going to follow up with exploration. So let's look at the actual
code that does the cleansing. Here we are, right at the start of the machine learning life cycle. This is a Jupyter notebook, and here we have a brief description of the problem statement: this data set reflects real-life predictive maintenance encountered in industry, with measurements from real equipment, and the feature descriptions are taken directly from the data set source. So here we have a description of the six key features in our data set: the type, which is the quality of the product; the air temperature; the process temperature; the rotational speed; the torque; and the tool wear. Those are the six feature variables, and then there are the two target variables. I showed you just now that there's one target variable which has only two possible values, either zero or one — failure or no failure — and that is this column here. So let me go all the way back up: this column here, we already saw, has only two values, zero or one. And then we also have this other column, which is basically the failure type; as I already demonstrated, we do have several categories, or types, of failure, and that is what we call multiclass classification. So we can either build a binary classification model for this problem domain, or we can build a multiclass classification model. All right, so this
Jupyter notebook is going to demonstrate both approaches to us. As a first step, we write the Python code that imports all the libraries we need to use — this is basically Python code importing the relevant machine learning libraries for our domain use case. Then we load in our data set, we describe it to get some quick insights, and we take a look at all the feature variables and so on. What we're doing now is just a quick overview of the data set: all this Python code allows us, the data scientists, to get a quick overview of our data — for example, how many rows there are, how many columns there are, what the data types of the columns are, and what the names of the columns are.
Then we zoom in on the target variables: we look at how many counts there are of each target value, how many different types of failures there are, and then we check whether there are any inconsistencies between the Target and the Failure Type columns. When you do all this checking, you're going to discover there are some discrepancies in your data set. Using some specific Python code to do the checking, you're going to say: hey, you know what, there are some errors here — there are nine values that are classified as a failure in the Target variable but as no failure in the Failure Type variable. That means there's a discrepancy in those data points. These are all the ones with discrepancies: the target variable says one, and we already know a target value of one is supposed to mean a failure, so we're expecting to see a failure classification — but some rows actually say there's no failure even though the target is one. This is a classic example of an error that can very well occur in a data set.
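A consistency check like that can be sketched in a few lines of pandas (column names follow the Kaggle data set; the toy frame here is an assumption standing in for the real one):

```python
import pandas as pd

df = pd.DataFrame({
    "Target": [0, 1, 1, 0],
    "Failure Type": ["No Failure", "Power Failure", "No Failure", "No Failure"],
})

# Rows that claim a failure in Target but "No Failure" in Failure Type
inconsistent = df[(df["Target"] == 1) & (df["Failure Type"] == "No Failure")]
print(len(inconsistent))  # 1 discrepant row in this toy example

# Remove the discrepant instances, as the notebook's author chose to do
df_clean = df.drop(index=inconsistent.index)
print(len(df_clean))
```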
So now the question is: what do you do with these errors in your data set? Here the data scientist says: I think it would make sense to remove those instances. And so they write some code to remove those instances — those rows, or data points — from the overall data set. In the same way, we can check for other issues: we find there's another issue with our data set, another warning, so again we can remove those rows. In total you're going to remove 27 instances, or rows, from your overall data set. Your data set has 10,000 rows; you're removing 27, which is only 0.27% of the entire data set, and these were the reasons why you removed them. If you're removing just 0.27% of the entire data set, that's no big deal — still okay — but you needed to remove them, because these 27 data points with errors could really affect the training of your machine learning model. So we need to do our data cleansing: we are now actually cleansing out data that is incorrect or erroneous in the original data set. Okay, so then we go on to the
next part, which is called EDA — exploratory data analysis. EDA is where we explore our data: we want to get a visual overview of the data as a whole, and also take a look at its statistical properties — the statistical distribution of the data in the various columns, the correlation between the feature variables in the different columns, and also between the feature variables and the target variable. All of this is called EDA, and EDA in a machine learning workflow is typically done through visualization.
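As a sketch of the numeric side of this EDA: the notebook draws these relationships as seaborn pair plots, heat maps, and violin charts, but underneath it all are pairwise correlations, which we can compute directly. The synthetic data below is an assumption, built so that torque and rotational speed are inversely related, as the presentation notes the real data is:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
speed = rng.normal(1500, 100, 200)
df = pd.DataFrame({
    "Rotational speed [rpm]": speed,
    # Torque made inversely related to speed, mimicking the real data
    "Torque [Nm]": 60000 / speed + rng.normal(0, 1, 200),
    "Air temperature [K]": rng.normal(300, 2, 200),
})

# Pairwise correlations -- the numeric counterpart of a pair plot / heat map
corr = df.corr()
print(corr.round(2))

# In the notebook this is typically drawn with seaborn, e.g.:
#   sns.pairplot(df)
#   sns.heatmap(df.corr(), annot=True)
```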
All right, so let's go back here and take a look. For example, here we are looking at correlation: we plot the values of all the various feature variables against each other and look for potential correlations and patterns, and all the different shapes that you see in this pair plot have different statistical meanings. The data scientist has to visually inspect the pair plot and make some interpretations of the different patterns they see. These are some of the insights that can be deduced from looking at these patterns: for example, the torque and rotational speed are highly correlated, the process temperature and air temperature are highly correlated, and failures occur at extreme values of some features, and so on. Then you can plot certain kinds of charts — this one is called a violin chart — to get further insights: for example, regarding the torque and rotational speed, we can see again that most failures are triggered at values much lower or much higher than the mean of the non-failing cases. So all these visualizations are there, and a trained data scientist can look at them, inspect them, and make insightful deductions from them. Then there's the percentage of failures, the correlation heat map between all the different feature variables and the target variable, the product types, the percentage of each product type, and the percentage of failure with respect to product type — we can visualize that as well. Certain product types have a higher ratio of failure compared to others; for example, type M tends to fail more than type H products. So we can create a vast variety of visualizations in the EDA stage, and the idea of this visualization is just to give us some preliminary insight into our data set that helps us model it more correctly. Those are some more insights that we get into our data set from all this visualization.
Then we can plot the distributions, so we can see whether a variable has a normal distribution or some other kind of distribution, and we can use box plots to see whether there are any outliers in the data set. From the box plots we can see that rotational speed and torque have outliers, and we already saw that outliers are a problem you may need to tackle — outliers are an issue, and dealing with them is part of data cleansing. So we check where the potential outliers are, analyzing them from the box plot. But then we can say: well, they are outliers, but maybe they're not really horrible outliers, so we can tolerate them — or maybe we want to remove them. We can look at the mean and maximum values with respect to product type, and how many points sit above those values or are highly correlated with the product type in terms of the maximum and minimum. And so the insight is: about 4.87% of the instances are outliers, and maybe 4.87% is not really that much — the outliers are not horrible — so we just leave them in the data set. Now, for a different data set the data scientist could come to a different conclusion, and then they would do whatever they deem appropriate to cleanse the data set.
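The outlier check that a box plot visualizes is the usual 1.5 × IQR rule, which can be sketched like this (synthetic data with a few planted extremes — the resulting percentage will differ from the notebook's 4.87%):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# 197 normal readings plus 3 planted extreme values
speed = pd.Series(np.concatenate([rng.normal(1500, 100, 197),
                                  [2500, 2600, 2800]]))

# The 1.5 * IQR rule that a box plot's whiskers visualize
q1, q3 = speed.quantile(0.25), speed.quantile(0.75)
iqr = q3 - q1
outliers = speed[(speed < q1 - 1.5 * iqr) | (speed > q3 + 1.5 * iqr)]

print(len(outliers), "outliers")
print(f"{100 * len(outliers) / len(speed):.2f}% of instances")
```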
Okay, so now that we have done all the EDA, the next thing we're going to do is what is called feature engineering. We are going to transform our original feature variables — these are our original feature variables — into some other form before we feed them, for training, into our machine learning algorithm. So this is an example of an original data set, and these are some examples — you don't have to use all of them — of what we call feature engineering, where you transform the original values in your feature variables into these transformed values. We're going to do pretty much that here: we have ordinal encoding, and we do scaling of the data — the data set is scaled using min-max scaling.
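These two transformations can be sketched with scikit-learn. The explicit L/M/H ordering for the Type column is an assumption, based on the data set describing type as low/medium/high product quality:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

df = pd.DataFrame({
    "Type": ["L", "M", "H", "M"],         # product quality: Low / Medium / High
    "Torque [Nm]": [40.0, 45.0, 50.0, 60.0],
})

# Ordinal encoding: map the quality categories onto 0, 1, 2 in a meaningful order
enc = OrdinalEncoder(categories=[["L", "M", "H"]])
df["Type"] = enc.fit_transform(df[["Type"]]).ravel()

# Min-max scaling: squeeze each numeric column into the [0, 1] range
scaler = MinMaxScaler()
df[["Torque [Nm]"]] = scaler.fit_transform(df[["Torque [Nm]"]])

print(df)
```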
And then finally we come to the modeling, so we have to split our data set into a training data set and a test data set.
Coming back to this again: we said that before you train your model, you have to take your original data set — which by now is a feature-engineered data set — and break it into two or more subsets. One is the training data set, which we use to fit and train the machine learning model; the second is the test data set, used to evaluate the accuracy of the model. So we've got the training data set and the test data set, and we also need to sample: from our original data set, we need to sample some points that go into the training data set and some points that go into the test data set. There are many ways to do sampling. One way is stratified sampling, where we ensure the same proportion of data from each stratum, or class — because right now we have a multiclass classification problem, you want to make sure that each class appears in the training and test data sets in the same proportion as in the original data set — which is very useful for dealing with what is called an imbalanced data set. Here we have an example of an imbalanced data set, in the sense that the vast majority of data points have the value of zero for their target variable column, and only an extremely small minority actually have the value of one. A situation where the vast majority of values in your class, or target variable, column are from one class and a tiny minority are from another class is what we call an imbalanced data set, and for an imbalanced data set we typically use a specific technique for the train-test split, which is stratified sampling. And that's exactly what's happening here: we're doing a train-test split, and it's a stratified split. And then we actually develop the models — now that we've got the train-test split, here is where we actually train the models.
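A stratified train-test split is essentially one line with scikit-learn; the toy 90/10 class ratio below mimics the imbalance just discussed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 90 "no failure" (0) and 10 "failure" (1)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 90/10 class ratio in both the train and test subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())  # both are exactly 0.10 here
```

Without `stratify=y`, a random 20% split of such a data set could easily end up with zero or four failure cases in the test set, which would distort every evaluation metric computed on it.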
Now, in terms of classification there are a whole bunch of possibilities — many different algorithms we can use to create a classification model. These are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, and ensembles. All of these are different algorithms which will create different kinds of models, which will result in different accuracy measures. So it's the goal of the data scientist to find the model that gives the best accuracy when trained on the given data set. Let's head back to our machine learning workflow. Here, basically, I'm creating a whole bunch of models: one is a random forest, one is balanced bagging, one is a boosting classifier, one is an ensemble classifier. Using each of these algorithms, I'm going to fit, or train, a model, and then I'm going to evaluate them — I'm going to evaluate how good each of these models is, and here you can see the evaluation data. And this is the confusion matrix, which is another
way of evaluating. So now we come to the key part, which is: how do I distinguish between all these models? I've got all these different models, built with different algorithms, all trained on the same data set — how do I tell them apart? For that, we have a whole set of common evaluation metrics for classification. These evaluation metrics tell us how good a model is in terms of its accuracy in classification. And in terms of accuracy, we actually have many different measures. You might think: well, accuracy is just accuracy — either it's accurate or it's not, right? But it's not that simple; there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. For example, the confusion matrix tells us how many true positives there are (the value is positive and the prediction is positive), how many false positives (the value is negative but the model predicts positive), how many false negatives (the model predicts negative but the value is actually positive), and how many true negatives (the model predicts negative and the true value is also negative). This is called a confusion matrix, and it's one way we assess or evaluate the performance of a classification model. That's for binary classification; we can also have a multiclass confusion matrix. Then we can measure things like accuracy: accuracy is the true positives plus the true negatives — the total number of correct predictions made by the model — divided by the total number of data points in the data set. And then you have other kinds of measures, such as recall — this is the formula for recall, and this is the formula for the F1 score — and then there's something called the ROC curve. Without going too much into the detail of what each of these entails, these are essentially different KPIs — just like in a company, where certain employees have certain KPIs that measure how effective a particular employee is. The KPIs for your machine learning models are the ROC curve, the F1 score, recall, accuracy, and the confusion matrix.
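All of these metrics come straight from scikit-learn; here's a sketch with made-up toy labels (not the notebook's actual outputs) showing how each one is computed:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]   # one false positive, one false negative
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.6, 0.9, 0.8, 0.7, 0.4]  # predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                    # 5 1 1 3

print(accuracy_score(y_true, y_pred))    # (tp + tn) / total = 0.8
print(precision_score(y_true, y_pred))   # tp / (tp + fp) = 0.75
print(recall_score(y_true, y_pred))      # tp / (tp + fn) = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
print(roc_auc_score(y_true, y_score))    # area under the ROC curve
```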
So fundamentally, after I have built my four different models, I'm going to check and evaluate them using all those different metrics — for example the F1 score, the precision score, and the recall score. For this model I can check the ROC score, the F1 score, the precision score, and the recall score; then for the next model, its ROC score, F1 score, precision score, and recall score; and so on. For every single model I've created using my training data set, I have a set of evaluation metrics I can use to judge how good that model is. Same thing here: I've got a confusion matrix, so I can use that as well to compare the four models, and then I summarize it all here. We can see from this summary that there are two top models, so as a data scientist I'm now going to focus on just these two: the bagging classifier and the random forest classifier. They have the highest F1 scores and the highest ROC AUC scores, so we can say these are the top two models in terms of accuracy, using the F1 evaluation metric and the ROC AUC evaluation metric. These results are summarized here, and then we try different sampling techniques. Just now I talked about different kinds of sampling techniques; the idea is to get a feel for the distribution of the data in different areas of the data set, so that you can make sure your evaluation of accuracy is statistically sound. So we can do what is called oversampling and undersampling, which is very useful when you're working with an imbalanced data set. This is an example of doing that.
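The notebook uses resamplers like SMOTE and Tomek links from the imbalanced-learn package; to show just the core idea without that dependency, here is a minimal random over- and undersampling sketch using NumPy alone:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 95 + [1] * 5)   # imbalanced: 95 negatives, 5 positives
X = np.arange(100).reshape(-1, 1)

# Random oversampling: duplicate minority samples until the classes balance
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=90, replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])
print(np.bincount(y_over))          # [95 95]

# Random undersampling: throw away majority samples instead
majority = np.flatnonzero(y == 0)
keep = rng.choice(majority, size=5, replace=False)
idx = np.sort(np.concatenate([keep, minority]))
X_under, y_under = X[idx], y[idx]
print(np.bincount(y_under))         # [5 5]
```

SMOTE refines the oversampling side by synthesizing new minority points between neighbors rather than duplicating rows, and Tomek links refines the undersampling side by removing only borderline majority points — but the balancing goal is the same as in this sketch.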
Then here we again check the results for all these different techniques, using the F1 score and the AUC score — the two key measures of accuracy — and we compare the scores across the different approaches. We can see that, overall, the models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC AUC score, but its F1 score was too low. Then, in the data scientist's opinion, the random forest with this particular sampling technique has an equilibrium between the F1 and ROC AUC scores. So the takeaway is that the macro F1 score improves dramatically using these sampling techniques, which means these models might be better compared to the balanced ones. Based on all this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons. All right, so then we
continue refining our evaluation work. Here we're going to train the models one more time: we again do a train-test split, we do that for this particular model, and then we print out what is called a classification report. This is basically a summary of all those metrics I talked about just now — remember I said there were several evaluation metrics: the confusion matrix, accuracy, precision, recall, the AUC score. With the classification report I get a summary of all of that, so I can see all the values here for this particular model, the bagging with Tomek links; then I can do the same for another model, the random forest with Borderline-SMOTE; and then for another model, the balanced bagging. So again, we see a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us. Then again we have a confusion matrix: we generate a confusion matrix for the bagging with Tomek-links undersampling, for the random forest with Borderline-SMOTE oversampling, and for the balanced bagging by itself; we compare these three models using the confusion matrices, and we can come to some conclusions. So we look at all the data, and
then we move on and look at another kind of evaluation metric, which is the ROC score. This is one of the other evaluation metrics I talked about: it's a curve, and you look at the area underneath it — this is called the ROC AUC, the area under the curve. The area-under-the-curve score gives us some idea about the threshold we're going to use for classification, and we can examine this for the bagging classifier, for the random forest classifier, and for the balanced bagging classifier.
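Here's a small sketch of how the ROC curve, the AUC, and the decision threshold interact, with toy scores standing in for a real model's predicted probabilities:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9])  # predicted failure probabilities

# The ROC curve traces true-positive rate vs false-positive rate over every
# possible decision threshold; the AUC is the area underneath that curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))

# Moving the decision threshold trades precision against recall:
for t in (0.4, 0.6):
    y_pred = (y_score >= t).astype(int)
    recall = y_pred[y_true == 1].mean()
    print(f"threshold {t}: recall on failures = {recall:.2f}")
```

Lowering the threshold catches more failures (higher recall) at the cost of more false alarms, which is exactly the trade-off the data scientist is weighing below.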
Then we can do that again, and finally we can check the classification report of this particular model. So we keep doing this over and over: evaluating the accuracy metrics — the evaluation metrics — for all these different models, at different classification thresholds, and as we keep drilling into these we get more and more understanding of all these models and which one gives the best performance for our data set. So finally we come to this conclusion: this particular model is not able to push the recall on failures beyond 95.8%; on the other hand, balanced bagging with a decision threshold of 0.6 is able to achieve a better recall, and so on. So finally, after having done all of these evaluations, this is the conclusion.
Right now we have gone through all the steps of the machine learning life cycle, which means we — or the data scientist — have completed all these steps: we have done the validation, the cleaning, exploration, preparation, and transformation, the feature engineering; we have developed and trained multiple models; and we have evaluated all of them. So we have reached this stage, and at this stage we as the data scientist have pretty much completed our job: we've come to some very useful conclusions, which we can now share with our colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision will be made based on the recommendations coming from the data scientist at the end of this phase. So at the end of this phase, the data scientist comes up with these conclusions. The conclusions are: if the engineering team is looking for the highest failure detection rate possible, then they should go with this particular model.
And if they want a balance between precision and recall, then they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold. But if they don't care so much about predicting every failure and they want the highest precision possible, then they should opt for the bagging Tomek-links classifier with a somewhat higher decision threshold. This is the key thing the data scientist is going to deliver — the key takeaway, the end result of the entire machine learning life cycle. The data scientist now tells the engineering team: all right, which is more important for you — point A, point B, or point C? Make your decision. The engineering team will then discuss among themselves and say: hey, you know what, we want the highest failure detection possible, because any kind of failure of that machine or product on the assembly line is really going to hurt us big time. What we're looking for is the model that gives us the highest failure detection rate; we don't care so much about precision, but we want to make sure that if there's a failure, we're going to catch it. That's what they want, and so the data scientist will say: go for the balanced bagging model. Then the data scientist saves this model, and once you have saved it, you can go right ahead and deploy it to
production. And if we want to continue, we can actually take this modeling problem further. Just now I modeled this problem as binary classification, which means it's either zero or one, fail or not fail; but we can also model it as a multiclass classification problem, because, as I said earlier, for the failure type column you actually have multiple kinds of failures — for example, you may have a power failure, a tool wear failure, or an overstrain failure. So we can model the problem slightly differently, as a multiclass classification problem, and then we go through the entire process we went through just now: we create different models and we test them out, but now the confusion matrix is for a multiclass classification problem. So we're going to check them out, try different algorithms or models, again train and test on our data set with a train-test split for these different models — for example, a random forest and a balanced random forest with a grid search — and then you train the models using what is called hyperparameter tuning.
same evaluation scores again you check
out the evaluation scores compare
between them generate a confusion Matrix
so this is a multiclass confusion Matrix
and then you come to the final
conclusion so now if you are interested
to frame your problem domain as a
multiclass classification problem all
right then these are the recommendations
from the data scientist so the data
scientist will say you know what I'm
going to pick this particular model the
balance backing classifier and these are
all the reasons that the data scientist
is going to give as a rational for
selecting this particular
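The flow just described, split the data, tune a candidate model with a grid search, then look at the evaluation scores and the multiclass confusion matrix, can be sketched with scikit-learn roughly as follows. The synthetic dataset, the estimator, and the parameter grid are placeholders, not the actual values from the demonstration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the sensor data: 4 classes ~ 4 failure types.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=4,
                           n_classes=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Hyperparameter tuning via grid search with cross-validation.
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    scoring="recall_macro",  # favour catching failures over precision
    cv=3,
)
grid.fit(X_train, y_train)

# Evaluate the best model on the held-out test set.
y_pred = grid.best_estimator_.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)                                     # multiclass confusion matrix
print(classification_report(y_test, y_pred))  # per-class precision/recall
```

Each row of the multiclass confusion matrix is a true class and each column a predicted class, so the off-diagonal cells show exactly which failure types get confused with each other.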
Once that's done, you save the model, and that's it; that's all done now. The machine learning model can now be put live and run on a server, and the machine learning model is ready to work, which means it's ready to generate predictions. That's the main job of the machine learning model: you have picked the best machine learning model, with the best evaluation metrics for whatever accuracy goal you're trying to achieve, and now you're going to run it on a server.
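Saving the chosen model and loading it back on the serving machine is commonly done with joblib, which ships alongside scikit-learn; the stand-in model and the filename here are illustrative:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Train a stand-in model on synthetic data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Persist the trained model to disk...
joblib.dump(model, "maintenance_model.joblib")

# ...and later, on the server, load it back for serving.
loaded = joblib.load("maintenance_model.joblib")
assert (loaded.predict(X) == model.predict(X)).all()
```

The loaded object behaves exactly like the original fitted estimator, so the serving code only ever needs the `.joblib` file, not the training pipeline.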
Now you're going to get all this real-time data coming in from your sensors, and you're going to pump that into your machine learning model. Your machine learning model will pump out a whole bunch of predictions, and we're going to use those predictions in real time to do real-world decision making. You're going to say: OK, I'm predicting that this machine is going to fail on Thursday at 5:00 p.m., so you'd better get your service folks in to service it on Thursday at 2:00 p.m., or whatever. So you can make decisions on when you want to do your maintenance, and make the best decisions to optimize the cost of maintenance, and so on. And then there are the results coming out of the predictions: the predictions may be good, the predictions may be lousy, the predictions may be average. So we are constantly monitoring how good or how useful the predictions generated by this real-time model running on the server are, and based on our monitoring we will then take some new data and repeat this entire life cycle again. So this is a workflow that's iterative: the data scientist is constantly taking in all these new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. And that's it; that is basically your machine learning workflow in a nutshell.
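A minimal sketch of that serving loop, with hypothetical sensor readings arriving one at a time, might look like this; the stand-in model, the features, and the 0.5 decision threshold are all assumptions for illustration, not part of the original demonstration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train a stand-in model; in production this would be loaded from disk.
X, y = make_classification(n_samples=500, n_features=4, random_state=1)
model = RandomForestClassifier(random_state=1).fit(X, y)

def score_reading(reading):
    """Return the predicted failure probability for one sensor reading."""
    return model.predict_proba(np.asarray(reading).reshape(1, -1))[0, 1]

# Simulated real-time readings streaming in from the sensors.
for reading in X[:3]:
    p = score_reading(reading)
    action = "schedule maintenance" if p > 0.5 else "no action"
    print(f"failure probability {p:.2f} -> {action}")
```

In a real deployment the loop would be fed by the sensor stream, and the monitoring step would log these probabilities against the outcomes that actually occurred, which is the data used to retrain the model in the next iteration of the cycle.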
For this particular approach, we have used a bunch of data science libraries from Python. We have used pandas, which is the most basic data science library, providing all the tools to work with raw data. We have used NumPy, which is a high-performance library for implementing complex array and matrix operations. We have used Matplotlib and Seaborn, which are used during the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data. We have used scikit-learn, which is the machine learning library that implements all your core machine learning algorithms. We have not used deep learning libraries here, because this is not a deep learning problem; but if you are working on a deep learning problem, like image classification, image recognition, object detection, or natural language processing and text classification, then you're going to use these libraries from Python: TensorFlow and also PyTorch.
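Put together, the first cell of a notebook built on the stack described above typically just imports it, conventionally under these aliases:

```python
import numpy as np                # high-performance array/matrix operations
import pandas as pd               # tools for working with raw tabular data
import matplotlib.pyplot as plt   # plotting during the EDA phase
import seaborn as sns             # statistical visualization on top of Matplotlib
from sklearn.model_selection import train_test_split  # scikit-learn: core ML toolkit

print("stack loaded:", np.__version__, pd.__version__)
```

From there, each subsequent notebook cell carries one step of the workflow, interleaved with the data scientist's observations.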
And then lastly, that whole data science project that you saw just now was actually developed in something called a Jupyter notebook. All of this Python code, along with all the observations from the data scientist, for this entire data science project was run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects. So that brings me to the end of this presentation. I hope that you found it useful and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. Thank you all so much for watching.