Hello everyone, my name is Victor. I'm
your friendly neighborhood data
scientist from DreamCatcher. So in this
presentation, I would like to talk about
a specific industry use case of AI or
machine learning which is predictive
maintenance. So I will be covering these
topics and feel free to jump forward to
the specific part in the video where I
talk about all these topics. So I'm going
to start off with a general preview of
AI and machine learning. Then, I'll
discuss the use case which is predictive
maintenance. I'll talk about the basics
of machine learning, the workflow of
machine learning, and then we will come
to the meat of this presentation which
is essentially a demonstration of the
machine learning workflow from end to
end on a real life predictive
maintenance domain problem. All right, so
without any further ado, let's jump into
it. So let's start off with a quick
preview of AI and machine learning. Well
AI is a very general term, it encompasses
the entire area of science and
engineering that is related to creating
software programs and machines that
will be capable of performing tasks
that would normally require human
intelligence. But AI is a catchall term,
so really when we talk about applied AI, how we use AI in our daily work, we are really going to be talking about machine learning. So machine learning is the
design and application of software
algorithms that are capable of learning
on their own without any explicit human
intervention. And the primary purpose of these algorithms is to optimize performance in a specific task. And the primary task that you want to optimize performance in is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine
learning is about making predictions
about the future or what we call
predictive analytics.
And there are many different
kinds of algorithms that are available in
machine learning under the three primary
categories of supervised learning,
unsupervised learning, and reinforcement
learning. And here we can see some of the
different kinds of algorithms and their
use cases in various areas in
industry. So we have various domain use
cases
for all these different kinds of algorithms, and we can see that different algorithms are suited to different use cases.
Deep learning is an advanced form
of machine learning that's based on
something called an artificial neural
network or ANN for short, and this
essentially simulates the structure of
the human brain whereby neurons
interconnect and work together to
process and learn new information. So DL
is the foundational technology for most
of the popular AI tools that you
probably have heard of today. So I'm sure
you have heard of ChatGPT if you haven't
been living in a cave for the past 2
years. And yeah, so ChatGPT is an example
of what we call a large language model
and that's based on this technology
called deep learning. Also, all the modern
computer vision applications where a
computer program can classify images or
detect images or recognize images on
its own, okay, we call this computer
vision applications. They also use
this particular form of machine learning
called deep learning, right? So this is an example of an artificial neural network. For example, here I have an image of a bird that's fed into this artificial neural network, and the output from this artificial neural network is a classification of this image into one of these three potential categories. So in this case, if the ANN has been trained properly, when we feed in this image, the ANN should correctly classify this image as a bird, right? So this is an image
classification problem which is a
classic use case for an artificial
neural network in the field of computer
vision. And just like in the case of
machine learning, there are a variety of
algorithms that are available for
deep learning under the category of
supervised learning and also
unsupervised learning.
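Purely to make that concrete, here is a minimal sketch, not from the presentation itself, of what such an image-classifying ANN can look like in Python with Keras; the 64x64 image size and the three categories are assumptions for illustration.

```python
# A minimal sketch (not the presenter's code) of an ANN image classifier in Keras.
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Input(shape=(64, 64, 3)),       # small RGB input images (assumed size)
    layers.Flatten(),                      # unroll the pixels into one long vector
    layers.Dense(128, activation="relu"),  # a hidden layer of interconnected "neurons"
    layers.Dense(3, activation="softmax"), # one probability per category, e.g. bird/cat/dog
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_images, train_labels, epochs=10)   # train on labelled images
# model.predict(new_image[None, ...])                # highest probability = predicted class
```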
All right, so this is how we can kind of categorize this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a subspecialization of machine learning using a particular architecture called an artificial neural network.
And generative AI: if we talk about ChatGPT, okay, Google Gemini, Microsoft Copilot, okay, all these examples of generative AI, they are basically large language models, and they are a further subcategory within the area of deep learning. And there are many applications
of machine learning in industry right
now, so whichever particular industry you are involved in, these are the specific areas of application, right? So probably, I'm
going to guess the vast majority of you
who are watching this video, you're
probably coming from the manufacturing
industry, and so in the manufacturing
industry some of the standard use cases
for machine learning and deep learning
are predicting potential problems, okay?
So sometimes you call this predictive
maintenance where you want to predict
when a problem is going to happen and
then kind of address it before it
happens. And then monitoring systems,
automating your manufacturing assembly
line or production line, okay, smart
scheduling, and detecting anomalies on your
production line.
Okay, so let's talk about the use
case here which is predictive
maintenance, right? So what is predictive
maintenance? Well predictive maintenance, here's the long definition, is an equipment maintenance strategy that
relies on real-time monitoring of
equipment conditions and data to predict
equipment failures in advance.
And this uses advanced data models,
analytics, and machine learning whereby
we can reliably assess when failures are
more likely to occur, including which
components are more likely to be
affected on your production or assembly
line. So where does predictive
maintenance fit into the overall scheme
of things, right? So let's talk about the
kind of standard way that, you know,
factories or production
lines, assembly lines in factories tend
to handle maintenance issues say
10 or 20 years ago, right? So what you would probably start off with is the most basic mode, which is reactive maintenance. So you just wait until your machine breaks down and then you repair it, right? The simplest,
but, of course, I'm sure if you have worked on a
production line for any period of time,
you know that this reactive maintenance
can give you a whole bunch of headaches
especially if the machine breaks down
just before a critical delivery deadline,
right? Then you're going to have a backlog of orders and you're going to run into a lot of problems. Okay, so we move on to preventive maintenance, which is where you regularly schedule maintenance of your production machines to reduce the failure rate. So you might do maintenance once every month, once every two weeks, whatever. Okay, this is great, but the problem, of course, is that sometimes you're doing too much maintenance that's not really necessary, and it still doesn't totally prevent a failure of the machine that occurs outside of your planned maintenance, right? So a bit of an improvement, but not that much better.
And then, these last two categories is
where we bring in AI and machine
learning. So with machine learning, we're
going to use sensors to do real-time
monitoring of the data, and then using
that data we're going to build a machine
learning model which helps us to predict,
with a reasonable level of accuracy, when
the next failure is going to happen on
your assembly or production line on a
specific component or specific machine,
right? So you want to be able to predict, to a high level of accuracy, maybe down to the specific day, even the specific hour or minute, when you expect that particular product or machine to fail. All
right, so these are the advantages of
predictive maintenance. It minimizes
the occurrence of unscheduled downtime, it
gives you a real-time overview of your
current condition of assets, ensures
minimal disruptions to productivity,
optimizes time you spend on maintenance work,
optimizes the use of spare parts, and so
on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data that you're getting. Okay, so we're
going to take a look at some real life
use cases. So these are a bunch of links
here, so if you navigate to these links
here, you'll be able to get a look at
some real life use cases of machine
learning in predictive maintenance. So the IBM website, okay, gives you a look at five use cases, so you can click on these links and follow up with them if you want to read more. Okay, this
is waste management, manufacturing, okay,
building services, and renewable energy,
and also mining, right? So these are all
use cases, if you want to know more about
them, you can read up and follow them
from this website. And this next one is a pretty good website. I would really encourage you to look through it if you're interested in predictive maintenance. So here, it tells you about, you know, an industry survey of predictive maintenance. We can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, that it is essential for the manufacturing industry, and that it will gain additional strength in the future. So
this is a survey that was done quite
some time ago and this was the results
that we got back. So we can see the vast
majority of key industry players in the
manufacturing sector, they consider
predictive maintenance to be a very
important
activity that they want to
incorporate into their workflow, right?
And we can see here the kind of ROI that
we expect on investment in predictive
maintenance, so 45% reduction in downtime,
25% growth in productivity, 75% fault
elimination, 30% reduction in maintenance
cost, okay? And best of all, if you really
want to kind of take a look at examples,
all right, so there are all these
different companies that have
significantly invested in predictive
maintenance technology in their
manufacturing processes. So PepsiCo, we
have got Frito-Lay, General Motors, Mondi, Ecoplant,
all right? So you can jump over here
and take a look at some of these
use cases. Let me try and open one up, for example, Mondi, right? You can see Mondi has used this particular piece of software called MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. And you can study how they have used it, all right, and how it works: what was their challenge, all right, the problems they were facing, the solution that they built with MathWorks Consulting, and the data that they collected in an Oracle database. So using MATLAB from MathWorks, all right, they were able to create a deep learning model to, you know, solve this particular issue for their domain. So if you're interested, please, I
strongly encourage you to read up on all
these real life customer stories with
showcase use cases for predictive
maintenance. Okay, so that's it for
real life use cases for predictive maintenance.
Now in this topic, I'm
going to talk about machine learning
basics, so what is actually involved
in machine learning, and I'm going to
give a very quick, fast, conceptual, high
level overview of machine learning, all
right? So there are several categories of
machine learning, supervised, unsupervised,
semi-supervised, reinforcement, and deep
learning, okay? And let's talk about the
most common and widely used category of
machine learning which is called
supervised learning. So the particular use
case here that I'm going to be
discussing, predictive maintenance, it's
basically a form of supervised learning.
So how does supervised learning work?
Well in supervised learning, you're going
to create a machine learning model by
providing what is called a labelled data set as an input to a machine learning program or algorithm. And this dataset is going to contain what are called independent or feature variables, all right, so this will be a set of variables.
And there will be one dependent or
target variable which we also call the
label, and the idea is that the
independent or the feature variables are
the attributes or properties of your
data set that influence the dependent or
the target variable, okay? So this process
that I've just described is called
training the machine learning model, and
the model is fundamentally a
mathematical function that best
approximates the relationship between
the independent variables and the
dependent variable. All right, so that's
quite a bit of a mouthful, so let's jump
into a diagram that maybe illustrates
this more clearly. So let's say you have
a dataset here, an Excel spreadsheet,
right? And this Excel spreadsheet has a
bunch of columns here and a bunch of
rows, okay? So these rows here represent
observations, or these rows are what
we call observations or samples or data
points in our data set, okay? So let's
assume this data set is gathered by a
marketing manager at a mall, at a retail
mall, all right? So they've got all this
information about the customers who
purchase products at this mall, all right?
So some of the information they've
gotten about the customers are their
gender, their age, their income, and the
number of children. So all this
information about the customers, we call
this the independent or the feature
variables, all right? And based on all this information about the customer, we also record the information about how much the customer spends, all right? So these numbers here, we call this the target variable or the dependent variable, right? So one single row, one single sample, one single data point, contains all the values for the feature variables and one single value for the label or the target variable, okay? And the primary purpose of
the machine learning model is to create
a mapping from all your feature
variables to your target variable, so
somehow there's going to be a function,
okay, this will be a mathematical
function that maps all the values of
your feature variable to the value of
your target variable. In other words, this
function represents the relationship
between your feature variables and your
target variable, okay? So this whole thing, this training process, we call this fitting the model. And the target
variable or the label, this thing here,
this column here, or the values here,
these are critical for providing a
context to do the fitting or the
training of the model. And once you've
got a trained and fitted model, you can
then use the model to make an accurate
prediction of target values
corresponding to new feature values that
the model has yet to encounter or yet to
see, and this, as I've already said
earlier, this is called predictive
analytics, okay? So let's see what's actually happening here. You take your training data, all right, this whole data set here consisting of a thousand rows of data, 10,000 rows of data, and you jam it into your machine learning algorithm, and a couple of hours later your machine learning algorithm comes up with a model. And the model is
essentially a function that maps all
your feature variables which is these
four columns here, to your target
variable which is this one single column
here, okay? So once you have the model, you
can put in a new data point. So basically
the new data point represents data about a
new customer, a new customer that you
have never seen before. So let's say
you've already got information about
10,000 customers that have visited this
mall and how much each of these 10,000
customers have spent when they are at this
mall. So now you have a totally new
customer that comes in the mall, this
customer has never come into this mall
before, and what we know about this
customer is that he is a male, the age is
50, the income is 18, and they have nine
children. So now when you take this data
and you pump that into your model, your
model is going to make a prediction. It's going to say, hey, you know what? Based on everything that I have been trained on before and based on the model I've developed, I am going to predict that a customer of male gender, of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what
you want. Right there, right here,
can you see here? That is the final
output of your machine learning model.
It's going to make a prediction about
something that it has not ever seen
before, okay? That is the core, this is
essentially the core of machine learning.
Predictive analytics, making prediction
about the future
based on a historical data set.
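As a rough sketch of this whole train-then-predict loop in Python with scikit-learn; the tiny dataset below is made up purely for illustration, it is not the presenter's actual data.

```python
# A toy sketch of supervised learning on the mall example (made-up numbers).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "gender":   ["M", "F", "F", "M", "F", "M"],
    "age":      [50, 31, 42, 23, 64, 38],
    "income":   [18, 40, 35, 22, 50, 30],
    "children": [9, 1, 2, 0, 3, 2],
    "spend":    [25, 80, 60, 45, 90, 55],   # the label / target variable
})
X = pd.get_dummies(df.drop(columns="spend"))   # feature variables (gender encoded as numbers)
y = df["spend"]

model = RandomForestRegressor(random_state=42).fit(X, y)   # "training" / "fitting" the model

# A brand-new customer the model has never seen: male, age 50, income 18, nine children.
new = pd.get_dummies(pd.DataFrame(
    [{"gender": "M", "age": 50, "income": 18, "children": 9}]
)).reindex(columns=X.columns, fill_value=0)
print(model.predict(new))   # the model's prediction of how much this customer will spend
```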
Okay, so there are two areas of
supervised learning, regression and
classification. So regression is used to
predict a numerical target variable, such
as the price of a house or the salary of
an employee, whereas classification is
used to predict a categorical target
variable or class label, okay? So for
classification you can have either
binary or multiclass, so, for example,
binary will be just true or false, zero
or one. So whether your machine is going to fail or not going to fail, right? So just two classes, two possible outcomes. Or is the customer going to
make a purchase or is the customer not
going to make a purchase. We call this
binary classification. And then for
multiclass, when there are more than two
classes or types of values. So, for
example, here this would be a
classification problem. So if you have a
data set here, you've got information
about your customers, you've got your
gender of the customer, the age of the
customer, the salary of the customer, and
you also have record about whether the
customer made a purchase or not, okay? So
you can take this data set to train a
classification model, and then the
classification model can then make a
prediction about a new customer, and
it's going to predict zero, which means the customer didn't make a purchase, or one, which means the customer made a purchase, right? And regression,
this is regression, so let's say you want
to predict the wind speed, and you've got
historical data about all these four
other independent variables or feature
variables, so you have recorded
temperature, the pressure, the relative
humidity, and the wind direction for the
past 10 days, 15 days, or whatever, okay? So
now you are going to train your machine
learning model using this data set, and
the target variable column, okay, this
column here, the label is basically a
number, right? So now with this number,
this is a regression model, and so now
you can put in a new data point, so a new
data point means a new set of values for
temperature, pressure, relative humidity,
and wind direction, and your machine
learning model will then predict the
wind speed for that new data point, okay?
So that's a regression model.
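And here is a comparable minimal sketch for the binary classification case described above, again with made-up numbers purely for illustration.

```python
# A toy sketch of binary classification: will a customer make a purchase (1) or not (0)?
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({
    "gender":    [1, 0, 0, 1, 1, 0],       # already encoded: 1 = male, 0 = female
    "age":       [25, 47, 52, 33, 61, 29],
    "salary":    [30, 90, 40, 60, 75, 35],
    "purchased": [0, 1, 0, 1, 1, 0],       # the categorical target variable
})
X, y = df.drop(columns="purchased"), df["purchased"]

clf = LogisticRegression().fit(X, y)       # train the classification model

new_customer = pd.DataFrame([[1, 40, 55]], columns=X.columns)
print(clf.predict(new_customer))           # outputs 0 (no purchase) or 1 (purchase)
```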
All right. So in this particular topic I'm going to talk about the workflow that's involved in machine learning. So
in the previous slides, I talked about
developing the model, all right? But
that's just one part of the entire
workflow. So in real life when you use
machine learning, there's an end-to-end
workflow that's involved. So the first
thing, of course, is you need to get your
data, and then you need to clean your
data, and then you need to explore your
data. You need to see what's going on in
your data set, right? And real life data sets are not trivial: they are hundreds of rows, thousands of rows, sometimes millions or billions of rows, we're talking about millions or billions of data points, especially if you're using IoT sensors to get data in real time. So you've got all these
super large data sets, you need to clean
them, and explore them, and then you need
to prepare them into a right format so
that you can put them into the training
process to create your machine learning
model, and then subsequently you check
how good is the model, right? How accurate
is the model in terms of its ability to
generate predictions for the
future, right? How accurate are the
predictions that are coming up from your
machine learning model. So that's
validating or evaluating your model, and
then subsequently if you determine that
your model is of adequate accuracy to
meet whatever your domain use case
requirements are, right? So let's say the
accuracy that's required for your domain
use case is
85%, okay? If my machine learning model can give an 85% accuracy rate, I think it's good enough, then I'm going to deploy it into a real-world use case. So
here the machine learning model gets
deployed on the server, and then, you know, data from other sources is going to be captured from somewhere. That data is pumped into the machine learning model. The
machine learning model generates
predictions, and those predictions are
then used to make decisions on the
factory floor in real time or in any
other particular scenario. And then you
constantly monitor and update the model,
you get more new data, and then the
entire cycle repeats itself. So that's
your machine learning workflow, okay, in a
nutshell. Here's another example of
the same thing maybe in a slightly
different format, so, again, you have your
data collection and preparation. Here we
talk more about the different kinds of algorithms that are available to create a
model, and I'll talk about this more in
detail when we look at the real world
example of a end-to-end machine learning
workflow for the predictive maintenance
use case. So once you have chosen the appropriate algorithm, you then train your model, and you then select the appropriate trained model among the multiple models. You are probably going to develop multiple models from multiple algorithms, you're going to evaluate them all, and then you're going to say, hey, you know what? After I've evaluated and tested them all, I've chosen the best model, and I'm going to deploy the model, all right, so this is
for real life production use, okay? Real
life sensor data is going to be pumped
into my model, my model is going to
generate predictions, the predicted data is going to be used immediately in real time for real life decision making, and then I'm going to monitor, right, the results. So somebody's using the predictions from my model; if the predictions are lousy, the monitoring system captures that. If the predictions are fantastic, well, that is also captured by the monitoring system, and that gets fed back again into the next cycle of my
machine learning
pipeline. Okay, so that's the kind of
overall view, and here are the kind of
key phases of your workflow. So one of
the important phases is called EDA, exploratory data analysis, and in this particular phase, you're going to do a lot of stuff, primarily just to
understand your data set. So like I said,
real life data sets, they tend to be very
complex, and they tend to have various
statistical properties, all right,
statistics is a very important component
of machine learning. So an EDA helps you
to kind of get an overview of your data
set, get an overview of any problems in
your data set like any data that's
missing, the statistical properties of your
data set, the distribution of your data
set, the statistical correlation of
variables in your data set, etc,
etc. Okay, then we have data cleaning or
sometimes you call it data cleansing, and
in this phase what you want to do is
primarily, you want to kind of do things
like remove duplicate records or rows in
your table, you want to make sure that
your data or your data
points or your samples have appropriate IDs,
and most importantly, you want to make
sure there's not too many missing values
in your data set. So what I mean by
missing values are things like that,
right? You have got a data set, and for
some reason there are some cells or
locations in your data set which are
missing values, right? And if you have a
lot of these missing values, then you've
got a poor quality data set, and you're
not going to be able to build a good
model from this data set. You're not
going to be able to train a good machine
learning model from a data set with a
lot of missing values like this. So you
have to figure out whether there are a
lot of missing values in your data set,
how do you handle them. Another thing
that's important in data cleansing is
figuring out the outliers in your data
set. So outliers are things like this,
you know, data points that are very far from
the general trend of data points in your
data set, right? And so there are also
several ways to detect outliers in your
data set, and there are several ways to
handle outliers in your data set.
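As a rough, minimal sketch of what these two cleansing tasks can look like in pandas (the file and column names here are hypothetical, not from any particular dataset):

```python
# A rough sketch of handling missing values and outliers with pandas.
import pandas as pd

df = pd.read_csv("sensor_data.csv")   # hypothetical raw data

df = df.drop_duplicates()             # remove duplicate rows
print(df.isna().sum())                # how many missing values per column?

# Option 1: fill missing values with the column average; option 2: drop the rows.
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
df = df.dropna(subset=["torque"])

# Detect outliers with the IQR rule: points far outside the middle 50% of values.
q1, q3 = df["torque"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["torque"] < q1 - 1.5 * iqr) | (df["torque"] > q3 + 1.5 * iqr)
df = df[~outliers]                    # one possible way to handle them: drop them
```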
Similarly, there are several ways to handle missing values in your data set. So handling missing values and handling outliers are really the two key concerns of data cleansing, and there are many, many techniques to handle them, so a data scientist needs to be acquainted with all of this. All right, why do I need to do data cleansing? Well, here is the key point.
If you have a very poor quality data set, which means you've got a lot of outliers, which are errors in your data set, or you've got a lot of missing values in your data set, then even though you've got a fantastic algorithm and a fantastic model, the predictions that your model gives are going to be absolutely rubbish. It's kind of like putting water into the tank of a Mercedes-Benz. The Mercedes-Benz is a great car, but if you put water into it, it will just die, right? Your car will just die, it can't run on water, right? On the other hand, a Myvi is just a cheap, basic car, but if you take good, high-octane petrol and put it into a Myvi, the Myvi will just go at, you know, 100 miles an hour. It would completely destroy the Mercedes-Benz in terms of performance. So it doesn't really matter what model you're using here, right? You can be using the most fantastic model, the Mercedes-Benz of machine learning, but if your data is lousy quality, your predictions are also going to be rubbish,
okay? So cleansing the data set is, in fact, probably the most important thing that data scientists need to do, and that's what they spend most of their time doing. Building the model, training the model, getting the right algorithms, and so on, that's really a small portion of the actual machine learning workflow, right? In the actual machine learning workflow, the vast majority of the time is spent on cleaning and organizing your data. Then you have something called feature engineering, which is where you preprocess the feature variables of your original data set prior to using them to train the model, and this is either through addition, deletion, combination, or transformation of these variables. And then the idea is you want
to improve the predictive accuracy of
the model, and also because some models
can only work with numeric data, so you
need to transform categorical data into
numeric data. All right, so just now, in the earlier slides, I showed you that you take your original data set, you pump it into an algorithm, and then a couple of hours later, you get a machine learning model, right? So you didn't do anything to the feature variables in your data set before you pumped it into the machine learning algorithm; you just took the data set exactly as it is, pumped it into the algorithm, and a couple of hours later, you got a model, right? But that's not what generally happens in real life. In real life,
you're going to take all the original
feature variables from your data set and
you're going to transform them in some
way. So you can see here these are the
columns of data from my original data set,
and before I actually put all these data
points from my original data set into my
algorithm to train and get my model, I
will actually transform them, okay? So the
transformation of these feature variable
values, we call this feature engineering.
And there are many, many techniques to do
feature engineering, so one-hot encoding,
scaling, log transformation,
discretization, date extraction, boolean
logic, etc, etc.
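Here is a minimal sketch of two of these techniques, one-hot encoding and MinMax scaling, with hypothetical column names:

```python
# A minimal sketch of two common feature-engineering steps.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "age":    [50, 31, 42, 23],
    "income": [18, 40, 35, 22],
})

df = pd.get_dummies(df, columns=["gender"])   # one-hot encoding: text -> numeric columns
df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])  # scale into [0, 1]
print(df)
```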
Okay, then finally we do something called a train-test split, where we take our original dataset, right? So this was the original dataset, and we break it into two parts: one is called the training dataset and the other is called the test dataset. And the primary purpose of this is that when we feed and train the machine learning model, we're going to use what is called the training dataset, and when we want to evaluate the accuracy of the model, we use the test dataset. So this is a key part of your machine learning life cycle, because you are not going to have only one possible model, because there is a vast range of algorithms that you can use to create a model. So fundamentally you have a wide
range of choices, right, like wide range
of cars, right? You want to buy a car, you can buy a Myvi, you can buy a Perodua, you can buy a Honda, you can buy a Mercedes-Benz, you can buy an Audi, you can buy a Beamer, many, many different cars that are available to you if you want to buy a car, right? Same thing. With a
machine learning model there are a vast
variety of algorithms that you can
choose from in order to create a model,
and so once you create a model from a
given algorithm you need to say, hey, how
accurate is this model that I've created
from this algorithm. And different
algorithms are going to create different
models with different rates of accuracy.
And so the primary purpose of the test
dataset is to evaluate the accuracy
of the model to see hey, is this model
that I've created using this algorithm,
is it adequate for me to use in a real
life production use case? Okay? So that's
what it's all about. Okay, so this is my original dataset. I break it into my feature variable columns and my target variable column, and then I further break it into a training dataset and a test dataset. The training dataset is used to train, to create, the machine learning model. And then once the machine learning model is created, I then use the test dataset to evaluate the accuracy of the machine learning model.
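A minimal sketch of that split with scikit-learn's train_test_split, reusing the hypothetical customer example from earlier:

```python
# A minimal sketch of a train-test split with scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers.csv")     # hypothetical dataset
X = df.drop(columns="spend")          # the feature variable columns
y = df["spend"]                       # the target variable column

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # e.g. 80% for training, 20% held back for testing
    random_state=42,  # fixed seed so the split is reproducible
)
# model.fit(X_train, y_train)    # train only on the training subset
# model.score(X_test, y_test)    # evaluate accuracy on data the model has not seen
```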
All right. And then finally we can
see what are the different parts or
aspects that go into a successful model,
so EDA about 10%, data cleansing about
20%, feature engineering about
25%, selecting a specific algorithm about
10%, and then training the model from
that algorithm about 15%, and then
finally evaluating the model, deciding
which is the best model with the highest
accuracy rate, that's about 20%.
All right, so we have reached the
most interesting part of this
presentation which is the demonstration
of an end-to-end machine learning workflow
on a real life dataset that
demonstrates the use case of predictive
maintenance. So for the data set for
this particular use case, I've used a
data set from Kaggle. So for those of you who are not aware of this, Kaggle is the
world's largest open-source community
for data science and AI, and they have a
large collection of datasets from all
various areas of industry and human
endeavor, and they also have a large
collection of models that have been
developed using these data sets. So here
we have a data set for the particular
use case, predictive maintenance, okay? So
this is some information about the data set, so in case you do not know how to get there, this is the URL to click on, okay, to get to that dataset. So once you're at the page for this dataset, you can see all the information about this data set, and you can download the data set in CSV format.
Okay, so let's take a look at the
dataset. So this dataset has a total of
10,000 samples, okay? And these are the
feature variables, the type, the product
ID, the air temperature, process
temperature, rotational speed, torque, tool
wear, and this is the target variable,
all right? So the target variable is what
we are interested in, what we are
interested in using to train the machine
learning model, and also what we are
interested to predict, okay? So these are
the feature variables, they describe or
they provide information about this
particular machine on the production
line, on the assembly line, so you might
know the product ID, the type, the air
temperature, process temperature,
rotational speed, torque, tool wear, right? So
let's say you've got an IoT sensor system that's basically capturing all this data about a product or a machine on your production or assembly line, okay? And you've also captured information about whether, for a specific sample, that sample experienced a failure or not, okay? So the target value
of zero, okay, indicates that there's no
failure. So zero means no failure, and we
can see that the vast majority of data
points in this data set are no failure.
And here we can see an example here
where you have a case of a failure, so a
failure is marked as a one, positive, and
no failure is marked as zero, negative,
all right? So here we have one type of a
failure, it's called a power failure. And if you scroll down the data set, you see there are also other kinds of failures, like a tool wear failure; we have an overstrain failure here, for example; we also have a power failure again,
and so on. So if you scroll down through
these 10,000 data points, or if
you're familiar with using Excel to
filter out values in a column, you can
see that in this particular column here
which is the so-called target variable
column, you are going to have the vast
majority of values as zero which means
no failure, and for some of the rows or data points you are going to have a value of one. And for those rows that have a value of one, for example here, you are going to have different types of failures, so like I said just now, power failure, tool wear failure, etc, etc. So we are
going to go through the entire machine
learning workflow process with this dataset.
So to see an example of that, we're going to go to the code section here, all right, so if I click on the code section here, right down here we can see what is called a dataset notebook. So this is basically a Jupyter notebook. Jupyter is an application which allows you to create a Python machine learning program that builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it, okay? So here we have
a whole bunch of Jupyter notebooks that
are available, and you can select any one
of them. All these notebooks are
essentially going to process the data
from this particular dataset. So if I go
to this code page here, I've actually
selected a specific notebook that I'm
going to run through to demonstrate an
end-to-end machine learning workflow using
various machine learning libraries from
the Python programming language, okay? So
the particular notebook I'm going to
use is this particular notebook here, and
you can also get the URL for that
particular notebook from here.
Okay, so let's quickly do a revision. What are we trying to do
here? We're trying to build a machine
learning classification model, right? So
we said there are two primary areas of
supervised learning, one is regression
which is used to predict a numerical
target variable, and the second kind of
supervised learning is classification
which is what we're doing here. We're
trying to predict a categorical target
variable, okay? So in this particular
example, we actually have two kinds of
ways we can classify, either a binary
classification or a multiclass
classification. So for binary
classification, we are only going to
classify the product or machine as
either it failed or it did not fail, okay?
So if we go back to the dataset that I
showed you just now, if you look at this
target variable column, there are only
two possible values here. They are either
zero or one. Zero means there's no failure.
One means there's a failure, okay? So this
is an example of a binary classification.
Only two possible outcomes, zero or one,
didn't fail or fail, all right? Two
possible outcomes. And then, for the same dataset, we can extend it and make it a multiclass classification problem, all right? So if we want to drill down further, we can say not only that there is a failure, we can actually say there are different types of failures, okay? So we have one category of
class that is basically no failure, okay?
Then we have a category for the
different types of failures, right? So you
can have a power failure, you could have a tool wear failure, you could have, let's go down here, an overstrain failure, etc, etc. So you can have
multiple classes of failure in addition
to the general overall or the majority
class of no failure, and that would be a
multiclass classification problem. So
with this data set, we are going to see
how to make it a binary classification
problem and also a multiclass
classification problem. Okay, so let's
look at the workflow. So let's say we've
already got the data, so right now we do
have the dataset. This is the dataset
that we have, so let's assume we've
somehow managed to get this dataset from some IoT sensors that are monitoring real-time data in our production environment. On the assembly line, on the production line, we've got sensors reading data that gives us all this data that we have in this CSV file.
Okay, so we've already got the data, we've
retrieved the data, now we're going to go
on to the cleaning and exploration part
of your machine learning life cycle. All
right, so let's look at the data cleaning part. So in the data cleaning part, we're interested in checking for missing values and maybe removing the rows with missing values, okay? So among the kinds of things we can do with missing values, we can remove the rows with missing values, or we can put in some replacement values, which could be the average of all the values in that particular column, etc, etc. We could also try to identify outliers in our data set, and there are a variety of ways to deal with that too. So this is called data
cleansing which is a really important
part of your machine learning workflow,
right? So that's where we are now at,
we're doing cleansing, and then we're
going to follow up with
exploration. So let's look at the actual
code that does the cleansing here. So
here we are right at the start of the
machine learning life cycle here, so
this is a Jupyter notebook. So here we
have a brief description of the problem statement, all right? So this dataset reflects real life predictive maintenance scenarios encountered in industry, with measurements from real equipment. The feature descriptions are taken directly from the dataset source. So here we have a description of the six key features in our dataset: the type, which is the quality of the product; the air temperature; the process temperature; the rotational speed; the torque; and the tool wear, all right? So these are the six feature variables, and there are two target variables, so
I showed you just now there's one target variable which only has two possible values, either zero or one, okay? Zero or one means no failure or failure, so that will be this column here, right? So let me go all the way back up to here. So this column here, we already saw it only has two possible values, it's either zero or one. And then we also have this other column here, and this column is basically the failure type. And as I already demonstrated just now, we do have several categories of types of failure, and so here we call this multiclass
classification. So we can either build a binary classification model for this problem domain, or we can build a multiclass classification model, all right. So this Jupyter notebook is going to demonstrate both approaches to us. So as a first step, we are going to write all this Python code that's going to import all the libraries that we need to use, okay? So this is basically Python code, okay, and it's importing the relevant machine learning libraries related to our domain use case, okay? Then we load in our dataset, okay, so this is our dataset.
We describe it, we get some quick insights into the dataset. And then we just take a look at all the feature variables, etc, and so on.
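In code, that first look might be as simple as this sketch, assuming the downloaded Kaggle CSV file is named predictive_maintenance.csv:

```python
# A quick sketch of loading and overviewing the dataset with pandas.
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

print(df.shape)       # number of rows (10,000) and columns
print(df.dtypes)      # column names and their data types
print(df.describe())  # quick statistical summary of the numeric columns
print(df["Failure Type"].value_counts())  # counts of each failure category
```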
What we're doing now is just a quick overview of the dataset, so all this Python code here that we're writing is allowing us, the data scientists, to get a quick overview of our dataset, right, okay, like how many rows are there, how many columns are there, what are the data types of the columns, what are the names of the columns, etc, etc. Okay, then we zoom in on the
target variables. So we look at the
target variables, how many counts
there are of this target variable, and
so on. How many different types of
failures there are. Then you want to
check whether there are any
inconsistencies between the target and
the failure type, etc. Okay, so when you do all this checking, you're going to discover there are some discrepancies in your dataset. So using specific Python code to do the checking, you're going to say, hey, you know what? There are some errors here, right? There are nine values that are classified as failure in the target variable, but as no failure in the failure type variable, so that means there's a discrepancy in those data points, right?
So these are all the ones that are discrepancies, because the target variable says one, and we already know that a target variable of one is supposed to mean there is a failure, right? So we are expecting to see a failure classification, but some rows actually say there's no failure although the target is one. Well, here is a classic example of an error that can very well occur in a dataset, so now the question is what do you do with these errors in your dataset, right? So here the data scientist says, I think it would make sense to remove those instances, and so they write some code to remove those instances, those rows or data points, from the overall data set. And same thing, we can, again,
check for other issues. So we find there's another issue here with our data set, another warning, so, again, we can possibly remove those rows. So in total you're going to remove 27 instances or rows from your overall data set. Your data set has 10,000 rows or data points, and you're removing 27, which is only 0.27% of the entire dataset. And these were the reasons why you removed them, okay? So if you're just removing 0.27% of the entire dataset, no big deal, right? Still okay, but you needed to remove them, because these 27 data points with errors in your dataset could really affect the training of your machine learning model.
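A sketch of what that removal can look like, assuming the Kaggle column names "Target" and "Failure Type":

```python
# A sketch of the consistency check: rows marked as a failure in Target
# but labelled "No Failure" in Failure Type are contradictory.
inconsistent = (df["Target"] == 1) & (df["Failure Type"] == "No Failure")
print(inconsistent.sum())   # the discrepant rows flagged above

df = df[~inconsistent]      # remove those instances from the dataset
```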
So we need to do our data cleansing, right? We are now actually cleansing data that is incorrect or erroneous in our original dataset. Okay, so then we go on to the
next part which is called EDA, right? So
EDA is where we kind of explore our data,
and we want to, kind of, get a visual
overview of our data as a whole, and also
take a look at the statistical
properties of our data: the statistical distribution of the data in all the various columns, the correlation between the different feature variable columns, and also between the feature variables and the target variable. So all of this is called EDA, and EDA in a machine learning workflow is typically done through visualization,
all right? So let's go back here and take
a look, right? So, for example, here we are
looking at correlation, so we plot the
values of all the various feature
variables against each other and look
for potential correlations and patterns
and so on. And all the different shapes
that you see here in this pair plot, okay,
will have different meaning,
statistical meaning, and so the data
scientist has to, kind of, visually
inspect this pair plot, make some
interpretations of these different
patterns that he sees here, all right. So
these are some of the insights that
can be deduced from looking at these
patterns: so, for example, the torque and rotational speed are highly correlated, the process temperature and air temperature are also highly correlated, and failures occur at extreme values of
some features, etc, etc. Then you can plot certain kinds of charts. This is called a violin chart, and it, again, gives new insights. For example, regarding the torque and rotational speed, we can see, again, that most failures are triggered at values much lower or much higher than the mean for when they're not failing. So all these visualizations are there, and a trained data scientist can look at them, inspect them, and make some kind of insightful deductions from them, okay?
Percentage of failure, right? The
correlation heat map, okay, between all
these different feature variables, and
also the target
variable, okay? The product types, the percentage of each product type, the percentage of failure with respect to the product type, we can also kind of visualize all that as well. So certain products have a higher ratio of failure compared to other product types, etc. For example, M products tend to fail more than H products, etc, etc. So we can create a vast variety of
visualizations in the EDA stage, so you
can see here. And, again, the idea of this
visualization is just to give us some
insight, some preliminary insight into
our dataset that helps us to model it
more correctly. So some more insights
that we get into our data set from all
this visualization.
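A rough sketch of those EDA plots with seaborn, with column names assumed from the Kaggle CSV:

```python
# A rough sketch of typical EDA visualizations with seaborn.
import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

sns.pairplot(df, vars=num_cols, hue="Target")              # pairwise patterns/correlations
plt.show()

sns.heatmap(df[num_cols + ["Target"]].corr(), annot=True)  # correlation heat map
plt.show()

sns.boxplot(x=df["Rotational speed [rpm]"])                # box plot to spot outliers
plt.show()
```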
Then we can plot the distributions, so we can see whether it's a normal distribution or some other kind of distribution. We can have a box plot to see whether there are any outliers in the data set and so on, right? So from the box plots, we can see that rotational speed does have outliers. And we already saw that outliers are basically a problem that you may need to tackle, right? Outliers are an issue, it's a part of data cleansing. So we may have to check, okay, where are the potential outliers, and we can analyze them from the box plot, okay? But then we can say, well, they are outliers, but maybe they're not really horrible outliers, so we can tolerate them, or maybe we want to remove them. So we can see the mean and maximum values for all of these with respect to product type, and how they relate to the product type in terms of the maximum and minimum, okay, and so on. So the insight is, well, about 4.87% of the instances are outliers, and maybe 4.87% is not really that much, the outliers are not horrible, so we just leave them in the dataset. Now for a
different dataset, the data scientist
could come to a different conclusion, so
then they would do whatever they've
deemed is appropriate to, kind of, cleanse
the dataset. Okay, so now that we have done all the EDA, the next thing we're going to do is what is called feature engineering. So we are going to transform our original feature variables, and these are our original feature variables, right? We're going to transform them, in some sense, into some other form before we feed this in for training into our machine learning algorithm, all right? So let's say this is an example of an original data set, right? And these are some examples, you don't have to use all of them, but these are some examples of what we call feature engineering, whereby you transform the original values of your feature variables into all these transformed values here. So we're going to pretty much do that here: we have an ordinal encoding, we do scaling of the data so the dataset is scaled, we use MinMax scaling, and then finally, we come
to do the modeling. So we have to split our dataset into a training dataset and a test dataset. So coming back to here again, we said that before you train your model, you have to take your original dataset, which is now a feature-engineered dataset, and we're going to break it into two or more subsets, okay. One is called the training dataset, which we use to feed and train a machine learning model. The second is the test dataset, to evaluate the accuracy of the model, okay? So we've got the training dataset, the test dataset, and we also need
to sample. From our original data set, we need to sample some points that go into the training dataset and some points that go into the test dataset. So there are many ways to do sampling. One way is to do stratified sampling, where we ensure the same proportion of data from each stratum or class, because right now we have a multiclass classification problem, so you want to make sure the proportion of data from each stratum or class in the training and test datasets is the same as in the original dataset, which is very useful for dealing with what is called an imbalanced dataset. So here we have an example of what is called an imbalanced dataset, in the sense that the vast majority of data points in your data set are going to have the value of zero for their target variable column, and only an extremely small minority of the data points in your dataset will actually have the value of one for their target variable column, okay?
So a situation where the vast majority of values in your class or target variable column are from one class and a tiny minority are from another class, we call this an imbalanced dataset. And for an imbalanced dataset, we will typically use a specific technique to do the train-test split, which is called stratified sampling, and that's exactly what's happening here: we are doing a train-test split, and we are doing it as a stratified split.
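A minimal sketch of such a stratified split with scikit-learn (the dropped ID columns are assumed names from the Kaggle CSV):

```python
# A minimal sketch of a stratified train-test split, so every class keeps
# the same proportion in both subsets -- important for imbalanced data.
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.get_dummies(df.drop(columns=["UDI", "Product ID", "Target", "Failure Type"]))
y = df["Failure Type"]    # the multiclass target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

print(y_train.value_counts(normalize=True))  # class proportions should match...
print(y_test.value_counts(normalize=True))   # ...between the two subsets
```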
And then we actually develop the models. So now we've got the train-test split, and here is where we actually train the models.
Now in terms of classification, there are a whole bunch of possibilities, right, that you can use. There are many, many different algorithms that we can use to create a classification model. These are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, ensembles. So all these are different algorithms which will create different kinds of models, which will result in different accuracy measures, okay? So it's the goal of the data scientist to find the model that gives the best accuracy when trained on the given dataset; a minimal sketch of this is shown below.
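As a rough sketch, with scikit-learn classifiers standing in for the notebook's exact choices, training and comparing several candidates could look like this:

```python
# Train several candidate classifiers on the same data and compare them.
# The notebook also uses imbalanced-learn's BalancedBaggingClassifier.
from sklearn.ensemble import (RandomForestClassifier, BaggingClassifier,
                              GradientBoostingClassifier)
from sklearn.metrics import f1_score

models = {
    "random forest": RandomForestClassifier(random_state=42),
    "bagging":       BaggingClassifier(random_state=42),
    "boosting":      GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)                  # train this candidate
    pred = model.predict(X_test)                 # predict on unseen data
    print(name, f1_score(y_test, pred, average="macro"))  # one accuracy measure
```

So let's head back again to our machine learning workflow. So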
here, basically, what I'm doing is creating a whole bunch of models, all right? So one is a random forest, one is balanced bagging, one is a boosting classifier, one's an ensemble classifier, and I am going to train a model using each of these algorithms. And then I'm going to evaluate them, okay? I'm going to evaluate how good each of these models is. And here you can see the evaluation data, right? Okay, and this is
the confusion matrix which is another
way of evaluating. So now we come to the key part here, which is how do I distinguish between all these models, right? I've got all these different models, built with different algorithms, which I'm training on the same dataset; how do I distinguish between all of them, okay? And for that, we actually have a whole bunch of common evaluation metrics for classification, right? So these evaluation metrics tell us how good a model is in terms of its accuracy in classification. Now in terms of accuracy, we actually have many different measures, right? You might think, well, accuracy is just accuracy, either it's accurate or it's not accurate, right? But actually it's not that simple. There are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. So, for example,
the confusion matrix tells us how many true positives there are, meaning the value is positive and the prediction is positive; how many false positives, which means the value is negative but the machine learning model predicts positive; how many false negatives, which means the machine learning model predicts negative but it's actually positive; and how many true negatives, which means the machine learning model predicts negative and the true value is also negative. So this is called a confusion matrix. This is one way we assess or evaluate the performance of a classification model,
okay? This is for binary classification; we can also have a multiclass confusion matrix,
and then we can also measure things like accuracy. So accuracy is the true positives plus the true negatives, which is the total number of correct predictions made by the model, divided by the total number of data points in your dataset. And then you also have other kinds of measures such as recall. And this is the formula for recall, and this is the formula for the F1 score, okay? And then there's
something called the ROC curve, right? So without going too much into the detail of what each of these entails, essentially these are all different KPIs, right? Just like if you work in a company, you have different KPIs, right? Certain employees have certain KPIs that measure how good or how, you know, efficient or effective a particular employee is, right? So the KPIs for your machine learning models are the ROC curve, F1 score, recall, accuracy, okay, and your confusion matrix. So
fundamentally, after I have built my four different models, I'm going to check and evaluate them using all those different metrics, like, for example, the F1 score, the precision score, and the recall score, all right. So for this model, I can check out the ROC score, the F1 score, the precision score, the recall score; then for this model, the ROC score, the F1 score, the precision score, the recall score; then for this model, and so on. So for every single model I've created using my training data set, I will have my whole set of evaluation metrics that I can use to evaluate how good that model is, okay?
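A sketch of computing those metrics with scikit-learn for one of the trained models:

```python
# A sketch of the common evaluation metrics for one trained model.
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)

pred = models["random forest"].predict(X_test)

print(confusion_matrix(y_test, pred))                   # TP / FP / FN / TN counts
print(accuracy_score(y_test, pred))                     # (TP + TN) / all predictions
print(precision_score(y_test, pred, average="macro"))
print(recall_score(y_test, pred, average="macro"))
print(f1_score(y_test, pred, average="macro"))
print(classification_report(y_test, pred))              # all of the above in one summary
```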
Same thing here, I've got a confusion matrix here, right, so I can use that, again, to evaluate between all these four different models, and then I kind of summarize it all up here. So we can see from this summary that there are actually two top models, and as a data scientist, I'm now going to just focus on these two models. So these two models are the bagging classifier and the random forest classifier. They have the highest values of the F1 score and the highest values of the ROC curve score, okay? So we can say these are the top two models in terms of accuracy, okay, using the F1 evaluation metric and the ROC AUC evaluation metric, okay? So these results are kind of summarized here, and
then we use different sampling techniques, okay. So just now I talked about different kinds of sampling techniques, and the idea of the different kinds of sampling techniques is to get a feel for the different distributions of the data in different areas of your dataset, so that you can make sure that your evaluation of accuracy is actually statistically correct, right? So we can do what is called oversampling and undersampling, which is very useful when you're working with an imbalanced data set. So this is an example of doing that, and
then here we, again, check out the results for all these different techniques we use: the F1 score and the AUC score, all right, these are the two key measures of accuracy, right? And then we can check out the scores for the different approaches. Okay, so we can see, oh well, overall the models have a lower ROC AUC score, but they have a much higher F1 score. The bagging classifier had the highest ROC AUC score, but its F1 score was too low, okay. Then, in the data scientist's opinion, the random forest with this particular sampling technique has an equilibrium between the F1 and ROC AUC scores. So takeaway one is that the macro F1 score improves dramatically using these sampling techniques, so these models might be better compared to the balanced ones, all right.
So based on all this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons. We keep refining our evaluation work here: we train the models one more time, so we again do a train-test split for each modeling approach. And then we
print out what is called a classification report, which is basically a summary of all those metrics I talked about just now: the confusion matrix, the accuracy, the precision, the recall, the ROC AUC score. With the classification report, I can get a summary of all of that and see all the values for this particular model, the bagging with Tomek links. Then I can do the same for another model, the random forest with borderline SMOTE, and for another model, the balanced bagging. So, again, we do a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us.
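For reference, a classification report like the ones shown here can be printed with scikit-learn; this is a minimal sketch with invented labels:

```python
from sklearn.metrics import classification_report

# Hypothetical predictions from one of the fitted models
y_true = [0, 1, 0, 1, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

# Per-class precision, recall, F1, and support in one summary table
print(classification_report(y_true, y_pred,
                            target_names=["no failure", "failure"]))
```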
Then, again, we have a confusion matrix. We generate a confusion matrix for the bagging with Tomek links undersampling, for the random forest with borderline SMOTE oversampling, and for the balanced bagging by itself. Then we compare these three models using their confusion matrices, and we can come to some conclusions.
All right, so now we look at all the data, and then we move on to another kind of evaluation metric, which is the ROC curve. This is one of the other evaluation metrics I talked about. It's a curve, and you look at the area underneath it; this is called the AUC ROC, the area under the ROC curve. The area under the curve score gives us some idea about the threshold we're going to use for classification, so we can examine this for the bagging classifier, for the random forest classifier, and for the balanced bagging classifier.
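Here is a minimal sketch of plotting a ROC curve and computing its AUC with scikit-learn and Matplotlib (both used in this project); the scores are invented:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and predicted failure probabilities
y_true  = [0, 1, 0, 1, 1, 0, 0, 1]
y_score = [0.2, 0.9, 0.1, 0.4, 0.8, 0.3, 0.6, 0.7]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # the "random guess" diagonal
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```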
Then, finally, we can check the classification report of this particular model. So we keep doing this over and over again, evaluating the accuracy and evaluation metrics for all these different models and for different classification thresholds, and as we keep drilling in, we get a better and better understanding of which of these models gives the best performance for our dataset. So
finally, we come to this conclusion: this particular model is not able to push the recall on failures beyond 95.18%. On the other hand, the balanced bagging model with a decision threshold of 0.6 is able to achieve a better recall, and so on. So finally, after having done all of these evaluations, this is the conclusion.
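Since these conclusions hinge on decision thresholds, here is a hedged sketch of how a custom threshold is typically applied to a classifier's predicted probabilities; the data and model here are stand-ins, not the project's actual code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score, precision_score

# Hypothetical stand-in data; the real project uses the sensor dataset
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = BaggingClassifier(random_state=0).fit(X_train, y_train)

# predict_proba gives the probability of the positive class (failure)
proba = model.predict_proba(X_test)[:, 1]

# The default .predict() is equivalent to a 0.5 threshold; lowering the
# threshold catches more failures (higher recall, lower precision)
for threshold in (0.4, 0.5, 0.6):
    y_pred = (proba >= threshold).astype(int)
    print(threshold,
          recall_score(y_test, y_pred),
          precision_score(y_test, y_pred, zero_division=0))
```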
So right now we, or rather the data scientist, have gone through all the steps of the machine learning life cycle up to validation. We have done the cleaning, exploration, preparation, and transformation, the feature engineering, we have developed and trained multiple models, and we have evaluated all those models, so at this stage we as the data scientist have essentially completed our job. We've come to some very useful conclusions which we can now share with our colleagues. Based on these conclusions or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision will be made based on the recommendations coming from the data scientist at the end of this phase. So at the end of this phase, the data scientist is going to come up with these conclusions. So
the conclusions are: if the engineering team is looking for the highest failure detection rate possible, then they should go with this particular model.
And if they want a balance between precision and recall, then they should choose between the bagging model with a 0.4 decision threshold or the random forest model with a 0.5 threshold. But if they don't care so much about predicting every failure and they want the highest precision possible, then they should opt for the bagging Tomek links classifier with a slightly higher decision threshold. And
so this is the key thing that the data scientist is going to deliver. This is the key takeaway, the
end result of the entire machine
learning life cycle. Right now the data
scientist is going to tell the
engineering team, all right you guys,
which is more important for you, point A,
point B, or point C. Make your decision. So
the engineering team will then discuss
among themselves and say, hey, you know what? We want the highest failure detection rate possible
because any kind of failure of that
machine or the product or the assembly
line is really going to screw us up big
time. So what we're looking for is the
model that will give us the highest
failure detection rate. We don't care about precision, but we want to make sure that if there's a failure, we are going to catch it. So that's what
they want, and so the data scientist will
say, hey you go for the balanced bagging
model. Then, the data scientist saves this model, and once it has been saved, you can go right ahead and deploy it to production.
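A common way to save and reload a scikit-learn model is with joblib; this is a minimal sketch, and the file name is invented:

```python
import joblib
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedBaggingClassifier

# Hypothetical stand-in for the chosen model and training data
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
model = BalancedBaggingClassifier(random_state=0).fit(X, y)

# Persist the fitted model to disk...
joblib.dump(model, "balanced_bagging_model.joblib")

# ...and later, on the production server, load it back and predict
loaded = joblib.load("balanced_bagging_model.joblib")
print(loaded.predict(X[:5]))
```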
Now, if we want to continue, we can actually take this modeling problem further. Just now, I modeled this problem as binary classification, which means it's either zero or one, either fail or not fail, but we can also model it as a multiclass classification problem, because,
as I said earlier, for the failure type column you actually have multiple kinds of failures. For example, you may have a power failure, a tool wear failure, or an overstrain failure. So now we can model the problem slightly differently, as a multiclass classification problem, and then we go through the same entire process we went through just now: we create different models and test them out, but now the confusion matrix is for multiclass classification.
So we're going to check them out. We again try different algorithms or models, and again do the train-test split for these different models. So we have, for example, a balanced random forest and a balanced random forest with grid search, where you train the models using what is called hyperparameter tuning. Then you get the same evaluation scores again, compare between them, and generate a confusion matrix, which is now a multiclass confusion matrix.
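As a hedged sketch of grid-search hyperparameter tuning, assuming scikit-learn's GridSearchCV and imbalanced-learn's BalancedRandomForestClassifier; the parameter grid and data are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from imblearn.ensemble import BalancedRandomForestClassifier

# Hypothetical multiclass stand-in: 0 = no failure, 1-3 = failure types
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=6,
                           random_state=0)

# Try every combination in the grid, scored by macro F1 across classes
grid = GridSearchCV(
    BalancedRandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, 10, None]},
    scoring="f1_macro",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```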
And then you come to the final conclusion. So if you are interested in framing your problem domain as a multiclass classification problem, then these are the recommendations from the data scientist. The data scientist will say, you know what, I'm going to pick this particular model, the balanced bagging classifier, and these are all the reasons the data scientist gives as the rationale for selecting this particular model. And then once that's done, you save the model, and that's it.
So that's all done now, and then the machine learning model can be put live and run on the server, and now it's ready to work, which means it's ready to generate predictions. That's the main job of the machine learning model. You have picked the best machine learning model with the best evaluation metrics for whatever accuracy goal you're trying to achieve. Now you're going to run it on a server, get all this real-time data coming in from your sensors, and pump that into your machine learning model, and the model will pump out a whole bunch of predictions, which we're going to use in real time for real-world decision making. You're going to say, okay, I'm predicting that this machine is going to fail on Thursday at 5:00 p.m., so you'd better get your service folks in to service it on Thursday at 2 p.m. So you can decide when you want to do your maintenance and make the best decisions to optimize the cost of maintenance.
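As a hedged sketch of what that real-time scoring step might look like, assuming the model saved earlier with joblib (the incoming feature values are invented stand-ins for sensor readings):

```python
import joblib

# Load the model that was saved at the end of the evaluation phase
# (file name follows the earlier hypothetical save step)
model = joblib.load("balanced_bagging_model.joblib")

# Hypothetical batch of real-time sensor readings, one row per machine;
# in the real system these would match the features used in training
incoming = [[0.1, -1.2, 0.7, 1.5, -0.3, 0.9, 0.0, -0.8, 1.1, 0.4,
             -0.5, 0.2, 0.6, -1.0, 0.3, 0.8, -0.2, 1.3, -0.7, 0.5]]

# Probability of failure drives the maintenance decision
proba = model.predict_proba(incoming)[0, 1]
if proba >= 0.6:  # decision threshold chosen during evaluation
    print(f"Failure risk {proba:.0%}: schedule maintenance now")
else:
    print(f"Failure risk {proba:.0%}: no action needed")
```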
And then, based on the results coming from the predictions, which may be good, lousy, or average, we constantly monitor how useful the predictions generated by this real-time model running on the server actually are. Based on our monitoring, we then take some new data and repeat this entire life cycle again. So this is basically an iterative workflow: the data scientist is constantly getting all these new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. And so that's it. That is
basically your machine learning workflow
in a nutshell. For this particular project we have used a bunch of Python data science libraries. We have used Pandas, the most fundamental data science library, which provides all the tools to work with raw data. We have used NumPy, a high-performance library for implementing complex array and matrix operations. We have used Matplotlib and Seaborn, which are used for the EDA, the exploratory data analysis phase of machine learning where you visualize all your data. We have used Scikit-learn, the machine learning library that implements all the core machine learning algorithms. We have not used deep learning libraries, because this is not a deep learning problem, but if you are working on a deep learning problem like image classification, image recognition, object detection, natural language processing, or text classification, then you're going to use the Python libraries TensorFlow and PyTorch.
And then lastly, the whole data science project you saw just now was developed in something called a Jupyter notebook. All the Python code, along with all the observations from the data scientist, was run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects. Okay, so that brings me to the
end of this entire presentation. I hope that you found it useful and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. All right, thank you all so much for watching.