Hello everyone, my name is Victor, and I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation I'd like to talk about a specific industry use case of AI and machine learning: predictive maintenance. I'll be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each of them. I'm going to start with a general overview of AI and machine learning, then discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we'll come to the meat of this presentation, which is a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance problem. All right, so without any further ado, let's jump into it.

Let's start with a quick overview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we're really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention, and the primary purpose of these algorithms is to optimize performance in a specific task. The main task we want to optimize is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine learning is about making predictions about the future, or what we call predictive analytics. There are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of these algorithms and their use cases in various areas of industry, and we can see that different algorithms are suited to different use cases.

Deep learning is an advanced form of machine learning based on something called an artificial neural network, or ANN for short, which simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools you've probably heard of today. I'm sure you've heard of ChatGPT, unless you've been living in a cave for the past two years; ChatGPT is an example of what we call a large language model, and it's based on deep learning. All the modern computer vision applications, where a program can classify, detect, or recognize images on its own, also use this particular form of machine learning. So here's an example of an artificial neural network: I have an image of a bird that's fed into the ANN, and the output is a classification of that image into one of three potential categories.
If the ANN has been trained properly, when we feed in this image, it should correctly classify it as a bird. This is an image classification problem, a classic use case for an artificial neural network in the field of computer vision. And just like machine learning in general, there is a variety of deep learning algorithms under the categories of supervised learning and unsupervised learning.

So here's how we can categorize all this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a subspecialization of machine learning using a particular architecture called the artificial neural network. Generative AI tools like ChatGPT, Google Gemini, and Microsoft Copilot are basically large language models, which are a further subcategory within deep learning. There are many applications of machine learning in industry right now, whichever industry you're involved in. I'm going to guess the vast majority of you watching this video come from the manufacturing industry, where some of the standard use cases for machine learning and deep learning are predicting potential problems (sometimes called predictive maintenance, where you want to predict when a problem is going to happen and address it before it does), monitoring systems, automating your manufacturing assembly or production line, smart scheduling, and detecting anomalies on your production line.

Okay, so let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: predictive maintenance is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. It uses advanced data models, analytics, and machine learning to reliably assess when failures are more likely to occur, including which components on your production or assembly line are more likely to be affected.

So where does predictive maintenance fit into the overall scheme of things? Consider the standard way factories tended to handle maintenance, say, 10 or 20 years ago. You'd probably start off with the most basic mode, which is reactive maintenance: you just wait until your machine breaks down, and then you repair it. It's the simplest approach, but if you've worked on a production line for any period of time, you know reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline; then you have a backlog of orders and you run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate, maybe once a month or once every two weeks.
This is better, but the problem is that sometimes you're doing more maintenance than is really necessary, and it still doesn't totally prevent failures that occur outside of your planned maintenance windows. So it's a bit of an improvement, but not that much better. The last two categories are where we bring in AI and machine learning. With machine learning, we use sensors to do real-time monitoring, and using that data we build a machine learning model that helps us predict, with a reasonable level of accuracy, when the next failure is going to happen on a specific machine or component on your assembly or production line, ideally down to the specific day, hour, or even minute.

These are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. Of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data you're collecting.

Okay, so let's take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them you can read about real-life uses of machine learning in predictive maintenance. The IBM website walks through five use cases: waste management, manufacturing, building services, renewable energy, and mining. You can click through and follow up on them if you want to read more. And this other website is a pretty good one; I'd really encourage you to look through it if you're interested in predictive maintenance. It presents an industry survey of predictive maintenance, in which a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, is essential for the manufacturing industry, and will gain additional strength in the future. The survey was done quite some time ago, but we can see that the vast majority of key industry players in the manufacturing sector consider predictive maintenance a very important activity they want to incorporate into their workflow. We can also see the kind of ROI expected on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. And best of all, if you really want concrete examples, there are all these companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, Ecoplant.
So you can jump over and take a look at some of these use cases. Let me open one up, for example, Mondi. Mondi used MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. You can study how they used it: the challenges and problems they were facing, the solution they built with MathWorks Consulting, and the data they collected in an Oracle database. Using MATLAB, they were able to create a deep learning model to solve this particular issue in their domain. So if you're interested, I strongly encourage you to read up on all these real-life customer stories showcasing use cases for predictive maintenance.

Okay, that's it for real-life use cases. In this next topic I'm going to talk about machine learning basics, what's actually involved in machine learning, as a very quick, high-level conceptual overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category, supervised learning, because the particular use case I'm discussing here, predictive maintenance, is basically a form of supervised learning.

So how does supervised learning work? In supervised learning, you create a machine learning model by providing what is called a labelled dataset as input to a machine learning algorithm. This dataset contains a set of independent or feature variables, and one dependent or target variable, which we also call the label. The idea is that the independent (feature) variables are the attributes or properties of your data that influence the dependent (target) variable. This process is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's look at a diagram that illustrates it more clearly.

Let's say you have a dataset here, an Excel spreadsheet with a bunch of columns and rows. The rows represent what we call observations, samples, or data points. Assume this dataset was gathered by a marketing manager at a retail mall, and it holds information about the customers who purchase products at the mall: their gender, their age, their income, and their number of children. All this information about the customers, we call the independent or feature variables. And alongside that, we've also recorded how much each customer spends.
These spending numbers are the target or dependent variable. So a single row, one data point, contains the values for all the feature variables and one value for the label or target variable. The primary purpose of the machine learning model is to create a mapping from your feature variables to your target variable: a mathematical function that maps the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. This training process is also called fitting the model, and the target variable or label, this column here, is critical for providing the context to do the fitting or training. Once you have a trained, fitted model, you can use it to make accurate predictions of target values for new feature values that the model has never encountered, and this, as I said earlier, is called predictive analytics.

So let's see what's actually happening here. You take your training data, this entire dataset of maybe a thousand or ten thousand rows, you feed it into your machine learning algorithm, and a couple of hours later the algorithm produces a model. The model is essentially a function that maps your feature variables, these four columns, to your target variable, this one single column. Once you have the model, you can put in a new data point representing a customer you've never seen before. Say you already have information on 10,000 customers and how much each of them spent at the mall, and now a totally new customer comes in: male, age 50, income 18, nine children. You feed that into your model and it makes a prediction: based on everything it was trained on, it predicts that this customer is going to spend 25 ringgit at the mall. And that's it, right there: the final output of your machine learning model is a prediction about something it has never seen before. That is the core of machine learning: predictive analytics, making predictions about the future based on a historical dataset.

Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label.
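To make this concrete, here's a minimal sketch of the mall-customer example in Python with scikit-learn. The data values and column names are hypothetical, just echoing the slide, and the random forest regressor is one possible choice of algorithm, not necessarily what any particular tool uses:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Historical (labelled) data: four feature variables plus the "spend" label
data = pd.DataFrame({
    "gender":   [1, 0, 1, 0, 1, 0],          # 1 = male, 0 = female (encoded)
    "age":      [25, 34, 41, 29, 58, 63],
    "income":   [30, 45, 60, 38, 20, 75],
    "children": [0, 2, 3, 1, 4, 2],
    "spend":    [12, 40, 55, 28, 15, 80],
})

X = data[["gender", "age", "income", "children"]]   # independent variables
y = data["spend"]                                   # dependent variable (label)

# "Fitting the model": learn the mapping from features to target
model = RandomForestRegressor(random_state=42).fit(X, y)

# A brand-new customer: male, age 50, income 18, nine children
new_customer = pd.DataFrame([{"gender": 1, "age": 50, "income": 18, "children": 9}])
print(model.predict(new_customer))   # the model's predicted spend
```

In real life you'd have thousands of rows rather than six, but the shape of the workflow is the same: fit on historical data, then predict for unseen data points.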
For classification you can have either binary or multiclass. Binary means just two classes: true or false, zero or one. Is your machine going to fail or not? Is the customer going to make a purchase or not? Two possible outcomes; we call this binary classification. Multiclass is when there are more than two classes or types of values. So, for example, here's a classification problem: you have a dataset with information about your customers, the gender, age, and salary of each customer, plus a record of whether the customer made a purchase or not. You can use this dataset to train a classification model, and the model can then make a prediction about a new customer: zero, meaning the customer won't make a purchase, or one, meaning they will.

And this is regression: say you want to predict the wind speed, and you have historical data for four other feature variables, the temperature, pressure, relative humidity, and wind direction, recorded over the past 10 or 15 days. You train your machine learning model on this dataset, and the target variable column, the label, is a number. That makes this a regression model: you put in a new data point, a new set of values for temperature, pressure, relative humidity, and wind direction, and your model predicts the wind speed for that data point. So that's a regression model.
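For comparison with the regression sketch above, here's the purchase example as a minimal binary classification sketch. Again, the data and column names are hypothetical, and logistic regression is just one of many possible classification algorithms:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Historical data: did each customer make a purchase (1) or not (0)?
data = pd.DataFrame({
    "gender":    [1, 0, 1, 0, 1, 0],
    "age":       [22, 35, 47, 52, 28, 61],
    "salary":    [25, 60, 80, 40, 30, 90],
    "purchased": [0, 1, 1, 0, 0, 1],    # categorical target: two classes
})

X, y = data[["gender", "age", "salary"]], data["purchased"]
clf = LogisticRegression().fit(X, y)

# Predict the class (0 or 1) for a brand-new customer
print(clf.predict(pd.DataFrame([{"gender": 1, "age": 40, "salary": 55}])))
```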
All right, so in this next topic I'm going to talk about the workflow that's involved in machine learning. In the previous slides I talked about developing the model, but that's just one part of the whole thing; in real life, when you use machine learning, there's an end-to-end workflow. The first thing, of course, is you need to get your data, then you need to clean your data, and then you need to explore it to see what's going on in your dataset. Real-life datasets are not trivial: hundreds of rows, thousands of rows, sometimes millions or billions of data points, especially if you're using IoT sensors to collect data in real time. So you have these very large datasets, you need to clean and explore them, and then prepare them into the right format so you can feed them into the training process to create your machine learning model. Then you check how good the model is: how accurate are the predictions coming out of it? That's validating or evaluating your model. If you determine that the model's accuracy is adequate for your domain use case requirements, say the accuracy required for your use case is 85% and your model can deliver that, then it's good enough and you deploy it for real-world use. The model gets deployed on a server, data is captured from various sources and pumped into the model, the model generates predictions, and those predictions are used to make decisions on the factory floor in real time, or in whatever your scenario is. Then you constantly monitor and update the model, you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell.

Here's another view of the same thing in a slightly different format. Again, you have data collection and preparation. This version says more about the different kinds of algorithms available for creating a model, which I'll cover in more detail when we look at the real-world example of an end-to-end workflow for the predictive maintenance use case. Once you've chosen an appropriate algorithm and trained your model, you select the best trained model among the candidates: you'll probably develop multiple models from multiple algorithms, evaluate them all, and then say, after evaluating and testing, I've chosen the best model and I'm going to deploy it for real-life production use. Real-life sensor data is pumped into the model, the model generates predictions, the predicted data is used immediately in real time for real-life decision making, and then you monitor the results. If somebody using the predictions finds they're lousy, the monitoring system captures that; if they're fantastic, that's captured too, and it all feeds back into the next cycle of the machine learning pipeline.

Okay, so that's the overall view; now for the key phases of the workflow. One important phase is EDA, exploratory data analysis. In this phase you do a lot of work, primarily just to understand your dataset. Like I said, real-life datasets tend to be very complex and have various statistical properties, and statistics is a very important component of machine learning. EDA helps you get an overview of your dataset and of any problems in it: missing data, the statistical properties and distribution of your data, the statistical correlation of variables, and so on. Then we have data cleaning, or data cleansing. In this phase you primarily do things like removing duplicate records or rows, making sure your data points or samples have appropriate IDs, and, most importantly, making sure there aren't too many missing values in your dataset. What I mean by missing values is something like this: for some reason, some cells or locations in your dataset simply have no value.
And if you have a lot of these missing values, you've got a poor-quality dataset, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your dataset and how to handle them. Another important part of data cleansing is figuring out the outliers in your dataset: data points that are very far from the general trend of the data. There are several ways to detect outliers and several ways to handle them, and similarly there are several ways to handle missing values. Handling missing values and handling outliers are the two key jobs of data cleansing, and there are many, many techniques for both, so a data scientist needs to be acquainted with all of this.

All right, why do I need to do data cleansing? Here's the key point: if you have a very poor-quality dataset, lots of outliers and errors, lots of missing values, then even with a fantastic algorithm and a fantastic model, the predictions your model gives will be absolutely rubbish. It's like putting water into the tank of a Mercedes-Benz. The Mercedes is a great car, but if you fill it with water, it just dies; it can't run on water. On the other hand, take a Myvi, a cheap and basic little car, fill it with good high-octane petrol, and it'll fly along at 100 miles an hour and completely destroy the Mercedes-Benz in performance. The point is, it doesn't really matter how fantastic your model is; if your data is lousy quality, your predictions are also going to be rubbish. Cleansing the dataset is probably the most important thing data scientists do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithms: that's really a small portion of the actual machine learning workflow. The vast majority of the time goes into cleaning and organizing your data.
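As a small taste of what data cleansing looks like in practice, here's a hedged sketch of typical checks with pandas. The filename and column name are placeholders, and the strategies shown (dropping rows, filling with the column mean, a simple standard-deviation outlier rule) are just a few of the many options a data scientist might choose between:

```python
import pandas as pd

df = pd.read_csv("your_dataset.csv")   # placeholder filename

# How many missing values does each column have?
print(df.isnull().sum())

# Option 1: drop rows that contain missing values
df_dropped = df.dropna()

# Option 2: fill missing numeric values with the column mean
df_filled = df.fillna(df.mean(numeric_only=True))

# Remove duplicate rows
df = df.drop_duplicates()

# A simple outlier check: flag points more than 3 standard deviations
# from the mean of a numeric column (column name is a placeholder)
col = df["some_numeric_column"]
outliers = df[(col - col.mean()).abs() > 3 * col.std()]
print(len(outliers), "potential outliers")
```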
Then you have something called feature engineering, where you preprocess the feature variables of your original dataset before using them to train the model, through addition, deletion, combination, or transformation of those variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, to transform categorical data into numeric data. Now, in the earlier slides I showed you taking the dataset exactly as it is, pumping it into the algorithm, and getting a model a couple of hours later, without doing anything to the feature variables first. But that's not what generally happens in real life. In real life, you take all the original feature variables from your dataset and transform them in some way. So here you can see the columns of data from the original dataset, and before those data points actually go into the algorithm for training, they get transformed. The transformation of these feature variable values is what we call feature engineering, and there are many, many techniques for it: one-hot encoding, scaling, log transformation, discretization, date extraction, boolean logic, and so on.

Then, finally, we do something called a train-test split, where we take the original dataset and break it into two parts: a training dataset and a test dataset. The training dataset is what we use to feed and train the machine learning model, and the test dataset is what we use to evaluate the accuracy of the model. This is a key part of your machine learning life cycle, because you're not going to have just one possible model; there's a vast range of algorithms you can use. Fundamentally you have a wide range of choices, like buying a car: you can buy a Perodua Myvi, a Honda, a Mercedes-Benz, an Audi, a BMW, many different cars are available to you. Same thing with machine learning: there's a vast variety of algorithms to choose from, and different algorithms will create different models with different rates of accuracy. So once you create a model from a given algorithm, you need to ask: how accurate is this model? Is it adequate for a real-life production use case? That's the primary purpose of the test dataset.

So: I take my original dataset, separate the feature variable columns from the target variable column, and then further break it into a training dataset and a test dataset. The training dataset is used to create the machine learning model, and once the model is created, the test dataset is used to evaluate its accuracy. And finally, here's roughly how the effort breaks down across the parts of a successful model: EDA about 10%, data cleansing about 20%, feature engineering about 25%, selecting a specific algorithm about 10%, training the model about 15%, and evaluating the models and deciding which has the highest accuracy, about 20%.
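Before we jump into the demo, here's a compact sketch tying feature engineering and the train-test split together with scikit-learn. The filename and column names are hypothetical, and one-hot encoding is just one of the techniques listed above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("your_dataset.csv")    # placeholder filename

# Feature engineering example: one-hot encode a categorical column
# (pd.get_dummies creates one 0/1 column per category)
df = pd.get_dummies(df, columns=["gender"])

# Separate the feature columns from the target column
X = df.drop(columns=["spend"])   # feature variables
y = df["spend"]                  # target variable

# Hold out 20% of the rows as the test dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), "training rows,", len(X_test), "test rows")
```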
All right, we've reached the most interesting part of this presentation, which is the demonstration of an end-to-end machine learning workflow on a real-life dataset for the predictive maintenance use case. For this I've used a dataset from Kaggle. For those of you who aren't aware of it, Kaggle is the world's largest open community for data science and AI; it hosts a large collection of datasets from all areas of industry and human endeavor, along with a large collection of models developed on those datasets. Here we have a dataset for our particular use case, predictive maintenance, and here's some information about it; in case you don't know how to get there, this is the URL to click on. Once you're on the page for the dataset, you can see all the information about it, and you can download it in CSV format.

Okay, let's take a look at the dataset. It has a total of 10,000 samples. These are the feature variables: the type, the product ID, the air temperature, process temperature, rotational speed, torque, and tool wear. And this is the target variable, which is what we're interested in: it's what we use to train the machine learning model, and it's what we want to predict. The feature variables describe a particular machine on the production or assembly line. Say you have an IoT sensor system capturing all this data about a product or machine on your line; for each sample, you've also captured whether that sample experienced a failure or not. A target value of zero indicates no failure, and we can see that the vast majority of data points in this dataset are no-failure. Here's an example of a failure: a failure is marked as one (positive), and no failure as zero (negative). This particular one is a power failure, and if you scroll down the dataset you'll see other kinds of failures too: a tool wear failure, an overstrain failure, another power failure, and so on. If you scroll through all 10,000 data points, or if you're familiar with using Excel to filter values in a column, you'll see that in the target variable column the vast majority of values are zero, meaning no failure, and some of the rows have a value of one, with different types of failure recorded for them. We're going to go through the entire machine learning workflow with this dataset, and to see an example of that, we're going to go to the code section here.
Right down here we have what's called a dataset notebook. This is basically a Jupyter notebook; Jupyter is a Python application that lets you create a Python machine learning program that builds your machine learning model, evaluates its accuracy, and generates predictions from it. There's a whole bunch of Jupyter notebooks available here and you can select any one of them; all of these notebooks process the data from this particular dataset. On this code page I've selected a specific notebook that I'll run through to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python ecosystem, and you can get the URL for that notebook here.

Okay, let's quickly recap what we're trying to do: we're building a machine learning classification model. We said there are two primary areas of supervised learning: regression, which is used to predict a numerical target variable, and classification, which is what we're doing here, predicting a categorical target variable. In this particular example, there are actually two ways we can frame the classification: binary or multiclass. For binary classification, we only classify the product or machine as either failed or not failed. If you go back to the dataset I showed you, the target variable column has only two possible values: zero, meaning no failure, and one, meaning failure. Two possible outcomes, so that's binary classification. But we can also extend the same dataset into a multiclass classification problem. If we want to drill down further, we don't just say there was a failure; we distinguish the different types of failure. So we have one class that is no failure, and then a class for each failure type: a power failure, a tool wear failure, an overstrain failure, and so on. Multiple classes of failure in addition to the overall majority class of no failure: that's a multiclass classification problem. With this dataset, we're going to see how to treat it both ways.

Okay, so let's follow the workflow. We've already got the data: let's assume we've somehow managed to collect this dataset from IoT sensors doing real-time monitoring in our production environment, with sensors on the assembly or production line reading everything we have in this CSV file. Data retrieved; now we move on to the cleaning and exploration part of the machine learning life cycle.
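Before diving into the notebook's cleaning code: if you want a first look at the raw CSV yourself, something like this works. The filename is a placeholder for whatever you downloaded from Kaggle, and the column names are as they appear in that CSV (adjust if yours differ):

```python
import pandas as pd

# Load the predictive maintenance dataset downloaded from Kaggle
df = pd.read_csv("predictive_maintenance.csv")   # placeholder filename

print(df.shape)                           # expect 10,000 rows
print(df.dtypes)                          # column names and data types
print(df["Target"].value_counts())        # mostly 0 (no failure), few 1s
print(df["Failure Type"].value_counts())  # breakdown of the failure types
```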
All right, so let's look at the data cleaning part first. Here we're interested in checking for missing values, and maybe removing the rows with missing values. There are several things we can do about missing values: remove the rows entirely, or put in replacement values, such as the average of all the values in that particular column, and so on. We can also try to identify outliers in the dataset, and again there's a variety of ways to deal with those. This is data cleansing, a really important part of your machine learning workflow. That's where we are now; we'll follow up with exploration.

So let's look at the actual notebook code that does the cleansing. We're right at the start of the machine learning life cycle in this Jupyter notebook, which opens with a brief description of the problem statement: this dataset reflects real-life predictive maintenance data encountered in industry, with measurements from real equipment, and the feature descriptions are taken directly from the data source. There are six key features: the type (the quality of the product), the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. And there are two target variables. The first is the target column I showed you just now, which has only two possible values, zero or one, no failure or failure. The second is the failure type column, which, as I demonstrated, records several categories of failure; that's what we'd use for multiclass classification. So we can either build a binary classification model for this problem domain or a multiclass classification model, and this Jupyter notebook demonstrates both approaches.

First step: we write the Python code that imports all the libraries we need, the relevant machine learning libraries for our domain use case. Then we load in the dataset, describe it, and get some quick insights into it. We look at all the feature variables and get a quick overview: how many rows there are, how many columns, the data types of the columns, the names of the columns, and so on. All this Python code just gives us, the data scientists, a quick overview of the dataset. Then we zoom in on the target variables: how many counts there are of each target value, how many different types of failure there are. And then we want to check whether there are any inconsistencies between the target and the failure type columns.
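Here's a sketch of what that consistency check might look like; the notebook's actual code may differ, and the column names are as in the Kaggle CSV:

```python
# Rows flagged as a failure in "Target" but labelled "No Failure"
# in "Failure Type" contradict each other and are worth inspecting.
inconsistent = df[(df["Target"] == 1) & (df["Failure Type"] == "No Failure")]
print(len(inconsistent), "inconsistent rows found")
print(inconsistent.head())
```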
Okay, so when you do this checking, you discover there are indeed some discrepancies in the dataset. The code finds nine values that are classified as failure in the target variable but as no-failure in the failure type variable. That's a discrepancy in the data, because a target value of one is supposed to mean there is a failure, so we expect to see a failure classification, yet some rows say no failure although the target is one. This is a classic example of an error that can very well occur in a dataset, and the question is what to do with these errors. Here the data scientist says it makes sense to remove those instances, and writes some code to drop those rows or data points from the overall dataset. Checking for other issues turns up another warning, and those rows can be removed too. In total, 27 instances get removed from a dataset of 10,000 rows, which is only 0.27% of the entire dataset, and these were the reasons for removing them. If you're removing just 0.27% of the dataset, that's no big deal, but it needed to be done, because those 27 data points with errors could really affect the training of your machine learning model. So that's data cleansing in action: we've cleaned out data that was incorrect or erroneous in the original dataset.

Then we go on to the next part, which is EDA, where we explore the data. We want to get a visual overview of the data as a whole and look at its statistical properties: the statistical distribution of the data in the various columns, the correlation between the feature variables, and the correlation between the feature variables and the target variable. All of this is EDA, and EDA in a machine learning workflow is typically done through visualization. For example, here we look at correlation: we plot the values of the feature variables against each other in a pair plot and look for potential correlations and patterns. The different shapes you see in the pair plot have different statistical meanings, so the data scientist visually inspects it and interprets the patterns. Some of the insights deduced from it: torque and rotational speed are highly correlated, process temperature and air temperature are also highly correlated, and failures tend to occur at extreme values of some features.
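Here's a sketch of how you might produce these EDA plots with seaborn; the exact plots in the notebook may differ, and the column names are as in the Kaggle CSV:

```python
import seaborn as sns
import matplotlib.pyplot as plt

num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# Pair plot: every numeric feature plotted against every other,
# colored by failure (1) vs no failure (0)
sns.pairplot(df, vars=num_cols, hue="Target")
plt.show()

# Correlation heat map of the numeric features
sns.heatmap(df[num_cols].corr(), annot=True, cmap="coolwarm")
plt.show()
```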
Another useful chart is the violin plot. For torque and rotational speed, for example, it shows that most failures are triggered at values much lower or much higher than the mean of the non-failing points. All these visualizations are there for a trained data scientist to look at, inspect, and draw insightful deductions from. There's the percentage of failures, the correlation heat map between the feature variables and the target variable, the product types, and the percentage of failures with respect to product type: certain product types have a higher ratio of failure than others; M products tend to fail more than H products, for instance. So you can create a wide variety of visualizations in the EDA stage, and the point of all of them is to give us some preliminary insight into the dataset that helps us model it more correctly.

We can also plot the distributions of the variables, to see whether each is a normal distribution or some other kind, and draw box plots to check for outliers. From the box plots we can see that rotational speed and torque have outliers. We already saw that outliers are a problem you may need to tackle as part of data cleansing, so we check where the potential outliers are and analyze them from the box plots, together with the mean and maximum values with respect to product type. The insight here: about 4.8% of the instances are outliers, which isn't really that much, and they're not horrible outliers, so we can tolerate them and just leave them in the dataset. For a different dataset, the data scientist could come to a different conclusion and would then do whatever is deemed appropriate to cleanse it.

Okay, so now that we've done all the EDA, the next thing is feature engineering. We take our original feature variables and transform them into some other form before feeding them in for training. These are some examples of feature engineering transformations; you don't have to use all of them. In this notebook we do ordinal encoding of the type column, and we scale the dataset using MinMax scaling. And then, finally, we come to the modeling.
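Here's a sketch of those two feature engineering steps with scikit-learn. The encoding order L < M < H is my assumption about the product quality ranking, and the column names are as in the CSV:

```python
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler

# Ordinal-encode the product quality type: L -> 0, M -> 1, H -> 2
df[["Type"]] = OrdinalEncoder(categories=[["L", "M", "H"]]).fit_transform(df[["Type"]])

num_cols = ["Air temperature [K]", "Process temperature [K]",
            "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# MinMax scaling squeezes every numeric feature into the [0, 1] range
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])
```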
So we have to split our dataset into a training dataset and a test dataset. Coming back to what we said before: before you train your model, you take your original (now feature-engineered) dataset and break it into two or more subsets. One is the training dataset, used to feed and train the machine learning model; the second is the test dataset, used to evaluate the accuracy of the model. We also need to decide how to sample: which points from the original dataset go into the training dataset, and which go into the test dataset. There are many ways to do sampling. One is stratified sampling, where we ensure that each stratum or class appears in the training and test datasets in the same proportion as in the original dataset, which is very useful for dealing with what's called an imbalanced dataset. And this is exactly one of those: the vast majority of data points in this dataset have a target value of zero, and only an extremely small minority have a target value of one. When the vast majority of target values come from one class and a tiny minority from another, we call the dataset imbalanced, and for an imbalanced dataset we typically use stratified sampling for the train-test split. That's exactly what happens here: a stratified train-test split.

And then we actually develop the models. For classification there's a whole bunch of possibilities; there are many different algorithms we can use to create a classification model, and some of the more common ones are logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, and ensembles. All these different algorithms create different kinds of models with different accuracy measures, and it's the goal of the data scientist to find the model that gives the best accuracy for training on the given dataset. So heading back to our workflow: here we create a whole bunch of models, a random forest, a balanced bagging classifier, a boosting classifier, and an ensemble classifier, and we train each of them on the training data. Then we evaluate how good each of these models is; here you can see the evaluation data, including the confusion matrix, which is another way of evaluating.
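Here's a condensed sketch of the stratified split and model training, assuming `df` has already been cleaned and feature-engineered as above. The identifier column names are assumptions; they're dropped because IDs carry no predictive signal:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier

X = df.drop(columns=["UDI", "Product ID", "Target", "Failure Type"])
y = df["Target"]

# Stratified split keeps the failure/no-failure ratio equal in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Two of the candidate algorithms, trained on the same training data
models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "bagging": BaggingClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```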
So now we come to the key part: I've got all these different models, built with different algorithms and trained on the same dataset; how do I distinguish between them? For that we have a whole bunch of common evaluation metrics for classification, which tell us how good a model is in terms of its classification accuracy. You might think accuracy is just accuracy, either it's accurate or it's not, but it's not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. The confusion matrix tells us how many true positives there are (the actual value is positive and the prediction is positive), how many false positives (the actual value is negative but the model predicts positive), how many false negatives (the model predicts negative but the actual value is positive), and how many true negatives (the model predicts negative and the actual value is also negative). That's one way to assess or evaluate the performance of a classification model; this one is for binary classification, and there are multiclass confusion matrices too. Then we can also measure accuracy itself: the true positives plus the true negatives, which is the total number of correct predictions made by the model, divided by the total number of data points. There are also other measures, such as recall and the F1 score (these are their formulas), and something called the ROC curve. Without going too much into the detail of each, these are all different KPIs. Just like employees in a company have KPIs that measure how effective a particular employee is, the KPIs for your machine learning models are the ROC curve, F1 score, recall, accuracy, and the confusion matrix.

So after building my four different models, I check and evaluate them using all these metrics. For this model I can check the ROC score, the F1 score, the precision score, and the recall score; then for the next model the same set, and so on. For every single model created from my training dataset, I have a full set of evaluation metrics, plus the confusion matrix, which I can use to compare all four models, and then I summarize everything. From the summary we can see the top two models, which I'm now going to focus on as the data scientist: the bagging classifier and the random forest classifier. They have the highest F1 scores and the highest ROC AUC scores, so we can say these are the top two models in terms of accuracy by those two evaluation metrics.
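Continuing the sketch above, here's how those metrics could be computed with scikit-learn for one of the models:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = models["random_forest"].predict(X_test)
y_prob = models["random_forest"].predict_proba(X_test)[:, 1]  # P(failure)

# scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, y_pred))
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
```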
These results get summarized, and then we try different sampling techniques. The idea of trying different sampling techniques is to get a feel for different distributions of the data in different areas of your dataset, to make sure the accuracy evaluation is statistically sound. In particular, we can do what's called oversampling and undersampling, which is very useful when you're working with an imbalanced dataset. Here's an example of doing that, and then we check the results for each technique: the F1 score and the ROC AUC score, the two key measures of accuracy, for each approach. What we see is that overall the models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC AUC score, but its F1 score was too low; and in the data scientist's opinion, the random forest with this particular sampling technique strikes an equilibrium between the F1 and ROC AUC scores. The takeaway is that the macro F1 score improves dramatically using these sampling techniques, so these models might be better compared to the balanced ones. Based on all this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons.

So we keep refining the evaluation work. We train the models one more time, again doing a train-test split, and then print out what's called a classification report, which is basically a summary of all the metrics I just talked about: the confusion matrix, accuracy, precision, recall, and ROC AUC score. With the classification report I can see all those values for a particular model, bagging with Tomek links undersampling; then for another model, random forest with Borderline-SMOTE oversampling; and then for another, the balanced bagging. Again, a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us. Then we generate confusion matrices, for the bagging with Tomek links undersampling, for the random forest with Borderline-SMOTE oversampling, and for plain balanced bagging, compare the three models on those, and come to some conclusions.
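Those resampling techniques come from the imbalanced-learn library. Here's a hedged sketch of both directions, assuming `X_train` and `y_train` from the earlier split:

```python
# pip install imbalanced-learn
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import TomekLinks

# Oversampling: synthesize extra minority-class (failure) points
# near the class boundary with Borderline-SMOTE
X_over, y_over = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)

# Undersampling: remove majority-class points that form Tomek links
# (ambiguous nearest-neighbor pairs straddling the class boundary)
X_under, y_under = TomekLinks().fit_resample(X_train, y_train)

print("original class counts:    ", y_train.value_counts().to_dict())
print("oversampled class counts: ", y_over.value_counts().to_dict())
print("undersampled class counts:", y_under.value_counts().to_dict())
```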
This one is a curve, and you look at the area underneath it; this is called the AUC ROC, the area under the ROC curve. The AUC score gives us some idea about the threshold we should use for classification, so we can examine this for the bagging classifier, the random forest classifier, and the balanced bagging classifier, and finally check the classification report of each model. So we keep doing this over and over, evaluating the accuracy and evaluation metrics for all these different models at different classification thresholds, and as we keep drilling in, we get a better and better understanding of which model gives the best performance for our dataset.

Finally, we come to a conclusion: this particular model is not able to push the recall on failures beyond 95.18%, while balanced bagging with a decision threshold of 0.6 achieves a better recall, and so on. So after all of these evaluations, this is the conclusion. At this point we have gone through all the steps of the machine learning life cycle: the data scientist has done the cleaning, exploration, preparation, transformation, and feature engineering, developed and trained multiple models, and evaluated all of them, so the validation stage is complete and the data scientist's job is essentially done. We've arrived at some very useful conclusions that we can now share with our colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision will be made based on the recommendations the data scientist delivers at the end of this phase.

So the conclusions are as follows. If the engineering team is looking for the highest failure detection rate possible, they should go with this particular model. If they want a balance between precision and recall, they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold. But if they don't care so much about predicting every single failure and want the highest precision possible, they should opt for the bagging Tomek links classifier with a somewhat higher decision threshold. This is the key deliverable from the data scientist, the end result of the entire machine learning life cycle. The data scientist tells the engineering team: you decide which matters most to you, option A, option B, or option C. The engineering team will then discuss among themselves and say, hey, you know what?
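Here is a small sketch of that threshold-sweeping idea, continuing the hypothetical names (bb, X_test, y_test) from the previous sketches; the actual notebook may do this differently:

```python
# Sketch of decision-threshold tuning: instead of the default 0.5 cutoff,
# sweep the threshold over the predicted failure probabilities and watch
# how precision and recall trade off.
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_prob = bb.predict_proba(X_test)[:, 1]  # P(failure) for each test sample
print("ROC AUC:", roc_auc_score(y_test, y_prob))

for threshold in np.arange(0.3, 0.8, 0.1):
    y_pred = (y_prob >= threshold).astype(int)  # classify "failure" above cutoff
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, y_pred, zero_division=0):.3f}  "
          f"recall={recall_score(y_test, y_pred):.3f}")
```

Lowering the threshold catches more failures (higher recall) at the cost of more false alarms (lower precision), which is exactly the trade-off the engineering team has to weigh in the next step.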
What we want is the highest failure detection rate possible, because any failure of that machine, the product, or the assembly line is really going to hurt us badly. So we're looking for the model that gives us the highest failure detection rate; we don't care as much about precision, but we want to be sure that if there's a failure, we catch it. That's what they want, so the data scientist says: go for the balanced bagging model. The data scientist then saves that model, and once it's saved, you can go right ahead and deploy it to production.

Now, if we want, we can take this modeling problem further. Just now, I modeled it as a binary classification problem, meaning the output is either zero or one, fail or not fail. But we can also model it as a multiclass classification problem, because, as I said earlier, the failure type column actually contains multiple kinds of failures: for example, a power failure, a tool wear failure, or an overstrain failure. So we model the problem slightly differently, as a multiclass classification problem, and go through the same end-to-end process again: we create different models and test them out, except that now the confusion matrix is for a multiclass classification problem. Again, we try different algorithms and models and do the train-test split for each of them. For example, we have a balanced random forest and a balanced random forest with grid search, where we train the models using what is called hyperparameter tuning, and then we get the same evaluation scores again. We check the scores, compare them, generate a confusion matrix, in this case a multiclass confusion matrix, and come to a final conclusion. So if you are interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist, who will say: I'm going to pick this particular model, the balanced bagging classifier, and here is the rationale for selecting it. Once that's done, you save the model, and that's it.

Now the machine learning model can go live and run on the server, where it's ready to do its main job, which is to generate predictions. You have picked the best model, with the best evaluation metrics for whatever accuracy goal you're trying to achieve. You run it on a server, take all the real-time data coming in from your sensors, pump it into the model, and the model pumps out predictions, which we use in real time for real-world decision making.
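The presentation doesn't show the exact persistence code, but a common pattern for saving and reloading a scikit-learn model is joblib, so here is a sketch of what that save-and-deploy step might look like; the file name, decision threshold, and sensor reading below are made up for illustration:

```python
# Sketch of the save-and-deploy step, assuming the common joblib pattern.
# 'bb' is the chosen balanced bagging model from the earlier sketches.
import joblib
import numpy as np

joblib.dump(bb, "balanced_bagging_model.joblib")  # done by the data scientist

# Later, on the production server:
model = joblib.load("balanced_bagging_model.joblib")

# Stand-in for one incoming row of live sensor features (20 features,
# matching the synthetic training data above).
incoming = np.random.default_rng(0).normal(size=(1, 20))

prob_failure = model.predict_proba(incoming)[0, 1]
if prob_failure >= 0.6:  # the decision threshold chosen during evaluation
    print(f"Failure risk {prob_failure:.2f} -- schedule maintenance now")
```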
You're going to say: I'm predicting that this machine is going to fail on Thursday at 5:00 p.m., so you'd better get your service folks in to service it by Thursday at 2:00 p.m., or whatever the case may be. So you can decide when you want to do your maintenance, and make the best decisions to optimize the cost of maintenance. And then, based on the results coming out of the model, the predictions may be good, lousy, or average, so we constantly monitor how useful the predictions generated by this real-time model running on the server actually are. Based on that monitoring, we take in new data and repeat this entire life cycle again. So this workflow is iterative: the data scientist is constantly taking in new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. And that's it; that is your machine learning workflow in a nutshell.

For this particular project, we used a number of Python data science libraries. We used Pandas, the most fundamental data science library, which provides all the tools for working with raw data. We used NumPy, a high-performance library for complex array and matrix operations. We used Matplotlib and Seaborn, which are used for the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data. And we used scikit-learn, the machine learning library that implements all the core machine learning algorithms. We did not use the deep learning libraries here, because this is not a deep learning problem, but if you are working on a deep learning problem such as image classification, image recognition, object detection, or natural language processing and text classification, then you would use TensorFlow or PyTorch. And lastly, the entire data science project you just saw, all the Python code along with the data scientist's observations, was developed and run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects.

That brings me to the end of this presentation. I hope you found it useful, and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. Thank you all so much for watching.