Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI, or machine learning, which is predictive maintenance. I will be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each of them. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is essentially a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance domain problem. All right, so without any further ado, let's jump into it.

Let's start off with a quick preview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so really, when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention. The primary purpose of these algorithms is to optimize performance in a specific task, and the primary task you want to optimize performance in is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine learning is about making predictions about the future, or what we call predictive analytics. And there are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning.
And here we can see some of the different kinds of algorithms and their use cases in various areas of industry. We have various domain use cases for all these different kinds of algorithms, and we can see that different algorithms are suited to different use cases.

Deep learning is an advanced form of machine learning that's based on something called an artificial neural network, or ANN for short, which essentially simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools that you have probably heard of today. I'm sure you have heard of ChatGPT if you haven't been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and that's based on this technology called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, use this particular form of machine learning called deep learning. So this is an example of an artificial neural network. Here I have an image of a bird that's fed into this artificial neural network, and the output from the network is a classification of this image into one of these three potential categories. In this case, if the ANN has been trained properly and we feed in this image, the ANN should correctly classify it as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision. And just as in the case of machine learning, there is a variety of algorithms available for deep learning, under the categories of supervised learning and unsupervised learning.

All right, so this is how we can categorize all of this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a subspecialization of machine learning using a particular architecture called an artificial neural network. And generative AI, so if we talk about ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI, they are basically large language models, and they are a further subcategory within the area of deep learning. And there are many applications of machine learning in industry right now, so pick whichever industry you are involved in, and these are the specific areas of application. I'm going to guess the vast majority of you watching this video are probably coming from the manufacturing industry, and in the manufacturing industry some of the standard use cases for machine learning and deep learning are predicting potential problems, which we sometimes call predictive maintenance, where you want to predict when a problem is going to happen and address it before it happens; monitoring systems; automating your manufacturing assembly line or production line; smart scheduling; and detecting anomalies on your production line.

Okay, so let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: it is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance. It uses advanced data models, analytics, and machine learning so that we can reliably assess when failures are more likely to occur, including which components are more likely to be affected on your production or assembly line. So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that factories, or the production and assembly lines in factories, tended to handle maintenance issues, say, 10 or 20 years ago.
What you would probably start off with is the most basic mode, which is reactive maintenance: you just wait until your machine breaks down and then you repair it. The simplest, but of course, if you have worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline. Then you're going to have a backlog of orders and you're going to run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem then is that sometimes you're doing more maintenance than is really necessary, and it still doesn't totally prevent a machine failure that occurs outside of your planned maintenance. So a bit of an improvement, but not that much better. And then these last two categories are where we bring in AI and machine learning. With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us to predict, with a reasonable level of accuracy, when the next failure is going to happen on your assembly or production line for a specific component or a specific machine. You want to predict it to a high level of accuracy, maybe down to the specific day, even the specific hour or minute, when you expect that particular product or machine to fail.

All right, so these are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data that you're getting.

Okay, so we're going to take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them, you'll be able to see some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases, so you can click on these links and follow up with them if you want to read more: waste management, manufacturing, building services, renewable energy, and also mining. If you want to know more about these use cases, you can read up on them from this website. And this next website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance. Here it describes an industry survey of predictive maintenance. We can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, and that predictive maintenance is essential for the manufacturing industry and will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: the vast majority of key industry players in the manufacturing sector consider predictive maintenance to be a very important activity that they want to incorporate into their workflow. And we can see here the kind of return we can expect on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. And best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes.
So we have PepsiCo with Frito-Lay, General Motors, Mondi, and Ecoplant. You can jump over here and take a look at some of these use cases. Let me try to open one up, for example, Mondi. You can see that Mondi has used this particular piece of software called MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. You can study how they have used it and how it works: what their challenge was, the problems they were facing, the solution they built with MathWorks Consulting, and the data they collected in an Oracle database. Using MATLAB from MathWorks, they were able to create a deep learning model to solve this particular issue for their domain. So if you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases of predictive maintenance.

Now in this topic, I'm going to talk about machine learning basics, what is actually involved in machine learning, and I'm going to give a very quick, conceptual, high-level overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category of machine learning, which is called supervised learning. The particular use case I'm discussing here, predictive maintenance, is basically a form of supervised learning. So how does supervised learning work? Well, in supervised learning, you create a machine learning model by providing what is called a labelled data set as input to a machine learning program or algorithm. This data set contains what are called independent or feature variables, so this will be a set of variables.
And there will be one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process that I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's jump to a diagram that illustrates this more clearly. Let's say you have a data set here, an Excel spreadsheet, and this spreadsheet has a bunch of columns and a bunch of rows. These rows are what we call observations, or samples, or data points in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall, so they've got all this information about the customers who purchase products at this mall. Some of the information they have about the customers is their gender, their age, their income, and their number of children. All this information about the customers is what we call the independent or feature variables. And based on all this information about each customer, we also record how much the customer spends. These numbers are what we call the target variable or the dependent variable. So a single row, one single sample or data point, contains all the values for the feature variables and one single value for the label or target variable. And the primary purpose of the machine learning model is to create a mapping from all your feature variables to your target variable: somehow there's going to be a mathematical function that maps all the values of your feature variables to the value of your target variable.
In other words, this function represents the relationship between your feature variables and your target variable. This whole training process is what we call fitting the model. And the target variable or label, this column here, is critical for providing the context to do the fitting or training of the model. Once you've got a trained and fitted model, you can then use it to make an accurate prediction of the target values corresponding to new feature values that the model has yet to encounter, and this, as I've said earlier, is called predictive analytics. So let's see what's actually happening here. You take your training data, this whole data set consisting of a thousand or ten thousand rows, you feed the entire thing into your machine learning algorithm, and a couple of hours later the algorithm comes up with a model. The model is essentially a function that maps all your feature variables, which are these four columns here, to your target variable, which is this one single column here. Once you have the model, you can put in a new data point. The new data point represents a new customer, a customer you have never seen before. Let's say you've already got information about 10,000 customers who have visited this mall and how much each of them spent while they were there. Now a totally new customer comes into the mall, someone who has never been here before, and what we know about this customer is that he is male, the age is 50, the income is 18, and he has nine children. When you take this data and feed it into your model, your model is going to make a prediction. It's going to say: based on everything I have been trained on and the model I've developed, I predict that a male customer of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want. That is the final output of your machine learning model: it makes a prediction about something it has never seen before. That is essentially the core of machine learning: predictive analytics, making predictions about the future based on a historical data set.

Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label. For classification you can have either binary or multiclass. Binary would be just true or false, zero or one: whether your machine is going to fail or not, just two classes, two possible outcomes, or whether the customer is going to make a purchase or not. We call this binary classification. Multiclass is when there are more than two classes or types of values. So, for example, this here would be a classification problem: you have a data set with information about your customers, the gender of the customer, the age of the customer, the salary of the customer, and you also have a record of whether the customer made a purchase or not. You can take this data set to train a classification model, and the classification model can then make a prediction about a new customer: it will predict zero, which means the customer didn't make a purchase, or one, which means the customer made a purchase.
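To make that train-then-predict loop concrete, here is a minimal sketch in Python using pandas and scikit-learn. The table, column names, and values are made up purely for illustration; they are not the actual data set from the slides.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A tiny labelled data set: three feature variables and one target variable ("purchased").
data = pd.DataFrame({
    "gender":    [0, 1, 0, 1, 1, 0, 1, 0],          # 0 = female, 1 = male (already encoded as numbers)
    "age":       [25, 50, 33, 41, 22, 60, 38, 29],
    "salary":    [3000, 8000, 4500, 7000, 2500, 9000, 5200, 3100],
    "purchased": [0, 1, 0, 1, 0, 1, 1, 0],           # the label: 1 = made a purchase, 0 = did not
})

X = data[["gender", "age", "salary"]]                # feature variables
y = data["purchased"]                                # target variable (label)

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)                                      # "fitting" / training the model

# A brand-new customer the model has never seen before.
new_customer = pd.DataFrame({"gender": [1], "age": [45], "salary": [6000]})
print(model.predict(new_customer))                   # e.g. [1] -> predicted to make a purchase
```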
And here is regression. Let's say you want to predict the wind speed, and you've got historical data for four other independent or feature variables: you have recorded the temperature, the pressure, the relative humidity, and the wind direction for the past 10 or 15 days. Now you are going to train your machine learning model using this data set, and the target variable column, the label, is basically a number. With a numerical target like this you get a regression model, and now you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will predict the wind speed for that new data point. So that's a regression model.

All right. In this topic I'm going to talk about the workflow that's involved in machine learning. In the previous slides, I talked about developing the model, but that's just one part of the entire workflow. In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is that you need to get your data, then you need to clean your data, and then you need to explore your data; you need to see what's going on in your data set. Real-life data sets are not trivial: they are hundreds, thousands, sometimes millions or even billions of rows, especially if you're using IoT sensors to capture data in real time. So you've got all these very large data sets, you need to clean them and explore them, and then you need to prepare them into the right format so that you can put them into the training process to create your machine learning model. Subsequently you check how good the model is: how accurate is the model in terms of its ability to generate predictions for the future, how accurate are the predictions coming out of your machine learning model. That's validating or evaluating your model. Then, if you determine that your model is of adequate accuracy to meet your domain use case requirements, say the required accuracy for your domain is 85%, and your machine learning model can give an 85% accuracy rate, you decide it's good enough and you deploy it into a real-world use case. Here the machine learning model gets deployed on a server, data is captured from other data sources and pumped into the machine learning model, the model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other particular scenario. And then you constantly monitor and update the model, you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell.

Here's another view of the same thing in a slightly different format. Again, you have your data collection and preparation. Here we talk more about the different kinds of algorithms that are available to create a model, and I'll cover this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. You are probably going to develop multiple models from multiple algorithms, evaluate them all, and then say: after I've evaluated and tested them, I've chosen the best model, and I'm going to deploy that model for real-life production use. Real-life sensor data is pumped into my model, my model generates predictions, the predicted data is used immediately in real time for real-life decision making, and then I monitor the results. Somebody is using the predictions from my model: if the predictions are lousy, the monitoring system captures that; if the predictions are fantastic, that is also captured by the monitoring system, and that gets fed back into the next cycle of my machine learning pipeline.
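As a rough sketch of that deploy-and-reuse step, here is one common pattern in Python: persist the chosen model to disk, then load it later in the service that receives live readings. The use of joblib, the file name, and the sensor columns are assumptions for illustration, not part of the original demo.

```python
import joblib
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the historical, prepared training data (illustrative values only).
X_train = pd.DataFrame({
    "rotational_speed": [1500, 1400, 1700, 1300],
    "torque":           [40, 60, 35, 70],
    "tool_wear":        [20, 210, 15, 230],
})
y_train = pd.Series([0, 1, 0, 1])                      # 0 = no failure, 1 = failure

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
joblib.dump(model, "maintenance_model.joblib")         # persist the chosen model

# ... later, in the production service that receives live sensor readings:
deployed = joblib.load("maintenance_model.joblib")
incoming = pd.DataFrame({"rotational_speed": [1650], "torque": [68], "tool_wear": [225]})
print(deployed.predict(incoming))                      # e.g. [1] -> flag this machine for maintenance
```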
Okay, so that's the overall view, and here are the key phases of the workflow. One of the important phases is called EDA, exploratory data analysis. In this phase you're going to do a lot of work, primarily just to understand your data set. Like I said, real-life data sets tend to be very complex and have various statistical properties; statistics is a very important component of machine learning. EDA helps you get an overview of your data set and of any problems in it, such as missing data, as well as its statistical properties, its distributions, the statistical correlations between variables, and so on.

Then we have data cleaning, or data cleansing. In this phase you primarily want to do things like remove duplicate records or rows in your table, make sure that your data points or samples have appropriate IDs, and, most importantly, make sure there are not too many missing values in your data set. What I mean by missing values is something like this: you have a data set, and for some reason there are some cells or locations in it with missing values. If you have a lot of these missing values, you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set and how to handle them. Another important part of data cleansing is finding the outliers in your data set. Outliers are data points that lie very far from the general trend of the data points in your data set. There are several ways to detect outliers and several ways to handle them, and similarly there are several ways to handle missing values. Handling missing values and handling outliers are really the two key concerns of data cleansing, and there are many, many techniques for this, so a data scientist needs to be acquainted with all of them.

All right, why do I need to do data cleansing? Here is the key point: if you have a very poor-quality data set, meaning you've got a lot of outliers that are errors, or a lot of missing values, then even with a fantastic algorithm and a fantastic model, the predictions your model gives will be absolutely rubbish. It's like pouring water into the tank of a Mercedes-Benz. The Mercedes-Benz is a great car, but if you put water into it, it will just die; it can't run on water. On the other hand, a Myvi is a cheap, basic car, but if you fill it with good, high-octane petrol, the Myvi will go at a hundred miles an hour and completely outrun the Mercedes-Benz. So it doesn't really matter which model you're using: you can be using the most fantastic model, the Mercedes-Benz of machine learning, but if your data is of lousy quality, your predictions are also going to be rubbish. Cleansing the data set is in fact probably the most important thing that data scientists need to do, and it's what they spend most of their time doing. Building the model, training the model, choosing the right algorithms and so on is really a small portion of the actual machine learning workflow; the vast majority of the time goes into cleaning and organizing your data.
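For a feel of what those cleaning checks look like in code, here is a short pandas sketch on a hypothetical sensor table; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sensor readings: one missing value per column, one suspicious reading.
df = pd.DataFrame({
    "temperature": [300.2, 301.0, None, 299.8, 305.5, 2999.0],
    "torque":      [40.1, 39.5, 41.2, None, 38.9, 40.3],
})

# Missing values: count them, then either drop the affected rows
# or replace them with the column average.
print(df.isna().sum())
dropped = df.dropna()
filled = df.fillna(df.mean(numeric_only=True))

# Outliers: a simple interquartile-range (IQR) rule on one column;
# other approaches (z-scores, isolation forests, ...) also exist.
q1, q3 = dropped["temperature"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (dropped["temperature"] < q1 - 1.5 * iqr) | (dropped["temperature"] > q3 + 1.5 * iqr)
print(dropped[is_outlier])   # the 2999.0 reading is flagged here
```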
Then you have something called feature engineering, where you pre-process the feature variables of your original data set prior to using them to train the model, either through the addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you need to transform categorical data into numeric data. In the earlier slides I showed you that you take your original data set, pump it into an algorithm, and a couple of hours later you get a machine learning model; you didn't do anything to the feature variables in your data set before feeding it into the machine learning algorithm. But that's not what generally happens in real life. In real life, you're going to take the original feature variables from your data set and transform them in some way. So you can see here, these are the columns of data from my original data set, and before I actually put these data points into my algorithm to train and get my model, I will transform them. The transformation of these feature variable values is what we call feature engineering, and there are many, many techniques for it: one-hot encoding, scaling, log transformation, discretization, date extraction, Boolean logic, and so on.
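Here is a brief sketch of a few of those transformations (one-hot encoding, a log transform, and scaling) using pandas and scikit-learn; the columns are placeholders loosely modelled on the kind of machine data discussed later, not the actual data set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# An illustrative table with one categorical and two numeric feature variables.
df = pd.DataFrame({
    "type":             ["L", "M", "H", "M"],        # categorical feature
    "rotational_speed": [1500, 1420, 1680, 1550],
    "tool_wear":        [10, 120, 45, 200],
})

# One-hot encoding: turn the categorical 'type' column into numeric 0/1 columns.
df = pd.get_dummies(df, columns=["type"])

# Log transformation: applied to the raw, positive-valued column to reduce skew.
df["log_tool_wear"] = np.log1p(df["tool_wear"])

# Scaling: put the numeric columns on a comparable scale (mean 0, standard deviation 1).
scaler = StandardScaler()
df[["rotational_speed", "tool_wear"]] = scaler.fit_transform(df[["rotational_speed", "tool_wear"]])

print(df)
```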
Then finally we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set and the other is called the test data set. The primary purpose of this is that when we fit and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you are not going to have only one possible model: there is a vast range of algorithms you can use to create a model, so fundamentally you have a wide range of choices. It's like the wide range of cars available if you want to buy a car: you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a BMW, many different cars. It's the same thing with a machine learning model: there is a vast variety of algorithms you can choose from in order to create a model. And once you create a model from a given algorithm, you need to ask how accurate that model is, and different algorithms are going to create different models with different rates of accuracy. So the primary purpose of the test data set is to evaluate the accuracy of the model, to see whether the model I've created using this algorithm is adequate for me to use in a real-life production use case. So that's what it's all about: this is my original data set, I break it into my feature variable columns and my target variable column, and then I further break it into a training data set and a test data set. The training data set is used to train and create the machine learning model, and once the machine learning model is created, I then use the test data set to evaluate the accuracy of the machine learning model.
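A minimal sketch of that split-train-evaluate loop with scikit-learn might look like the following; the data here is synthetic, purely to show the mechanics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # 200 samples, 4 feature variables
y = (X[:, 0] + X[:, 1] > 0).astype(int)          # a made-up 0/1 target variable

# Hold back 20% of the data purely for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                      # train on the training set only

y_pred = model.predict(X_test)                   # evaluate on data the model has never seen
print("accuracy:", accuracy_score(y_test, y_pred))
```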
All right, and then finally we can see the different parts or aspects that go into a successful model: EDA is about 10%, data cleansing about 20%, feature engineering about 25%, selecting a specific algorithm about 10%, training the model from that algorithm about 15%, and finally evaluating the models and deciding which is the best model with the highest accuracy rate, that's about 20%.

All right, so we have reached the most interesting part of this presentation, which is the demonstration of an end-to-end machine learning workflow on a real-life data set that demonstrates the use case of predictive maintenance. For the data set for this particular use case, I've used a data set from Kaggle. For those of you who are not aware of it, Kaggle is the world's largest open-source community for data science and AI, and they have a large collection of data sets from various areas of industry and human endeavour, as well as a large collection of models that have been developed using these data sets. So here we have a data set for our particular use case, predictive maintenance. This is some information about the data set, and in case you do not know how to get there, this is the URL to click on. Once you are at the page for this data set, you can see all the information about it, and you can download the data set in CSV format.

So let's take a look at the data set. This data set has a total of 10,000 samples, and these are the feature variables: the type, the product ID, the air temperature, process temperature, rotational speed, torque, and tool wear, and this is the target variable. The target variable is what we are interested in: it is what we use to train the machine learning model and also what we are interested in predicting. The feature variables describe, or provide information about, a particular machine on the production or assembly line, so you might know the product ID, the type, the air temperature, process temperature, rotational speed, torque, and tool wear. Let's say you've got an IoT sensor system that's capturing all this data about a product or a machine on your production or assembly line, and you've also captured, for each specific sample, whether that sample experienced a failure or not. A target value of zero indicates that there's no failure, and we can see that the vast majority of data points in this data set are no-failure cases. Here we can see an example of a failure: a failure is marked as a one (positive) and no failure is marked as a zero (negative). Here we have one type of failure called a power failure, and if you scroll down the data set you see there are also other kinds of failures, like a tool wear failure, an overstrain failure, another power failure, and so on. So if you scroll through these 10,000 data points, or if you're familiar with using Excel to filter values in a column, you can see that in this particular column, the so-called target variable column, the vast majority of values are zero, which means no failure, and some of the rows have a value of one. And for those rows with a value of one, you are going to have different types of failure, like I said just now: power failure, tool wear failure, and so on.
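If you want to poke at the downloaded CSV yourself, a first look with pandas might resemble the following. The file name and the exact column names ("Target", "Failure Type", and so on) depend on the version of the Kaggle CSV you download, so treat them as placeholders.

```python
import pandas as pd

# Load the CSV downloaded from the Kaggle data set page.
df = pd.read_csv("predictive_maintenance.csv")

print(df.shape)          # roughly 10,000 rows and a handful of columns
print(df.head())         # product ID, type, temperatures, speed, torque, tool wear, target
print(df.isna().sum())   # quick check for missing values

# Class balance: the vast majority of rows should be "no failure" (0).
print(df["Target"].value_counts())
```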
Jupyter is a Python application that lets you create a Python machine learning program which builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it. Here we have a whole collection of Jupyter notebooks available, and you can select any one of them; all of these notebooks essentially process the data from this particular data set. On this Code page I have selected a specific notebook that I am going to walk through to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python programming language. The notebook I am going to use is this one here, and you can also get its URL from here.

Let's quickly do a quick revision of what we are trying to do: we are trying to build a machine learning classification model. We said there are two primary areas of supervised learning. One is regression, which is used to predict a numerical target variable, and the second is classification, which is what we are doing here: predicting a categorical target variable. In this particular example there are actually two ways we can frame the classification, either as binary classification or as multiclass classification. For binary classification we only classify the product or machine as either "it failed" or "it did not fail". If we go back to the data set I showed you just now and look at this target variable column, there are only two possible values: zero or one. Zero means no failure and one means failure. So this is an example of binary classification: only two possible outcomes, zero or one, didn't fail or failed. For the same data set we can also extend this into a multiclass classification problem. If we want to drill down further, we can say that not only is there a failure, there are actually different types of failure. So we have one class that is basically no failure, and then we have separate classes for the
different types of failure. You can have a power failure, you could have a tool wear failure, and, if we go down here, you could have an overstrain failure, and so on. So you can have multiple classes of failure in addition to the overall majority class of no failure, and that would be a multiclass classification problem. With this data set we are going to see how to frame it both as a binary classification problem and as a multiclass classification problem.

Okay, so let's look at the workflow. Let's say we have already got the data; this is the data set that we have. Let's assume we somehow managed to get this data from some IoT sensors that are monitoring real-time data in our production environment, on the assembly line: sensors reading data that gives us everything we have in this CSV file. So we have retrieved the data, and now we move on to the cleaning and exploration part of the machine learning life cycle. In the data cleaning part we are interested in checking for missing values and deciding how to handle them. We can remove the rows with missing values, or we can fill in replacement values, which could be, for example, the average of all the values in that particular column. We also try to identify outliers in our data set, and again there are a variety of ways to deal with those. This is called data cleansing, and it is a really important part of your machine learning workflow. So that is where we are now: we are doing cleansing, and then we will follow up with exploration.
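As a minimal sketch of the kind of cleansing steps described here, assuming the DataFrame df from the earlier snippet, missing values could be checked and handled roughly like this; the choice between dropping rows and filling with a column mean is illustrative, not necessarily what this particular notebook does, and the column name is an assumed example.

```python
# Count missing values per column to see whether cleansing is needed.
print(df.isna().sum())

# Option 1: drop any rows that contain missing values.
df_dropped = df.dropna()

# Option 2: fill missing numeric values with the column mean instead
# (the column name 'Air temperature [K]' is an assumed example).
df_filled = df.copy()
df_filled["Air temperature [K]"] = df_filled["Air temperature [K]"].fillna(
    df_filled["Air temperature [K]"].mean()
)
```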
So let's look at the actual code that does the cleansing. Here we are right at the start of the machine learning life cycle, in the Jupyter notebook, and at the top we have a brief description of the problem statement: this data set reflects real-life predictive maintenance as encountered in industry, with measurements from real equipment, and the feature descriptions are taken directly from the data source. Here we have a description of the six key features in our data set: the type, which is the quality of the product, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. Those are the six feature variables, and then there are the two target variables. As I showed you just now, one target variable has only two possible values, zero or one, meaning no failure or failure; that is this column here (let me scroll all the way back up: this column, as we already saw, only has two values, zero or one). Then we also have this other column, which is the failure type, and as I demonstrated just now it has several categories, or types, of failure, which is what we call multiclass classification. So we can either build a binary classification model for this problem domain or a multiclass classification model, and this Jupyter notebook is going to demonstrate both approaches.

As a first step, we write the Python code that imports all the libraries we need: it imports the relevant machine learning libraries for our domain use case. Then we load in our data set, describe it, and get some quick insights; we take a look at the feature variables and so on. What we are doing at this point is just a quick overview of the data set. All of this Python code allows us, the data scientists, to get a quick overview: how many rows there are, how many columns, what the data types of the columns are, what the column names are, and so on. Then we zoom in on the target variables: we look at how many counts there are of each target value and how many different types of failure there are.
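A rough sketch of that overview step, again assuming the DataFrame df from before and the column names Target and Failure Type used on the Kaggle page:

```python
# Quick structural overview: row/column counts, dtypes, basic statistics.
df.info()
print(df.describe())

# Zoom in on the two target columns: class counts for the binary target
# and for the failure-type categories (column names are assumptions).
print(df["Target"].value_counts())
print(df["Failure Type"].value_counts())
```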
Then you want to check whether there are any inconsistencies between the Target and the Failure Type columns. When you do all this checking you are going to discover that there are some discrepancies in the data set. Using a specific bit of Python code to do the checking, you find that there are some errors here: there are nine rows that are classified as a failure in the Target variable but as "no failure" in the Failure Type variable, which means there is a discrepancy in those data points. These are the rows with discrepancies: the target variable says one, and we already know that a target of one is supposed to mean a failure, so we expect to see a failure classification, yet some rows say there is no failure even though the target is one. This is a classic example of an error that can very well occur in a data set, and the question is what you do with these errors. Here the data scientist decides it makes sense to remove those instances, so they write some code to remove those rows, or data points, from the overall data set. We then check for other issues in the same way; we find another issue, another warning, and again we can remove those rows. In total you remove 27 instances, or rows, from the overall data set. The data set has 10,000 rows, so removing 27 is only about 0.27% of the entire data set, and these were the reasons they were removed. Removing roughly 0.27% of the data is no big deal, but you did need to remove them, because these 27 erroneous data points could really affect the training of your machine learning model. So this is data cleansing: we are cleansing out data that is incorrect or erroneous in the original data set.
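As a hedged sketch of how that consistency check and removal might look in pandas (the exact logic in the notebook may differ, and the column names are again the assumed Kaggle names):

```python
# Rows flagged as a failure in the binary target but labelled
# 'No Failure' in the failure-type column are inconsistent.
inconsistent = df[(df["Target"] == 1) & (df["Failure Type"] == "No Failure")]
print(f"Found {len(inconsistent)} inconsistent rows")

# Drop the inconsistent rows; at roughly 27 rows out of 10,000 (~0.27%)
# this barely changes the data set but avoids confusing the model.
df = df.drop(index=inconsistent.index).reset_index(drop=True)
```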
Then we go on to the next part, which is called EDA, exploratory data analysis. EDA is where we explore our data: we want a visual overview of the data as a whole, and we also look at its statistical properties, the statistical distribution of the data in the various columns, the correlation between the different feature variables, and the correlation between the feature variables and the target variable. All of this is called EDA, and in a machine learning workflow it is typically done through visualization. So let's go back and take a look. For example, here we are looking at correlation: we plot the values of the various feature variables against each other and look for potential correlations and patterns. All the different shapes that you see in this pair plot have different statistical meanings, and the data scientist has to visually inspect the pair plot and make some interpretations of the patterns they see. These are some of the insights that can be deduced from looking at those patterns: the torque and the rotational speed are highly correlated, the process temperature and the air temperature are highly correlated, and failures occur at extreme values of some features, and so on. Then you can plot other kinds of charts, such as a violin plot, to get new insights; for example, for torque and rotational speed you can see that most failures are triggered at values much lower or much higher than the mean of the non-failing samples. All of these visualizations are there, and a trained data scientist can inspect them and make insightful deductions from them: the percentage of failures, the correlation heat map between all the feature variables and the target variable, the product types, the percentage of each product type, and the percentage of failures with respect to product type. We can visualize that as well, and we see that certain product types have a higher ratio of failures than others; for example, M-type products tend to fail more than H-type products.
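A minimal sketch of the kind of EDA plots being described, using seaborn and matplotlib; the hue and column choices are assumptions, and the plots in the notebook may be configured differently.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pair plot of the numeric features, coloured by the binary target,
# to eyeball correlations and where the failures sit.
sns.pairplot(df, hue="Target", corner=True)
plt.show()

# Correlation heat map across the numeric columns.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()
```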
So we can create a vast variety of visualizations at the EDA stage, and the idea behind all of them is simply to give us some preliminary insight into our data set that helps us model it more correctly. We get some more insights into the data from all this visualization. Then we can plot the distributions, so we can see whether a feature follows a normal distribution or some other kind of distribution, and we can use box plots to see whether there are any outliers in the data set. From the box plots we can see that the rotational speed, among other features, has outliers, and we already saw that outliers are a problem you may need to tackle; dealing with them is part of data cleansing. So we may have to check where the potential outliers are, and we can analyse them from the box plot. But then we might say: yes, they are outliers, but maybe they are not really terrible outliers, so we can tolerate them, or maybe we do want to remove them. We can look at the mean and maximum values of these features with respect to product type, and how the extremes relate to the product type. The insight here is that about 4.87% of the instances are outliers, which is not really that much, and the outliers are not horrible, so we just leave them in the data set. For a different data set the data scientist could come to a different conclusion, and then they would do whatever they deem appropriate to cleanse the data set.
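For completeness, here is one common way to quantify outliers with the interquartile range; this is a generic sketch of the idea, not necessarily the exact rule the notebook applies, and the column name is an assumption.

```python
# Flag outliers in a numeric column using the 1.5 * IQR rule.
col = "Rotational speed [rpm]"   # assumed column name
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = df[(df[col] < lower) | (df[col] > upper)]
print(f"{len(outliers) / len(df) * 100:.2f}% of rows are outliers in {col}")
```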
Now that we have done all the EDA, the next thing we are going to do is what is called feature engineering. We are going to transform our original feature variables (these are the original feature variables) into some other form before we feed them into our machine learning algorithm for training. So, say this is an example of an original data set: these are some examples, and you don't have to use all of them, of what we call feature engineering, where you transform the original values of your feature variables into these transformed values. That is pretty much what we do here: we apply ordinal encoding and we scale the data, using min-max scaling. And then finally we come to the modelling, where we have to split our data set into a training data set and a test data set. Coming back to what we said earlier: before you train your model, you take your original data set, which is now a feature-engineered data set, and break it into two or more subsets. One is the training data set, which we use to fit and train a machine learning model; the second is the test data set, which we use to evaluate the accuracy of the model. So we have a training data set and a test data set, and we also need to sample: from the original data set we sample some points that go into the training set and some points that go into the test set. There are many ways to do this sampling. One way is stratified sampling, where we ensure that the same proportion of data from each stratum, or class, appears in the training and test data sets as in the original data set (remember that we have a multiclass classification problem), which is very useful for dealing with what is called an imbalanced data set. And here we do have an example of an imbalanced data set, in the sense that the vast majority of data points have the value zero for their target variable, and only an extremely small minority of the data points have the value one.
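Here is a minimal sketch of those preparation steps with scikit-learn: ordinal-encode the categorical Type column, min-max scale the numeric features, and make a stratified train/test split. The column lists, the L/M/H category order, and the 80/20 split are assumptions for illustration; the notebook's exact choices may differ.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Assumed feature/target column names from the Kaggle data set.
numeric_cols = ["Air temperature [K]", "Process temperature [K]",
                "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]
X = df[["Type"] + numeric_cols].copy()
y = df["Target"]

# Ordinal-encode the product quality (assumed categories L/M/H) and
# min-max scale the numeric columns (for simplicity the scaler is fitted
# on all rows here; in practice fit it on the training split only).
X["Type"] = OrdinalEncoder(categories=[["L", "M", "H"]]).fit_transform(X[["Type"]]).ravel()
X[numeric_cols] = MinMaxScaler().fit_transform(X[numeric_cols])

# Stratified split keeps the failure ratio the same in train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```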
A situation where the vast majority of values in your class, or target variable, column come from one class and only a tiny minority come from another class is what we call an imbalanced data set, and for an imbalanced data set we typically use a specific technique for the train/test split called stratified sampling. That is exactly what is happening here: we are doing a train/test split, and we are doing it as a stratified split.

And now we actually develop the models. We have our train/test split, and here is where we train the models. In terms of classification there is a whole bunch of possibilities: there are many different algorithms we can use to create a classification model, and these are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, and ensembles. All of these different algorithms will create different kinds of models, which will result in different accuracy measures, and it is the goal of the data scientist to find the model that gives the best accuracy when trained on the given data set. So let's head back to our machine learning workflow. Here I am basically creating a whole bunch of models: one is a random forest, one is balanced bagging, one is a boosting classifier, and one is an ensemble classifier. Using each of these algorithms I am going to fit, or train, a model, and then I am going to evaluate how good each of these models is. Here you can see the evaluation data, and this is the confusion matrix, which is another way of evaluating.
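A compact sketch of what training and comparing several classifiers like this could look like; the particular estimators and settings here are illustrative stand-ins rather than the exact configuration used in the notebook (BalancedBaggingClassifier comes from the separate imbalanced-learn package).

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import f1_score, roc_auc_score
from imblearn.ensemble import BalancedBaggingClassifier

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "balanced_bagging": BalancedBaggingClassifier(random_state=42),
    "boosting": GradientBoostingClassifier(random_state=42),
}
# A simple voting ensemble over the individual models.
models["ensemble"] = VotingClassifier(
    estimators=list(models.items()), voting="soft"
)

for name, model in models.items():
    model.fit(X_train, y_train)                   # train on the training split
    proba = model.predict_proba(X_test)[:, 1]     # predicted failure probabilities
    preds = model.predict(X_test)
    print(name,
          "F1:", round(f1_score(y_test, preds), 3),
          "ROC AUC:", round(roc_auc_score(y_test, proba), 3))
```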
Now we come to the key part, which is how I distinguish between all of these models. I have these different models, built with different algorithms, all trained on the same data set; how do I tell them apart? For that we have a whole set of common evaluation metrics for classification, which tell us how good a model is in terms of its classification accuracy. And when it comes to accuracy we actually have many different measures. You might think accuracy is just accuracy, it is either accurate or it is not, but it is not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. The confusion matrix, for example, tells us how many true positives there are (the value is positive and the prediction is positive), how many false positives (the value is negative but the model predicts positive), how many false negatives (the model predicts negative but the value is actually positive), and how many true negatives (the model predicts negative and the true value is also negative). This is called a confusion matrix, and it is one way we assess or evaluate the performance of a classification model. This one is for binary classification; we can also have a multiclass confusion matrix. Then we can measure things like accuracy, which is the true positives plus the true negatives, in other words the total number of correct predictions made by the model, divided by the total number of data points. And then you have other measures as well, such as recall, with its own formula, the F1 score, and something called the ROC curve. Without going too much into the detail of each of these, they are essentially different KPIs. Just as employees in a company have KPIs that measure how efficient or effective they are, the KPIs for your machine learning models are the ROC curve, the F1 score, recall, accuracy, and the confusion matrix.
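To make the formulas concrete, here is a small sketch that derives accuracy, precision, recall and F1 from the entries of a binary confusion matrix, using the predictions of one of the models above (which model is used is arbitrary, purely for illustration).

```python
from sklearn.metrics import confusion_matrix

preds = models["random_forest"].predict(X_test)

# For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # correct predictions / all predictions
precision = tp / (tp + fp)                    # of predicted failures, how many were real
recall    = tp / (tp + fn)                    # of real failures, how many were caught
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```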
So fundamentally, after I have built my four different models, I am going to check and evaluate them using all of those different metrics, such as the F1 score, the precision score, and the recall score. For this model I can check the ROC score, the F1 score, the precision score, and the recall score; then for the next model the same set of scores, and so on. For every single model I have created from my training data set I have a set of evaluation metrics I can use to judge how good that model is. The same goes for the confusion matrix: I have one for each model, so I can use those to compare the four models as well, and then I summarise it all here. From this summary we can see that the top two models, which the data scientist will now focus on, are the bagging classifier and the random forest classifier: they have the highest F1 scores and the highest ROC curve scores, so we can say these are the top two models in terms of accuracy, using the F1 evaluation metric and the ROC AUC evaluation metric. These results are summarised here.

Then we try different sampling techniques. I talked just now about different kinds of sampling; the idea is to account for how the data is distributed across the different classes so that our evaluation of accuracy is statistically sound. We can do what is called oversampling and undersampling, which is very useful when you are working with an imbalanced data set, and this is an example of doing exactly that.
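The notebook's references to Tomek links and borderline SMOTE suggest the imbalanced-learn library; a minimal sketch of over- and undersampling with it might look like this (which sampler is paired with which classifier is an assumption made purely for illustration).

```python
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.under_sampling import TomekLinks

# Oversampling: synthesise extra minority-class (failure) samples
# near the class boundary.
X_over, y_over = BorderlineSMOTE(random_state=42).fit_resample(X_train, y_train)

# Undersampling: drop majority-class samples that form Tomek links
# with minority samples, cleaning up the class boundary.
X_under, y_under = TomekLinks().fit_resample(X_train, y_train)

print(y_train.value_counts(), y_over.value_counts(), y_under.value_counts(), sep="\n")
```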
We then check the results for all of these different techniques using the F1 score and the ROC AUC score, the two key measures of accuracy here, and compare the scores across the different approaches. We can see that overall these models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC score, but its F1 score was too low. In the data scientist's opinion, the random forest with this particular sampling technique strikes an equilibrium between the F1 and ROC AUC scores. The takeaway is that the macro F1 score improves dramatically using the sampling techniques, so these models might be better compared to the balanced ones. So, based on all of this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons.

We keep refining the evaluation work: we train the models one more time, again doing a train/test split, and for each approach we print out what is called a classification report. This is basically a summary of all the metrics I talked about just now; remember I said there were several evaluation metrics, the confusion matrix, the accuracy, the precision, the recall, the ROC AUC score. With the classification report I get a summary of all of that, and I can see all the values here for this particular model, the bagging classifier with Tomek links; then I can do the same for another model, the random forest with borderline SMOTE, and then for another, the balanced bagging. So again we see a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us. Then again we have confusion matrices: we generate one for the bagging with Tomek links undersampling, one for the random forest with borderline SMOTE oversampling, and one for the balanced bagging by itself, and we compare these three models using the confusion matrices and come to some conclusions.
1:01:08.680,1:01:11.200 about so this one is a kind of a curve 1:01:11.200,1:01:12.520 you look at it to see the area 1:01:12.520,1:01:14.359 underneath the curve this is called AOC 1:01:14.359,1:01:18.079 R area under the curve sorry Au Au R 1:01:18.079,1:01:19.880 area under the curve all right so the 1:01:19.880,1:01:21.839 area under the curve uh 1:01:21.839,1:01:24.319 score will give us some idea about the 1:01:24.319,1:01:25.599 threshold that we're going to use for 1:01:25.599,1:01:27.680 classif ification so we can examine this 1:01:27.680,1:01:29.200 for the bagging classifier for the 1:01:29.200,1:01:30.960 random forest classifier for the balance 1:01:30.960,1:01:33.599 bagging classifier okay then we can also 1:01:33.599,1:01:36.200 again do that uh finally we can check 1:01:36.200,1:01:37.880 the classification report of this 1:01:37.880,1:01:39.680 particular model so we keep doing this 1:01:39.680,1:01:43.200 over and over again evaluating this m 1:01:43.200,1:01:45.720 The Matrix the the accuracy Matrix the 1:01:45.720,1:01:46.880 evaluation Matrix for all these 1:01:46.880,1:01:48.880 different models so we keep doing this 1:01:48.880,1:01:50.520 over and over again for different 1:01:50.520,1:01:53.440 thresholds or for classification and so 1:01:53.440,1:01:56.880 as we keep drilling into these we kind 1:01:56.880,1:02:00.839 of get more and more understanding of 1:02:00.839,1:02:02.799 all these different models which one is 1:02:02.799,1:02:04.760 the best one that gives the best 1:02:04.760,1:02:08.520 performance for our data set okay so 1:02:08.520,1:02:11.440 finally we come to this conclusion this 1:02:11.440,1:02:13.520 particular model is not able to reduce 1:02:13.520,1:02:15.279 the record on failure test than 1:02:15.279,1:02:17.520 95.8% on the other hand balance begging 1:02:17.520,1:02:19.400 with a decision thresold of 0.6 is able 1:02:19.400,1:02:21.520 to have a better recall blah blah blah 1:02:21.520,1:02:25.319 Etc so finally after having done all of 1:02:25.319,1:02:27.480 this evalu ations 1:02:27.480,1:02:31.119 okay this is the conclusion 1:02:31.119,1:02:33.960 so after having gone so right now we 1:02:33.960,1:02:35.279 have gone through all the steps of the 1:02:35.279,1:02:37.760 Machining learning life cycle and which 1:02:37.760,1:02:40.240 means we have right now or the data 1:02:40.240,1:02:41.960 scientist right now has gone through all 1:02:41.960,1:02:43.000 these 1:02:43.000,1:02:47.079 steps uh which is now we have done this 1:02:47.079,1:02:48.640 validation so we have done the cleaning 1:02:48.640,1:02:50.559 exploration preparation transformation 1:02:50.559,1:02:52.599 the future engineering we have developed 1:02:52.599,1:02:54.359 and trained multiple models we have 1:02:54.359,1:02:56.480 evaluated all these different models so 1:02:56.480,1:02:58.599 right now we have reached this stage so 1:02:58.599,1:03:02.720 at this stage we as the data scientist 1:03:02.720,1:03:05.480 kind of have completed our job so we've 1:03:05.480,1:03:08.119 come to some very useful conclusions 1:03:08.119,1:03:09.640 which we now can share with our 1:03:09.640,1:03:13.240 colleagues all right and based on this 1:03:13.240,1:03:15.400 uh conclusions or recommendations 1:03:15.400,1:03:17.160 somebody is going to choose a 1:03:17.160,1:03:19.160 appropriate model and that model is 1:03:19.160,1:03:22.640 going to get deployed for realtime use 1:03:22.640,1:03:25.319 in a real life production environment 1:03:25.319,1:03:27.240 okay and that 
So finally, after having done all of these evaluations, we come to the conclusion. At this point we have gone through all the steps of the machine learning life cycle: the data scientist has done the cleaning, exploration, preparation and transformation, the feature engineering, has developed and trained multiple models, and has evaluated all of those models, so we have now reached this stage. At this stage the data scientist has essentially completed the job and has arrived at some very useful conclusions that can now be shared with colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment. That decision is going to be made based on the recommendations coming from the data scientist at the end of this phase, and these are the conclusions the data scientist comes up with. If the engineering team is looking for the highest failure detection rate possible, then they should go with this particular model. If they want a balance between precision and recall, then they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold. But if they don't care so much about catching every single failure and instead want the highest precision possible, then they should opt for the bagging Tomek links classifier with a somewhat higher decision threshold. This is the key thing the data scientist delivers; it is the key takeaway, the end result of the entire machine learning life cycle. The data scientist now asks the engineering team: which is more important to you, point A, point B or point C? Make your decision. The engineering team will discuss among themselves and say: what we want is the highest failure detection rate possible, because any failure of that machine or product on the assembly line is really going to hurt us. We are not so worried about precision, but we want to make sure that if there is a failure, we catch it. So the data scientist will say: go for the balanced bagging model. The data scientist then saves this model, and once it is saved you can go right ahead and deploy it to production.
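Saving the chosen model for deployment is typically a one-liner; here is a sketch with joblib (the file name is an arbitrary choice for illustration).

```python
import joblib

# Persist the chosen model so the serving environment can load it later.
joblib.dump(models["balanced_bagging"], "predictive_maintenance_model.joblib")

# Later, on the production server:
model = joblib.load("predictive_maintenance_model.joblib")
```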
And if we want to, we can actually take this modelling exercise further. Just now I modelled the problem as a binary classification problem, which means the outcome is either zero or one, either fail or not fail. But we can also model it as a multiclass classification problem, because, as I said earlier, the failure type column actually contains multiple kinds of failure: for example you may have a power failure, a tool wear failure, or an overstrain failure. So we model the problem slightly differently, as a multiclass classification problem, and then we go through the same end-to-end process we went through just now: we create different models and test them out, but now the confusion matrix is for a multiclass classification problem. We again try different algorithms or models, for example a random forest and a balanced random forest, again do the train/test split on these different models, and train them using what is called hyperparameter tuning, with a grid search. Then we get the same kinds of evaluation scores, compare them, generate a confusion matrix, which is now a multiclass confusion matrix, and come to the final conclusion.
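A sketch of what hyperparameter tuning with a grid search could look like for the multiclass variant, using the failure type as the target; the parameter grid and the scoring choice are illustrative assumptions, not the notebook's exact settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Multiclass target: the failure type rather than the 0/1 flag
# (column name is the assumed Kaggle name).
y_multi = df["Failure Type"]

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",   # macro F1 treats every failure class equally
    cv=5,
)
search.fit(X, y_multi)
print(search.best_params_, search.best_score_)
```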
So if you are interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist. The data scientist will say: I am going to pick this particular model, the balanced bagging classifier, and these are all the reasons given as the rationale for selecting it. Once that is done, you save the model, and that's it. The machine learning model can now go live: you run it on the server and it is ready to work, which means it is ready to generate predictions, and that is the main job of the machine learning model. You have picked the best model, with the best evaluation metrics for whatever accuracy goal you are trying to achieve, and now you run it on a server. All the real-time data coming from your sensors is pumped into the machine learning model, the model pumps out a whole bunch of predictions, and you use those predictions in real time to make real-time, real-world decisions. You might say: I am predicting that this machine is going to fail on Thursday at 5:00 p.m., so you had better get your service folks in to service it on Thursday at 2:00 p.m. In other words, you can decide when you want to do your maintenance and make the best decisions to optimise the cost of maintenance, and so on.
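A minimal sketch of that real-time scoring step: load the saved model and score an incoming sensor reading. The reading here is a made-up example, and in practice it would need the same encoding and scaling applied to the training data before prediction.

```python
import pandas as pd
import joblib

model = joblib.load("predictive_maintenance_model.joblib")

# One incoming sensor reading (values already encoded/scaled the same
# way as the training features; this is a hypothetical example).
reading = pd.DataFrame([{
    "Type": 1.0,                     # ordinal-encoded product quality
    "Air temperature [K]": 0.55,
    "Process temperature [K]": 0.60,
    "Rotational speed [rpm]": 0.20,
    "Torque [Nm]": 0.85,
    "Tool wear [min]": 0.90,
}])

failure_probability = model.predict_proba(reading)[0, 1]
print(f"Predicted probability of failure: {failure_probability:.2f}")
```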
Then, based on the results coming back from the predictions (they may be good, they may be poor, they may be average), we are constantly monitoring how good or how useful the predictions generated by this real-time model running on the server are, and based on that monitoring we take in some new data and repeat this entire life cycle again. So this is fundamentally an iterative workflow: the data scientist is constantly taking in new data points, refining the model, perhaps picking a new model, and deploying the new model onto the server, and so on. And that is basically your machine learning workflow in a nutshell.

For this particular approach we have used a number of data science libraries from Python. We used pandas, the most basic data science library, which provides all the tools for working with raw data; NumPy, a high-performance library for complex array and matrix operations; matplotlib and seaborn, which are used in the EDA, the exploratory data analysis phase of machine learning, where you visualise your data; and scikit-learn, the machine learning library used to implement all the core machine learning algorithms. We have not used the deep learning libraries, because this is not a deep learning problem, but if you are working on a deep learning problem such as image classification, image recognition, object detection, natural language processing or text classification, then you would use TensorFlow and also PyTorch.

And lastly, this whole data science project that you just saw was developed in something called a Jupyter Notebook: all of this Python code, along with all the observations from the data scientist, was written and run in a Jupyter Notebook, which is the most widely used tool for interactively developing and presenting data science projects. So that brings me to the end of this presentation. I hope you found it useful and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. Thank you all so much for watching.