Hello everyone, my name is Victor. I'm your friendly neighborhood data scientist from DreamCatcher. In this presentation, I would like to talk about a specific industry use case of AI and machine learning, which is predictive maintenance.

I will be covering these topics, and feel free to jump forward to the specific part of the video where I talk about each of them. I'm going to start off with a general overview of AI and machine learning. Then I'll discuss the use case, which is predictive maintenance. I'll talk about the basics of machine learning and the machine learning workflow, and then we will come to the meat of this presentation, which is essentially a demonstration of the machine learning workflow from end to end on a real-life predictive maintenance domain problem. All right, so without any further ado, let's jump into it.

Let's start off with a quick overview of AI and machine learning. AI is a very general term: it encompasses the entire area of science and engineering related to creating software programs and machines that are capable of performing tasks that would normally require human intelligence. But AI is a catch-all term, so when we talk about applied AI, how we use AI in our daily work, we are really talking about machine learning. Machine learning is the design and application of software algorithms that are capable of learning on their own, without explicit human intervention. The primary purpose of these algorithms is to optimize performance on a specific task, and the main task you want to optimize performance on is making accurate predictions about future outcomes based on the analysis of historical data. So essentially machine learning is about making predictions about the future, or what we call predictive analytics.
There are many different kinds of algorithms available in machine learning, under the three primary categories of supervised learning, unsupervised learning, and reinforcement learning. Here we can see some of the different kinds of algorithms and their use cases in various areas of industry. We have various domain use cases for all these different kinds of algorithms, and we can see that different algorithms are suited to different use cases.

Deep learning is an advanced form of machine learning that's based on something called an artificial neural network, or ANN for short. It essentially simulates the structure of the human brain, whereby neurons interconnect and work together to process and learn new information. DL is the foundational technology for most of the popular AI tools you have probably heard of today. I'm sure you have heard of ChatGPT, unless you've been living in a cave for the past two years. ChatGPT is an example of what we call a large language model, and that's based on this technology called deep learning. Also, all the modern computer vision applications, where a computer program can classify, detect, or recognize images on its own, use this particular form of machine learning called deep learning. So here is an example of an artificial neural network. I have an image of a bird that's fed into this artificial neural network, and the output from the network is a classification of this image into one of three potential categories. If the ANN has been trained properly, when we feed in this image, it should correctly classify it as a bird. This is an image classification problem, which is a classic use case for an artificial neural network in the field of computer vision.
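As an editorial aside, here is a minimal sketch of what such a tiny three-class image classifier might look like in Python with Keras. This is my own illustration, not code from the presentation; the input size, layer widths, and the three class labels are all assumptions.

```python
# Minimal sketch: a tiny neural network that maps a 32x32 RGB image
# to one of three categories (e.g. bird / cat / dog). Illustrative only.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),           # the image pixels go in here
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # layers of interconnected "neurons"
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),     # one probability per category
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# model.fit(train_images, train_labels, epochs=5)   # train on labelled images
# model.predict(bird_image[None, ...])              # highest probability wins
```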
And just as with classical machine learning, there are a variety of algorithms available for deep learning, under the categories of supervised learning and also unsupervised learning.

All right, so this is how we can categorize all of this. You can think of AI as the general area of smart systems and machines. Machine learning is basically applied AI, and deep learning is a sub-specialization of machine learning using a particular architecture called an artificial neural network. And generative AI (ChatGPT, Google Gemini, Microsoft Copilot, all these examples of generative AI are basically large language models) is a further subcategory within the area of deep learning.

There are many applications of machine learning in industry right now, so pick whichever industry you are involved in, and these are the specific areas of application. I'm going to guess that the vast majority of you watching this video come from the manufacturing industry, and in manufacturing some of the standard use cases for machine learning and deep learning are predicting potential problems, sometimes called predictive maintenance, where you want to predict when a problem is going to happen and then address it before it happens. There is also monitoring systems, automating your manufacturing assembly or production line, smart scheduling, and detecting anomalies on your production line.

Okay, so let's talk about the use case here, which is predictive maintenance. What is predictive maintenance? Here's the long definition: it is an equipment maintenance strategy that relies on real-time monitoring of equipment conditions and data to predict equipment failures in advance.
This uses advanced data models, analytics, and machine learning, whereby we can reliably assess when failures are more likely to occur, including which components on your production or assembly line are more likely to be affected.

So where does predictive maintenance fit into the overall scheme of things? Let's talk about the standard way that factories, or the production and assembly lines in factories, tended to handle maintenance issues, say, 10 or 20 years ago. You would probably start off with the most basic mode, which is reactive maintenance: you just wait until your machine breaks down and then you repair it. The simplest approach, but if you have worked on a production line for any period of time, you know that reactive maintenance can give you a whole bunch of headaches, especially if the machine breaks down just before a critical delivery deadline. Then you're going to have a backlog of orders and run into a lot of problems. So we move on to preventive maintenance, where you regularly schedule maintenance of your production machines to reduce the failure rate. You might do maintenance once every month, once every two weeks, whatever. This is great, but the problem is that sometimes you're doing maintenance that's not really necessary, and it still doesn't totally prevent a failure of the machine that occurs outside of your planned maintenance. So it's a bit of an improvement, but not that much better. And then the last two categories are where we bring in AI and machine learning.
With machine learning, we're going to use sensors to do real-time monitoring of the data, and then using that data we're going to build a machine learning model which helps us predict, with a reasonable level of accuracy, when the next failure is going to happen on your assembly or production line, for a specific component or a specific machine. You want to predict to a high level of accuracy, maybe to the specific day, even the specific hour or minute, when you expect that particular product or machine to fail.

These are the advantages of predictive maintenance: it minimizes the occurrence of unscheduled downtime, gives you a real-time overview of the current condition of your assets, ensures minimal disruption to productivity, optimizes the time you spend on maintenance work, optimizes the use of spare parts, and so on. And of course there are some disadvantages, the primary one being that you need a specialized set of skills among your engineers to understand and create machine learning models that can work on the real-time data you're getting.

Okay, so we're going to take a look at some real-life use cases. There are a bunch of links here, and if you navigate to them, you'll be able to see some real-life use cases of machine learning in predictive maintenance. The IBM website gives you a look at five use cases, so you can click on these links and follow up with them if you want to read more: waste management, manufacturing, building services, renewable energy, and also mining. And this next website is a pretty good one; I would really encourage you to look through it if you're interested in predictive maintenance. It tells you about an industry survey of predictive maintenance.
We can see that a large portion of the manufacturing industry agreed that predictive maintenance is a real need to stay competitive, and that predictive maintenance is essential for the manufacturing industry and will gain additional strength in the future. This survey was done quite some time ago, and these were the results that came back: the vast majority of key industry players in the manufacturing sector consider predictive maintenance a very important activity that they want to incorporate into their workflow. And we can see here the kind of ROI expected on investment in predictive maintenance: a 45% reduction in downtime, 25% growth in productivity, 75% fault elimination, and a 30% reduction in maintenance cost. Best of all, if you really want to look at examples, there are all these different companies that have significantly invested in predictive maintenance technology in their manufacturing processes: PepsiCo, Frito-Lay, General Motors, Mondi, Ecoplant. You can jump over here and take a look at some of these use cases. Let me open one up, for example, Mondi. You can see Mondi has used this particular piece of software called MATLAB, from MathWorks, to do predictive maintenance for their manufacturing processes using machine learning. You can study how they have used it and how it works: what their challenge was, the problems they were facing, the solution they built with MathWorks Consulting, and the data they collected in an Oracle database. Using MATLAB from MathWorks, they were able to create a deep learning model to solve this particular issue for their domain.
So if you're interested, I strongly encourage you to read up on all these real-life customer stories that showcase use cases for predictive maintenance. Okay, so that's it for real-life use cases of predictive maintenance.

Now in this topic, I'm going to talk about machine learning basics, what is actually involved in machine learning, and give a very quick, conceptual, high-level overview. There are several categories of machine learning: supervised, unsupervised, semi-supervised, reinforcement, and deep learning. Let's talk about the most common and widely used category of machine learning, which is called supervised learning. The particular use case I'm discussing here, predictive maintenance, is basically a form of supervised learning.

So how does supervised learning work? In supervised learning, you create a machine learning model by providing what is called a labelled data set as input to a machine learning program or algorithm. This data set contains what are called the independent or feature variables, which will be a set of variables, and one dependent or target variable, which we also call the label. The idea is that the independent or feature variables are the attributes or properties of your data set that influence the dependent or target variable. This process I've just described is called training the machine learning model, and the model is fundamentally a mathematical function that best approximates the relationship between the independent variables and the dependent variable. That's quite a mouthful, so let's jump into a diagram that illustrates this more clearly. Let's say you have a data set here, an Excel spreadsheet, and this spreadsheet has a bunch of columns and a bunch of rows.
These rows are what we call observations, or samples, or data points in our data set. Let's assume this data set was gathered by a marketing manager at a retail mall. They've got all this information about the customers who purchase products at this mall: their gender, their age, their income, and their number of children. All this information about the customers, we call the independent or feature variables. And alongside all this information about each customer, we also record how much the customer spends. These numbers, we call the target variable, or the dependent variable. So a single row, one single sample or data point, contains all the data for the feature variables and one single value for the label or target variable.

The primary purpose of the machine learning model is to create a mapping from all your feature variables to your target variable. There's going to be a mathematical function that maps the values of your feature variables to the value of your target variable. In other words, this function represents the relationship between your feature variables and your target variable. This whole training process is what we call fitting the model. And the target variable or label, this column here, is critical for providing the context to do the fitting or training of the model.
Once you've got a trained and fitted model, you can then use the model to make an accurate prediction of the target values corresponding to new feature values that the model has yet to encounter, and this, as I said earlier, is called predictive analytics. So let's see what's actually happening here. You take your training data, this whole data set consisting of a thousand or ten thousand rows, you feed the entire data set into your machine learning algorithm, and a couple of hours later your machine learning algorithm comes up with a model. The model is essentially a function that maps all your feature variables, these four columns here, to your target variable, this one single column here.

Once you have the model, you can put in a new data point. The new data point represents data about a new customer, a customer you have never seen before. Let's say you've already got information about 10,000 customers who have visited this mall and how much each of them spent while they were there. Now a totally new customer comes into the mall, a customer who has never been here before, and what we know about this customer is that he is male, his age is 50, his income is 18, and he has nine children. When you take this data and feed it into your model, your model is going to make a prediction. It's going to say: hey, based on everything I have been trained on and the model I've developed, I predict that a male customer of age 50, with an income of 18 and nine children, is going to spend 25 ringgit at the mall. And this is it, this is what you want, right here: the final output of your machine learning model, a prediction about something it has never seen before. That is the core of machine learning: predictive analytics, making predictions about the future based on a historical data set.
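As an editorial aside, here is a minimal sketch of that train-then-predict flow in Python with scikit-learn. The tiny made-up table stands in for the 10,000-row customer data set, and the column names mirror the slide; none of this is the presenter's actual code.

```python
# Minimal sketch: train on historical customers, predict for a new one.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy stand-in for the mall data set (gender encoded as 0 = female, 1 = male).
df = pd.DataFrame({
    "gender":   [0, 1, 0, 1, 0],
    "age":      [25, 50, 33, 41, 60],
    "income":   [30, 18, 55, 22, 70],
    "children": [1, 9, 2, 0, 3],
    "spend":    [40, 25, 80, 30, 95],   # target: amount spent (ringgit)
})
X = df[["gender", "age", "income", "children"]]   # feature variables
y = df["spend"]                                   # target variable / label

model = RandomForestRegressor(random_state=0).fit(X, y)   # "fitting" the model

# A brand-new customer: male, age 50, income 18, nine children.
new_customer = pd.DataFrame([[1, 50, 18, 9]], columns=X.columns)
print(model.predict(new_customer))    # the model's predicted spend for this customer
```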
Okay, so there are two areas of supervised learning: regression and classification. Regression is used to predict a numerical target variable, such as the price of a house or the salary of an employee, whereas classification is used to predict a categorical target variable or class label. Classification can be either binary or multiclass. Binary means just two possible outcomes: true or false, zero or one. Is your machine going to fail or is it not going to fail? Is the customer going to make a purchase or not? We call this binary classification. Multiclass is when there are more than two classes or types of values. For example, this here is a classification problem. You have a data set with information about your customers, the gender, age, and salary of each customer, and you also have a record of whether the customer made a purchase or not. You can take this data set to train a classification model, and the model can then make a prediction about a new customer: it will predict zero, which means the customer won't make a purchase, or one, which means the customer will make a purchase. And this is regression: let's say you want to predict the wind speed, and you've got historical data on four other independent or feature variables, so you have recorded the temperature, pressure, relative humidity, and wind direction for the past 10 or 15 days. You then train your machine learning model using this data set, where the target variable column, the label, is basically a number. This is a regression model, so now you can put in a new data point, meaning a new set of values for temperature, pressure, relative humidity, and wind direction, and your machine learning model will predict the wind speed for that new data point. So that's a regression model.
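Here is a minimal sketch of both flavours in scikit-learn, again with invented toy numbers rather than real data: a binary classifier for the purchase example and a regressor for the wind-speed example.

```python
# Minimal sketch: binary classification vs. regression.
from sklearn.linear_model import LogisticRegression, LinearRegression

# Classification: did the customer purchase (1) or not (0)? Features: [age, salary].
X_cls = [[25, 30], [50, 18], [33, 55], [45, 60]]
y_cls = [0, 0, 1, 1]
clf = LogisticRegression().fit(X_cls, y_cls)
print(clf.predict([[40, 45]]))        # -> [0] or [1], a class label

# Regression: predict wind speed. Features: [temp, pressure, humidity, wind_dir].
X_reg = [[30, 1012, 70, 180], [28, 1008, 80, 90], [26, 1015, 60, 270]]
y_reg = [12.5, 8.0, 15.2]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[29, 1010, 75, 135]]))   # -> a number, the predicted wind speed
```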
All right. In this topic I'm going to talk about the workflow involved in machine learning. In the previous slides, I talked about developing the model, but that's just one part of the entire workflow. In real life, when you use machine learning, there's an end-to-end workflow involved. The first thing, of course, is that you need to get your data, then you need to clean your data, and then you need to explore your data: you need to see what's going on in your data set. Real-life data sets are not trivial; they are hundreds of rows, thousands of rows, sometimes millions or billions of rows, especially if you're using IoT sensors to collect data in real time. So you've got all these very large data sets; you need to clean them and explore them, and then you need to prepare them into the right format so that you can feed them into the training process to create your machine learning model. Subsequently you check how good the model is: how accurate is it in terms of its ability to generate predictions for the future? How accurate are the predictions coming out of your machine learning model? That's validating, or evaluating, your model. And then, subsequently, you determine whether your model is of adequate accuracy to meet whatever your domain use case requirements are.
Let's say the accuracy required for your domain use case is 85%. If my machine learning model can give an 85% accuracy rate, I think it's good enough, and I'm going to deploy it for real-world use. So here the machine learning model gets deployed on a server, data is captured from various other sources and pumped into the model, the model generates predictions, and those predictions are then used to make decisions on the factory floor in real time, or in any other particular scenario. And then you constantly monitor and update the model, you get more new data, and the entire cycle repeats itself. So that's your machine learning workflow in a nutshell.

Here's another view of the same thing in a slightly different format. Again, you have your data collection and preparation. Here we talk more about the different kinds of algorithms that are available to create a model, and I'll cover this in more detail when we look at the real-world example of an end-to-end machine learning workflow for the predictive maintenance use case. Once you have chosen the appropriate algorithm, trained your model, and selected the best trained model among the candidates (you are probably going to develop multiple models from multiple algorithms and evaluate them all), you're going to say: after evaluating and testing, I've chosen the best model, and I'm going to deploy it for real-life production use. Real-life sensor data is pumped into my model, my model generates predictions, the predicted data is used immediately in real time for real-life decision making, and then I monitor the results.
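To make that "good enough to deploy?" decision concrete, here is a minimal sketch, assuming scikit-learn plus joblib and a model and test set from the earlier steps; the 85% threshold comes from the talk's example, and the file name is my own placeholder.

```python
# Minimal sketch: evaluate the model, deploy only if it meets the domain threshold.
from joblib import dump
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test, model.predict(X_test))   # model, X_test, y_test assumed

if acc >= 0.85:                                       # domain requirement (85%)
    dump(model, "predictive_maintenance_model.joblib")  # ship this file to the server
    print(f"Deployed at {acc:.0%} accuracy")
else:
    print(f"Only {acc:.0%}: gather more data or try another algorithm")
```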
So somebody is using the predictions from my model. If the predictions are lousy, the monitoring system captures that; if the predictions are fantastic, that is also captured by the monitoring system, and that gets fed back into the next cycle of my machine learning pipeline. Okay, so that's the overall view, and here are the key phases of your workflow.

One of the important phases is called EDA, exploratory data analysis. In this phase, you're going to do a lot of work, primarily just to understand your data set. Like I said, real-life data sets tend to be very complex, and they have various statistical properties; statistics is a very important component of machine learning. EDA helps you get an overview of your data set and of any problems in it, like missing data, as well as the statistical properties of your data set: its distribution, the statistical correlation between variables in your data set, and so on.
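In practice, a first EDA pass is often just a handful of pandas one-liners. Here is a minimal sketch, assuming the data sits in a CSV file (the file name is a placeholder):

```python
# Minimal EDA sketch with pandas.
import pandas as pd

df = pd.read_csv("data.csv")
print(df.shape)                      # how many rows (samples) and columns
df.info()                            # column types and non-null counts
print(df.describe())                 # statistical summary of numeric columns
print(df.isnull().sum())             # missing values per column
print(df.corr(numeric_only=True))    # correlation between numeric variables
df.hist(figsize=(10, 8))             # distribution of each column (needs matplotlib)
```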
Then we have data cleaning, or sometimes you call it data cleansing. In this phase, you want to do things like remove duplicate records or rows in your table, make sure your data points or samples have appropriate IDs, and, most importantly, make sure there are not too many missing values in your data set. By missing values I mean things like this: you've got a data set, and for some reason some cells or locations in it are missing their values. If you have a lot of these missing values, you've got a poor-quality data set, and you're not going to be able to train a good machine learning model from it. So you have to figure out whether there are a lot of missing values in your data set and how to handle them. Another important part of data cleansing is figuring out the outliers in your data set. Outliers are data points that are very far from the general trend of data points in your data set. There are several ways to detect outliers in your data set, and several ways to handle them; similarly, there are several ways to handle missing values. Handling missing values and handling outliers are the two key concerns of data cleansing, and there are many, many techniques for this, so a data scientist needs to be acquainted with all of them.

Why do I need to do data cleansing? Well, here is the key point. If you have a very poor-quality data set, which means you've got a lot of outliers that are errors in your data set, or a lot of missing values, then even if you've got a fantastic algorithm and a fantastic model, the predictions your model gives will be absolute rubbish. It's kind of like putting water into the tank of a Mercedes-Benz. The Mercedes-Benz is a great car, but if you put water into it, it will just die; it can't run on water. On the other hand, a Myvi is a cheap, no-frills car, but if you put high-octane, good petrol into a Myvi, it will fly along at 100 miles an hour and completely outperform the water-filled Mercedes-Benz. So it doesn't really matter how fantastic the model you're using is: if your data is lousy quality, your predictions are also going to be rubbish.
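Here is a minimal sketch of the three cleaning moves just described (duplicates, missing values, outliers) using pandas; the column names are illustrative, and the 1.5 × IQR rule is one common outlier convention among several:

```python
# Minimal cleaning sketch: duplicates, missing values, IQR outliers.
df = df.drop_duplicates()                          # remove duplicate rows

df["age"] = df["age"].fillna(df["age"].median())   # impute a numeric column
df = df.dropna(subset=["income"])                  # or drop rows missing a key value

q1, q3 = df["income"].quantile([0.25, 0.75])       # 1.5*IQR rule: points far
iqr = q3 - q1                                      # outside the interquartile
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # range count as outliers
df = df[df["income"].between(low, high)]           # keep only the in-range rows
```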
So cleansing the data set is, in fact, probably the most important thing data scientists need to do, and it's what they spend most of their time doing. Building the model, training the model, getting the right algorithms, and so on is really a small portion of the actual machine learning workflow; the vast majority of the time goes into cleaning and organizing your data.

Then you have something called feature engineering, where you preprocess the feature variables of your original data set prior to using them to train the model, through addition, deletion, combination, or transformation of these variables. The idea is to improve the predictive accuracy of the model, and also, because some models can only work with numeric data, you need to transform categorical data into numeric data. In the earlier slides, I showed you taking your original data set, pumping it into an algorithm, and a couple of hours later getting a machine learning model; you didn't do anything to the feature variables in your data set before feeding it to the algorithm. But that's generally not what happens in real life. In real life, you're going to take all the original feature variables from your data set and transform them in some way. You can see here the columns of data from my original data set, and before I actually feed these data points into my algorithm to train and get my model, I will transform them. This transformation of the feature variable values is what we call feature engineering.
And there are many, many techniques for feature engineering: one-hot encoding, scaling, log transformation, discretization, date extraction, boolean logic, and so on.
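Here is a minimal sketch of three of those techniques, using pandas, scikit-learn, and NumPy; the column names ('type', 'rotational_speed', 'torque') are illustrative, loosely echoing the data set we'll use later:

```python
# Minimal feature engineering sketch: one-hot encoding, scaling, log transform.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.get_dummies(df, columns=["type"])     # one-hot encode a categorical column

scaler = StandardScaler()                     # rescale to mean 0, std dev 1
df[["rotational_speed"]] = scaler.fit_transform(df[["rotational_speed"]])

df["torque_log"] = np.log1p(df["torque"])     # log transform a skewed column
```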
Okay, then finally we do something called a train-test split, where we take our original data set and break it into two parts: one is called the training data set and the other is called the test data set. The primary purpose of this is that when we feed and train the machine learning model, we use the training data set, and when we want to evaluate the accuracy of the model, we use the test data set. This is a key part of your machine learning life cycle, because you're not going to have just one possible model; there is a vast range of algorithms you can use to create a model. Fundamentally you have a wide range of choices, like a wide range of cars: if you want to buy a car, you can buy a Myvi, a Perodua, a Honda, a Mercedes-Benz, an Audi, a BMW, many different cars are available to you. It's the same with a machine learning model: there is a vast variety of algorithms you can choose from in order to create a model. And once you create a model from a given algorithm, you need to ask how accurate the model created from this algorithm is, because different algorithms are going to create different models with different rates of accuracy. So the primary purpose of the test data set is to evaluate the accuracy of the model, to see whether the model I've created using this algorithm is adequate for use in a real-life production use case. So that's what it's all about: I take my original data set, I break it into my feature variable columns and my target variable column, and then I further break it into a training data set and a test data set. The training data set is used to train and create the machine learning model, and once the model is created, I use the test data set to evaluate its accuracy.

And then finally we can see the different parts or aspects that go into a successful model: EDA is about 10%, data cleansing about 20%, feature engineering about 25%, selecting a specific algorithm about 10%, training the model from that algorithm about 15%, and finally evaluating the models and deciding which is the best one with the highest accuracy rate, that's about 20%.
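Here is a minimal sketch of that split-train-evaluate loop with scikit-learn, assuming X holds the feature columns and y the target column; the 80/20 split and the choice of a random forest are just common defaults, not the only option:

```python
# Minimal sketch: split the data, train on one part, score on the held-out part.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)               # 80% train, 20% test

model = RandomForestClassifier().fit(X_train, y_train)  # train on the training set
acc = accuracy_score(y_test, model.predict(X_test))     # evaluate on unseen data
print(f"test accuracy: {acc:.2%}")
```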
So let's take a look at the dataset. This dataset has a total of 10,000 samples, and these are the feature variables: the type, the product ID, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear; and then this is the target variable. The target variable is what we are interested in: it is what we use to train the machine learning model, and it is also what we want to predict. The feature variables describe, or provide information about, a particular machine on the production line, on the assembly line: the product ID, the type, the air temperature, the process temperature, the rotational speed, the torque, the tool wear. So let's say you've got an IoT sensor system that is capturing all this data about a product or machine on your production or assembly line, and you have also captured, for each specific sample, whether that sample experienced a failure or not. A target value of zero indicates there was no failure, and we can see that the vast majority of data points in this dataset are no-failure. And here we can see an example of a failure: a failure is marked as a one, positive, and no failure is marked as a zero, negative. Here we have one type of failure, called a power failure, and if you scroll down the dataset you will see there are other kinds of failures too, like a tool wear failure, an overstrain failure, another power failure, and so on. If you scroll down through all 10,000 data points, or if you are familiar with using Excel to filter the values in a column, you can see that in this so-called target variable column the vast majority of values are zero, meaning no failure, while some rows have a value of one; and the rows with a value of one carry the different types of failure, like I said just now: power failure, tool wear failure, et cetera.
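You can get the same picture without scrolling by counting the label values; the column names 'Target' and 'Failure Type' are my assumption about how the Kaggle CSV is labelled:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")  # filename as before

# Column names are an assumption about the Kaggle CSV.
print(df["Target"].value_counts())        # 0 = no failure, 1 = failure
print(df["Failure Type"].value_counts())  # breakdown by failure category
```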
So we are going to go through the entire machine learning workflow with this dataset. To see an example of that, we go to the Code section here. If I click on the Code section, right down here we see what are called dataset notebooks. These are Jupyter notebooks; Jupyter is a Python application which allows you to create a Python machine learning program that builds your machine learning model, assesses or evaluates its accuracy, and generates predictions from it. So here we have a whole bunch of Jupyter notebooks available, and you can select any one of them; all of these notebooks essentially process the data from this particular dataset. On this Code page I have selected a specific notebook that I am going to run through, to demonstrate an end-to-end machine learning workflow using various machine learning libraries from the Python programming language. The particular notebook I am going to use is this one here, and you can also get the URL for that particular notebook from here. Okay, so let's quickly do a revision of what we are trying to do: we are trying to build a machine learning classification model. We said there are two primary areas of supervised learning: one is regression, which is used to predict a numerical target variable, and the second is classification, which is what we are doing here, predicting a categorical target variable. In this particular example we actually have two ways we can classify: either a binary classification or a multiclass classification. For binary classification, we are only going to classify the product or machine as either it failed or it did not fail. So if we go back to the dataset I showed you just now and look at the target variable column, there are only two possible values:
either zero or one; zero means there was no failure, one means there was a failure. So this is an example of binary classification: only two possible outcomes, zero or one, did not fail or failed. And then, for the same dataset, we can extend this and make it a multiclass classification problem. If we want to drill down further, we can say that not only is there a failure, there are actually different types of failure. So we have one category, or class, that is basically no failure, and then we have a category for each of the different types of failure: you can have a power failure, a tool wear failure, an overstrain failure, et cetera. So you have multiple classes of failure in addition to the overall majority class of no failure, and that makes it a multiclass classification problem. With this dataset we are going to see how to frame it both as a binary classification problem and as a multiclass classification problem. Okay, so let's look at the workflow, and let's say we have already got the data; this is the dataset we have. Assume we have somehow managed to collect this dataset from some IoT sensors that are monitoring real-time data in our production environment: on the assembly line, on the production line, we have sensors reading data, and that gives us everything in this CSV file. So we have already retrieved the data; now we go on to the cleaning and exploration part of the machine learning life cycle. Let's look at the data cleaning part first. In data cleaning we are interested in checking for missing values, and there are several things we can do with them: we can remove the rows with missing values, or we can put in replacement values, which could for example be the average of all the values in that particular column, et cetera.
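In pandas, those two missing-value strategies are one-liners; a small sketch (this particular Kaggle dataset may well have no missing values at all, so treat it as illustrative):

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

# How many missing values does each column have?
print(df.isnull().sum())

# Option 1: drop every row that contains a missing value.
df_dropped = df.dropna()

# Option 2: fill numeric gaps with the column average instead.
df_filled = df.fillna(df.mean(numeric_only=True))
```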
We also try to identify outliers in our dataset, and there are a variety of ways to deal with those as well. This is called data cleansing, and it is a really important part of your machine learning workflow; that's where we are now, doing cleansing, and then we will follow up with exploration. So let's look at the actual code that does the cleansing. Here we are right at the start of the machine learning life cycle, in a Jupyter notebook, and here we have a brief description of the problem statement: this dataset reflects real-life predictive maintenance as encountered in industry, with measurements from real equipment, and the feature descriptions are taken directly from the data source. So here we have a description of the six key features in our dataset: the type, which is the quality of the product, the air temperature, the process temperature, the rotational speed, the torque, and the tool wear. Those are the six feature variables, and then there are the two target variables. Just now I showed you one target variable which only has two possible values, zero or one, meaning no failure or failure; that is this column here. And then we also have this other column, which is the failure type, and as I already demonstrated, there are several categories, or types, of failure, which is what we call multiclass classification. So we can either build a binary classification model for this problem domain or a multiclass classification model, and this Jupyter notebook is going to demonstrate both approaches.
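In code, the two framings simply pick different label columns; a hedged sketch, with the column names and label strings assumed as before:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

# Binary target: 0 = no failure, 1 = failure.
y_binary = df["Target"]

# Multiclass target: 'No Failure', 'Power Failure', 'Tool Wear Failure', ...
y_multi = df["Failure Type"]

# The binary label can also be derived from the multiclass one.
y_binary_derived = (df["Failure Type"] != "No Failure").astype(int)
```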
So, first step: we write the Python code that imports all the libraries we need to use. This is basically Python code importing the relevant machine learning libraries for our domain use case. Then we load in our dataset, we describe it to get some quick insights, and we take a look at all the feature variables and so on. What we are doing now is just a quick overview of the dataset: all this Python code lets us, the data scientists, get a quick overview of our data, such as how many rows there are, how many columns, what the data types of the columns are, what the names of the columns are, et cetera. Then we zoom in on the target variables: we look at how many counts there are of each target value, how many different types of failure there are, and then we check whether there are any inconsistencies between the Target and the Failure Type columns. When you do all this checking, you are going to discover that there are some discrepancies in the dataset. Using some specific Python code to do the checking, you find there are errors here: there are nine values that are classified as a failure in the target variable but as no failure in the failure-type variable, which means there is a discrepancy in those data points. These are all the rows with discrepancies: the target variable says one, and we already know that a target value of one is supposed to mean a failure, so we expect to see a failure classification; but some rows actually say there is no failure even though the target is one. This is a classic example of an error that can very well occur in a dataset.
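That consistency check is a simple boolean filter; a hedged sketch, again assuming the column names and label strings of the Kaggle CSV:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

# Rows labelled as a failure overall but 'No Failure' in the type column.
inconsistent = (df["Target"] == 1) & (df["Failure Type"] == "No Failure")
print(inconsistent.sum(), "inconsistent rows found")

# Drop them, since the two labels contradict each other.
df = df[~inconsistent].reset_index(drop=True)
```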
So now the question is: what do you do with these errors in your dataset? Here the data scientist decides it would make sense to remove those instances, and so they write some code to remove those instances, those rows or data points, from the overall dataset. In the same way we can check for other issues, and we find another issue with the dataset, another warning, so again we can remove those rows. In total we remove 27 instances, or rows, from the overall dataset. The dataset has 10,000 rows, and we are removing 27, which is only 0.27% of the entire dataset, and these were the reasons for removing them. If you are only removing 0.27% of the entire dataset, that is no big deal; but you did need to remove them, because those 27 data points with errors could really affect the training of your machine learning model. So this is data cleansing: we are cleaning out data that is incorrect or erroneous in the original dataset. Then we go on to the next part, which is called EDA, exploratory data analysis. EDA is where we explore our data: we want to get a visual overview of the data as a whole, and also look at its statistical properties, the statistical distribution of the data in the various columns, the correlation between the feature variables, and the correlation between the feature variables and the target variable. All of this is called EDA, and in a machine learning workflow EDA is typically done through visualization. So let's go back and take a look. For example, here we are looking at correlation: we plot the values of the various feature variables against each other and look for potential correlations, patterns, and so on.
All the different shapes that you see in this pair plot have different statistical meanings, and the data scientist has to visually inspect the pair plot and make interpretations of the different patterns they see. These are some of the insights that can be deduced from the patterns: for example, the torque and rotational speed are highly correlated, the process temperature and air temperature are highly correlated, and failures occur at extreme values of some features, et cetera. Then you can plot other kinds of charts, such as a violin chart, to get further insights; for example, for torque and rotational speed you can see again that most failures are triggered at values much lower or much higher than the mean of the non-failing cases. All these visualizations are there, and a trained data scientist can look at them, inspect them, and make insightful deductions from them. Then there is the percentage of failures; the correlation heat map between all the different feature variables and the target variable; the percentage of product types; and the percentage of failures with respect to product type, which we can also visualize. Certain product types have a higher ratio of failures compared to other product types; for example, type M tends to fail more than type H products. So we can create a vast variety of visualizations in the EDA stage, and the idea of all this visualization is to give us some preliminary insight into our dataset that helps us model it more correctly.
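A minimal sketch of the kind of EDA plots being described here, using Matplotlib and Seaborn, with the column names assumed as before:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.read_csv("predictive_maintenance.csv")

# Pair plot of the numeric features, coloured by failure/no-failure;
# a random sample keeps the plot responsive.
sns.pairplot(df.sample(500, random_state=42), hue="Target")
plt.show()

# Correlation heat map over the numeric columns only.
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```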
Then we can plot the distributions, to see whether a variable follows a normal distribution or some other kind of distribution, and we can use box plots to see whether there are any outliers in the dataset. From the box plots we can see that the rotational speed, for example, has outliers, and we already saw that outliers are a problem you may need to tackle; dealing with outliers is part of data cleansing. So we may have to check where the potential outliers are, and we can analyze them from the box plot. But then we might say: well, they are outliers, but maybe they are not really horrible outliers, so we can tolerate them; or maybe we want to remove them. We can look at the mean and maximum values with respect to product type, and how many points sit beyond them, and so on. The insight here is that about 4.87% of the instances are outliers, and 4.87% is not really that much; the outliers are not horrible, so we just leave them in the dataset. For a different dataset, the data scientist could come to a different conclusion, and would then do whatever they deem appropriate to cleanse the dataset.
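One common way to quantify those box-plot outliers is the 1.5 x IQR whisker rule; a small sketch, where the column name is an assumption:

```python
import pandas as pd

df = pd.read_csv("predictive_maintenance.csv")

col = "Rotational speed [rpm]"  # column name is an assumption
q1, q3 = df[col].quantile([0.25, 0.75])
iqr = q3 - q1

# The same 1.5 * IQR whisker rule that box plots use.
outliers = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
print(f"{outliers.mean():.2%} of rows are outliers on {col}")
```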
Now that we have done all the EDA, the next thing we do is what is called feature engineering: before we feed the data in for training, we transform the original feature variables into some other form. These are some examples of feature engineering transformations; you don't have to use all of them. Here we do an ordinal encoding, and the dataset is scaled using min-max scaling. And then finally we come to the modelling, so we have to split our dataset into a training dataset and a test dataset. Coming back to this again: before you train your model, you take your original dataset, which by now is a feature-engineered dataset, and break it into two or more subsets: the training dataset, used to fit and train the machine learning model, and the test dataset, used to evaluate the accuracy of the model. So we have the training dataset and the test dataset, and we also need to sample: from the original dataset, some points are sampled into the training dataset and some into the test dataset. There are many ways to do this sampling; one way is stratified sampling, where we ensure the same proportion of data from each stratum, or class: you want each class to appear in the training and test datasets in the same proportion as in the original dataset, which is very useful for dealing with what is called an imbalanced dataset. And here we do have an imbalanced dataset, in the sense that the vast majority of data points have the value zero in their target variable column, and only an extremely small minority have the value one. A situation where the vast majority of values in the target variable column are from one class and a tiny minority are from another class is called an imbalanced dataset, and for an imbalanced dataset we typically use a specific technique for the train-test split called stratified sampling, and that is exactly what is happening here:
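Putting those steps together, here is a hedged sketch of the ordinal encoding, the min-max scaling, and a stratified split; the column names and the L/M/H quality codes are assumptions about the CSV:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("predictive_maintenance.csv")

# Ordinal-encode the product quality: low < medium < high (assumed codes).
df["Type"] = df["Type"].map({"L": 0, "M": 1, "H": 2})

feature_cols = ["Type", "Air temperature [K]", "Process temperature [K]",
                "Rotational speed [rpm]", "Torque [Nm]", "Tool wear [min]"]

# Scale every feature into [0, 1]. (In practice, fit the scaler on the
# training split only, to avoid leaking test-set statistics.)
X = MinMaxScaler().fit_transform(df[feature_cols])
y = df["Target"]

# stratify=y keeps the failure/no-failure ratio the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("train failure rate:", y_train.mean())
print("test failure rate :", y_test.mean())
```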
we are doing a train-test split here, and it is a stratified split. And now we actually develop the models: we have got the train-test split, and here is where we train the models. Now, for classification there is a whole range of possibilities; there are many, many different algorithms we can use to create a classification model. These are some of the more common ones: logistic regression, support vector machines, decision trees, random forests, bagging, balanced bagging, boosting, ensembles. All these different algorithms will create different kinds of models, which will result in different accuracy measures, and it is the goal of the data scientist to find the best model, the one that gives the best accuracy when trained on the given dataset. So let's head back to our machine learning workflow. Here, basically, I am creating a whole bunch of models: one is a random forest, one is balanced bagging, one is a boosting classifier, one is an ensemble classifier. I am going to fit, or train, a model with each of these algorithms, and then I am going to evaluate them, to see how good each of these models is; here you can see the evaluation data, and this is the confusion matrix, which is another way of evaluating. So now we come to the key part, which is: how do I distinguish between all these models? I have all these different models, built with different algorithms and trained on the same dataset; how do I tell them apart? For that we have a whole set of common evaluation metrics for classification, which tell us how good a model is in terms of its accuracy.
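Training that whole bunch of models can be as simple as a loop; a sketch using a few scikit-learn classifiers (balanced bagging lives in the separate imbalanced-learn package, so it is left out here):

```python
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test, y_test come from the stratified split above.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=42),
    "bagging": BaggingClassifier(random_state=42),
    "gradient boosting": GradientBoostingClassifier(random_state=42),
}

for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.3f}")
```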
In terms of accuracy we actually have many different measures. You might think accuracy is just accuracy, that a model is either accurate or it is not, but it is not that simple: there are many different ways to measure the accuracy of a classification model, and these are some of the more common ones. The confusion matrix tells us how many true positives there are (the value is positive and the prediction is positive), how many false positives (the value is negative but the model predicts positive), how many false negatives (the model predicts negative but the value is actually positive), and how many true negatives (the model predicts negative and the true value is also negative). This is called a confusion matrix, and it is one way we assess, or evaluate, the performance of a classification model. This one is for binary classification; we can also have a multiclass confusion matrix. Then we can measure things like accuracy: the true positives plus the true negatives, which is the total number of correct predictions made by the model, divided by the total number of data points in the dataset. And then there are other kinds of measures, such as recall (true positives divided by true positives plus false negatives) and the F1 score (the harmonic mean of precision and recall), and then there is something called the ROC curve. Without going too deep into what each of these entails, these are essentially all different KPIs: just as employees in a company have KPIs that measure how efficient or effective they are, the KPIs for your machine learning models are the ROC curve, the F1 score, recall, accuracy,
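All of those KPIs are one-liners in scikit-learn; a short sketch, reusing one of the fitted models from above:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

clf = models["random forest"]            # any fitted model from the loop above
y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))  # [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))

# ROC AUC is computed from the predicted probability of the positive class.
proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC  :", roc_auc_score(y_test, proba))
```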
and the confusion matrix. So fundamentally, after I have built my four different models, I am going to check and evaluate them using all those different metrics: the F1 score, the precision score, the recall score, and so on. For this model I can check the ROC score, the F1 score, the precision score, and the recall score; then for the next model the same set of scores; and so on. For every single model I have created with my training dataset, I have a full set of evaluation metrics I can use to judge how good that model is. Same thing here: I have a confusion matrix for each, so I can use that as well to compare these four different models. Then I summarize it all, and from the summary we can see the top two models, which, as a data scientist, I am now going to focus on. These two models are the bagging classifier and the random forest classifier: they have the highest F1 scores and the highest ROC curve scores, so we can say they are the top two models in terms of accuracy, using the F1 evaluation metric and the ROC AUC evaluation metric. These results are summarized here, and then we try different sampling techniques. Just now I talked about different kinds of sampling techniques; the idea is to get a feel for the different distributions of the data in different areas of your dataset, to make sure your evaluation of accuracy is statistically sound. So we can do what is called oversampling and undersampling, which is very useful when you are working with an imbalanced dataset, and this is an example of doing that.
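Oversampling and undersampling are usually done with the imbalanced-learn package; a minimal sketch, assuming it is installed (pip install imbalanced-learn):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority (failure) class with synthetic points...
X_over, y_over = SMOTE(random_state=42).fit_resample(X_train, y_train)

# ...or undersample the majority (no-failure) class instead.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_train, y_train)

print("original:", Counter(y_train))
print("SMOTE   :", Counter(y_over))
print("under   :", Counter(y_under))
```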
Then we again check the results for all these different techniques, using the F1 score and the AUC score, the two key measures of accuracy here, and compare the scores of the different approaches. We can see that overall these models have a lower ROC AUC score but a much higher F1 score; the bagging classifier had the highest ROC score, but its F1 score was too low. In the data scientist's opinion, the random forest with this particular sampling technique has an equilibrium between the F1 and AUC scores. So the takeaway is that the macro F1 score improves dramatically using the sampling techniques, so these models might be better compared to the balanced ones. Based on all this evaluation, the data scientist decides to continue working with these two models, plus the balanced bagging one, and to make further comparisons. So we keep refining the evaluation work: we train the models one more time, again doing a train-test split for each particular approach, and then we print out what is called a classification report. This is basically a summary of all those metrics I talked about just now; remember there were several evaluation metrics, the confusion matrix, accuracy, precision, recall, the AUC score. With the classification report I get a summary of all of that, and I can see all the values for this particular model, bagging with Tomek links; then I can do the same for another model, the random forest with borderline SMOTE, and then for another, the balanced bagging. So again there is a lot of comparison between different models, trying to figure out what all these evaluation metrics are telling us.
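The classification report itself is a single call; a small sketch, reusing the test labels and predictions from the metrics snippet above:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1 and support, in one table.
print(classification_report(y_test, y_pred,
                            target_names=["no failure", "failure"]))
```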
Then again we have a confusion matrix: we generate one for the bagging with Tomek links undersampling, one for the random forest with borderline SMOTE oversampling, and one for balanced bagging by itself, and again we compare these three models using the confusion matrix as the evaluation metric and come to some conclusions. Then we move on and look at another kind of evaluation metric, the ROC score. This is one of the other evaluation metrics I talked about: it is a curve, and you look at the area underneath it, which is called the AUC, the area under the curve. The AUC score gives us some idea about the threshold we should use for classification, and we can examine this for the bagging classifier, the random forest classifier, and the balanced bagging classifier. Finally, we can check the classification report of each particular model. So we keep doing this over and over again, evaluating the accuracy metrics for all these different models at different classification thresholds, and as we keep drilling in, we get more and more understanding of which of these models gives the best performance for our dataset. So finally we come to this conclusion: this particular model is not able to push the recall on failures beyond 95.8%; on the other hand, balanced bagging with a decision threshold of 0.6 is able to achieve a better recall, and so on. So, after having done all of these evaluations, this is the conclusion.
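Changing the decision threshold away from the default 0.5 is just a comparison against the predicted probabilities; a hedged sketch, continuing from the earlier snippets:

```python
from sklearn.metrics import recall_score, roc_curve

proba = clf.predict_proba(X_test)[:, 1]  # fitted classifier from above

# The full ROC curve: one (false-positive rate, recall) point per threshold.
fpr, tpr, thresholds = roc_curve(y_test, proba)

# Re-label the predictions with a custom 0.6 decision threshold.
y_pred_06 = (proba >= 0.6).astype(int)
print("recall at threshold 0.6:", recall_score(y_test, y_pred_06))
```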
So right now we, or rather the data scientist, have gone through all the steps of the machine learning life cycle: the cleaning, the exploration, the preparation and transformation, the feature engineering; we have developed and trained multiple models, and we have evaluated all of them. At this stage we, as the data scientist, have pretty much completed our job: we have come to some very useful conclusions which we can now share with our colleagues. Based on these conclusions, or recommendations, somebody is going to choose an appropriate model, and that model is going to get deployed for real-time use in a real-life production environment; that decision is made based on the recommendations coming from the data scientist at the end of this phase. And at the end of this phase, the data scientist comes up with conclusions like these: if the engineering team is looking for the highest failure detection rate possible, they should go with this particular model; if they want a balance between precision and recall, they should choose between the bagging model with a 0.4 decision threshold and the random forest model with a 0.5 threshold; but if they do not care so much about predicting every failure and want the highest precision possible, they should opt for the bagging Tomek links classifier with a somewhat higher decision threshold. This is the key thing the data scientist delivers, the key takeaway, the end result of the entire machine learning life cycle. The data scientist tells the engineering team: you decide which is more important for you, point A, point B, or point C. The engineering team will then discuss among themselves and say: hey, you know
what, what we want is to get the highest failure detection possible, because any kind of failure of that machine, or of the product on the assembly line, is really going to hurt us big time. So what we are looking for is the model that gives us the highest failure detection rate; we do not care so much about precision, but we want to be sure that if there is a failure, we are going to catch it. That is what they want, and so the data scientist says: then you go for the balanced bagging model. The data scientist saves this model, and once it has been saved, you can go right ahead and deploy it to production. And if we want to continue, we can take this modelling problem further. Just now I modelled this problem as a binary classification, either zero or one, fail or not fail; but we can also model it as a multiclass classification problem, because, as I said earlier, the failure-type column actually contains multiple kinds of failure: you may have a power failure, a tool wear failure, an overstrain failure. So we can model the problem slightly differently, as a multiclass classification problem, and then go through the same entire process we went through just now: we create different models and test them, but now the confusion matrix is for a multiclass classification. So we check them out, again trying different algorithms or models and again doing the train-test split; we have, for example, a balanced random forest and a random forest with a grid search, and we train the models using what is called hyperparameter tuning.
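Hyperparameter tuning with a grid search might look like this sketch; the parameter grid here is illustrative, not taken from the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid: a few forest sizes and depths to search over.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_macro",  # macro-averaged F1 suits multiclass targets
    cv=5,
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)
print("best CV macro F1:", search.best_score_)
```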
Then you get the scores: the same evaluation scores again, which you check and compare between the models, and you generate a confusion matrix, this time a multiclass confusion matrix, and then you come to the final conclusion. So if you are interested in framing your problem domain as a multiclass classification problem, these are the recommendations from the data scientist. The data scientist says: I am going to pick this particular model, the balanced bagging classifier, and these are all the reasons given as the rationale for selecting it. Once that is done, you save the model, and that's it; it is all done now. The machine learning model can now be put live, running on a server, and it is ready to work, which means it is ready to generate predictions; that is the main job of the machine learning model. You have picked the best machine learning model, with the best evaluation metrics for whatever accuracy goal you are trying to achieve, and now you run it on a server: all the real-time data coming from your sensors is pumped into the machine learning model, the model pumps out a whole bunch of predictions, and you use those predictions in real time to make real-time, real-world decisions. You might say: I am predicting that this machine is going to fail on Thursday at 5:00 p.m., so you had better get your service folks in to service it on Thursday at 2:00 p.m.,
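Saving the chosen model and reloading it on a server is typically done with joblib; a minimal hedged sketch, reusing the grid-search result from the previous snippet:

```python
import joblib

# Persist the winning model to disk (the filename is illustrative)...
joblib.dump(search.best_estimator_, "failure_model.joblib")

# ...then, on the server, reload it and score incoming sensor readings.
model = joblib.load("failure_model.joblib")
predictions = model.predict(X_test)  # X_test stands in for live sensor data
```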
So you can, you know, make decisions on when you want to do your maintenance, and make the best decisions to optimize the cost of maintenance, et cetera, et cetera.

And then, based on the results that come out of the predictions, so the predictions may be good, the predictions may be lousy, the predictions may be average, we are constantly monitoring how good or how useful the predictions generated by this real-time model running on the server are. And based on our monitoring, we will then take some new data and repeat this entire life cycle again; there's a small sketch of this monitoring idea after the library roundup below. So this is basically an iterative workflow, where the data scientist is constantly getting in all these new data points, refining the model, maybe picking a new model, deploying the new model onto the server, and so on. All right, so that's it; that is basically your machine learning workflow in a nutshell.

Okay, so for this particular approach we have used a bunch of data science libraries from Python. We have used pandas, which is the most basic data science library; it provides all the tools to work with raw data. We have used NumPy, which is a high-performance library for implementing complex array and matrix operations. We have used Matplotlib and Seaborn, which are used during the EDA, the exploratory data analysis phase of machine learning, where you visualize all your data. And we have used scikit-learn, which is the machine learning library that implements all your core machine learning algorithms. We have not used deep learning libraries here, because this is not a deep learning problem, but if you are working on a deep learning problem like image classification, image recognition, object detection, natural language processing, or text classification, then you're going to use these libraries from Python: TensorFlow and also PyTorch.
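As promised above, here is a minimal sketch of that monitoring idea. The needs_retraining helper and its threshold are hypothetical, not part of the demo: once the actual outcomes are known, compare them with what the live model predicted, and trigger another pass through the workflow when the failure-detection rate drops.

```python
# A minimal sketch (hypothetical helper and threshold) of monitoring the
# live model and deciding when to kick off another iteration of the workflow.
from sklearn.metrics import recall_score

def needs_retraining(y_actual, y_predicted, min_recall=0.90):
    """True when the live model's failure-detection rate (recall) has dropped
    below the agreed threshold, signalling a retraining cycle."""
    return recall_score(y_actual, y_predicted) < min_recall

# Toy example: the model missed one of the three real failures (recall = 2/3).
if needs_retraining(y_actual=[1, 0, 1, 1, 0], y_predicted=[1, 0, 0, 1, 0]):
    print("Recall below threshold - collect new data and retrain.")
```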
And then lastly, that whole data science project that you saw just now was actually developed in something called a Jupyter notebook. So all this Python code, along with all the observations from the data scientist for this entire data science project, was run in a Jupyter notebook, which is the most widely used tool for interactively developing and presenting data science projects.

Okay, so that brings me to the end of this entire presentation. I hope that you found it useful and that you can appreciate the importance of machine learning and how it can be applied to a real-life use case in a typical production environment. All right, thank you all so much for watching.