1
00:00:01,199 --> 00:00:03,760
Hello everyone, my name is Victor. I'm

2
00:00:03,760 --> 00:00:05,359
your friendly neighborhood data

3
00:00:05,359 --> 00:00:07,759
scientist from DreamCatcher. So in this

4
00:00:07,759 --> 00:00:10,160
presentation, I would like to talk about

5
00:00:10,160 --> 00:00:12,759
a specific industry use case of AI or

6
00:00:12,759 --> 00:00:15,069
machine learning which is predictive

7
00:00:15,069 --> 00:00:19,000
maintenance. So I will be covering these

8
00:00:19,000 --> 00:00:21,320
topics and feel free to jump forward to

9
00:00:21,320 --> 00:00:23,359
the specific part in the video where I

10
00:00:23,359 --> 00:00:25,160
talk about all these topics. So I'm going

11
00:00:25,160 --> 00:00:27,160
to start off with a general preview of

12
00:00:27,160 --> 00:00:29,080
AI and machine learning. Then, I'll

13
00:00:29,080 --> 00:00:30,840
discuss the use case which is predictive

14
00:00:30,840 --> 00:00:32,719
maintenance. I'll talk about the basics

15
00:00:32,719 --> 00:00:34,800
of machine learning, the workflow of

16
00:00:34,800 --> 00:00:37,239
machine learning, and then we will come

17
00:00:37,239 --> 00:00:40,760
to the meat of this presentation which

18
00:00:40,760 --> 00:00:43,680
is essentially a demonstration of the

19
00:00:43,680 --> 00:00:45,399
machine learning workflow from end to

20
00:00:45,399 --> 00:00:47,579
end on a real life predictive

21
00:00:47,579 --> 00:00:51,520
maintenance domain problem. All right, so

22
00:00:51,520 --> 00:00:53,640
without any further ado, let's jump into

23
00:00:53,640 --> 00:00:56,680
it. So let's start off with a quick

24
00:00:56,680 --> 00:01:00,079
preview of AI and machine learning. Well

25
00:01:00,079 --> 00:01:03,600
AI is a very general term, it encompasses

26
00:01:03,600 --> 00:01:06,680
the entire area of science and

27
00:01:06,680 --> 00:01:09,040
engineering that is related to creating

28
00:01:09,040 --> 00:01:10,840
software programs and machines that

29
00:01:10,840 --> 00:01:13,759
will be capable of performing tasks

30
00:01:13,759 --> 00:01:16,080
that would normally require human

31
00:01:16,080 --> 00:01:19,600
intelligence. But AI is a catchall term,

32
00:01:19,600 --> 00:01:22,920
so really when we talk about apply AI,

33
00:01:22,920 --> 00:01:25,920
how we use AI in our daily work, we are

34
00:01:25,920 --> 00:01:27,720
really going to be talking about machine

35
00:01:27,720 --> 00:01:30,000
learning. So machine learning is the

36
00:01:30,000 --> 00:01:31,680
design and application of software

37
00:01:31,680 --> 00:01:34,079
algorithms that are capable of learning

38
00:01:34,079 --> 00:01:37,960
on their own without any explicit human

39
00:01:37,960 --> 00:01:40,399
intervention. And the primary purpose of

40
00:01:40,399 --> 00:01:43,280
these algorithms are to optimize

41
00:01:43,280 --> 00:01:46,840
performance in a specific task. And the

42
00:01:46,840 --> 00:01:49,680
primary performance or the primary task

43
00:01:49,680 --> 00:01:52,000
that you want to optimize performance in

44
00:01:52,000 --> 00:01:54,240
is to be able to make accurate

45
00:01:54,240 --> 00:01:57,479
predictions about future outcomes based

46
00:01:57,479 --> 00:02:00,560
on the analysis of historical data

47
00:02:00,560 --> 00:02:02,960
from the past. So essentially machine

48
00:02:02,960 --> 00:02:05,320
learning is about making predictions

49
00:02:05,320 --> 00:02:06,880
about the future or what we call

50
00:02:06,880 --> 00:02:08,919
predictive analytics.

51
00:02:08,919 --> 00:02:11,000
And there are many different

52
00:02:11,000 --> 00:02:12,720
kinds of algorithms that are available in

53
00:02:12,720 --> 00:02:14,519
machine learning under the three primary

54
00:02:14,519 --> 00:02:16,440
categories of supervised learning,

55
00:02:16,440 --> 00:02:18,920
unsupervised learning, and reinforcement

56
00:02:18,920 --> 00:02:21,440
learning. And here we can see some of the

57
00:02:21,440 --> 00:02:23,560
different kinds of algorithms and their

58
00:02:23,560 --> 00:02:27,480
use cases in various areas in

59
00:02:27,480 --> 00:02:29,680
industry. So we have various domain use

60
00:02:29,680 --> 00:02:30,480
cases

61
00:02:30,480 --> 00:02:31,800
for all these different kind of

62
00:02:31,800 --> 00:02:33,840
algorithms, and we can see that different

63
00:02:33,840 --> 00:02:38,120
algorithms are fitted for different use cases.

64
00:02:38,120 --> 00:02:41,000
Deep learning is an advanced form

65
00:02:41,000 --> 00:02:42,400
of machine learning that's based on

66
00:02:42,400 --> 00:02:44,280
something called an artificial neural

67
00:02:44,280 --> 00:02:46,319
network or ANN for short, and this

68
00:02:46,319 --> 00:02:47,840
essentially simulates the structure of

69
00:02:47,840 --> 00:02:49,519
the human brain whereby neurons

70
00:02:49,519 --> 00:02:51,360
interconnect and work together to

71
00:02:51,360 --> 00:02:54,959
process and learn new information. So DL

72
00:02:54,959 --> 00:02:57,239
is the foundational technology for most

73
00:02:57,239 --> 00:02:59,360
of the popular AI tools that you

74
00:02:59,360 --> 00:03:01,400
probably have heard of today. So I'm sure

75
00:03:01,400 --> 00:03:03,200
you have heard of ChatGPT if you haven't

76
00:03:03,200 --> 00:03:05,360
been living in a cave for the past 2

77
00:03:05,360 --> 00:03:08,280
years. And yeah, so ChatGPT is an example

78
00:03:08,280 --> 00:03:10,120
of what we call a large language model

79
00:03:10,120 --> 00:03:11,599
and that's based on this technology

80
00:03:11,599 --> 00:03:14,879
called deep learning. Also, all the modern

81
00:03:14,879 --> 00:03:17,440
computer vision applications where a

82
00:03:17,440 --> 00:03:20,040
computer program can classify images or

83
00:03:20,040 --> 00:03:23,239
detect images or recognize images on

84
00:03:23,239 --> 00:03:25,280
its own, okay, we call this computer

85
00:03:25,280 --> 00:03:27,760
vision applications. They also use

86
00:03:27,760 --> 00:03:29,519
this particular form of machine learning

87
00:03:29,519 --> 00:03:31,560
called deep learning, right? So this is a

88
00:03:31,560 --> 00:03:33,640
example of an artificial neural network.

89
00:03:33,640 --> 00:03:35,200
For example, here I have an image of a

90
00:03:35,200 --> 00:03:37,159
bird that's fed into this artificial

91
00:03:37,159 --> 00:03:39,560
neural network, and output from this

92
00:03:39,560 --> 00:03:41,239
artificial neural network is a

93
00:03:41,239 --> 00:03:43,959
classification of this image into one of

94
00:03:43,959 --> 00:03:46,400
these three potential categories. So in

95
00:03:46,400 --> 00:03:49,080
this case, if the ANN has been trained

96
00:03:49,080 --> 00:03:51,799
properly, we fit in this image, this

97
00:03:51,799 --> 00:03:54,079
ANN should correctly classify this image

98
00:03:54,079 --> 00:03:56,879
as a bird, right? So this is a image

99
00:03:56,879 --> 00:03:58,959
classification problem which is a

100
00:03:58,959 --> 00:04:01,079
classic use case for an artificial

101
00:04:01,079 --> 00:04:03,929
neural network in the field of computer

102
00:04:03,929 --> 00:04:07,879
vision. And just like in the case of

103
00:04:07,879 --> 00:04:09,400
machine learning, there are a variety of

104
00:04:09,400 --> 00:04:11,640
algorithms that are available for

105
00:04:11,640 --> 00:04:13,599
deep learning under the category of

106
00:04:13,599 --> 00:04:15,000
supervised learning and also

107
00:04:15,000 --> 00:04:16,839
unsupervised learning.

108
00:04:16,839 --> 00:04:19,199
All right, so this is how we can

109
00:04:19,199 --> 00:04:20,839
kind of categorize this. You can think of

110
00:04:20,839 --> 00:04:23,880
AI is a general area of smart systems

111
00:04:23,880 --> 00:04:26,560
and machine. Machine learning is

112
00:04:26,560 --> 00:04:29,360
basically apply AI and deep learning

113
00:04:29,360 --> 00:04:29,823
is a

114
00:04:29,823 --> 00:04:32,560
subspecialization of machine learning

115
00:04:32,560 --> 00:04:35,000
using a particular architecture called

116
00:04:35,000 --> 00:04:38,760
an artificial neural network.

117
00:04:38,760 --> 00:04:42,160
And generative AI, so if you talk

118
00:04:42,160 --> 00:04:45,280
about ChatGPT, okay, Google Gemini,

119
00:04:45,280 --> 00:04:47,639
Microsoft Copilot, okay, all these

120
00:04:47,639 --> 00:04:49,600
examples of generative AI, they are

121
00:04:49,600 --> 00:04:51,600
basically large language models, and they

122
00:04:51,600 --> 00:04:53,880
are a further subcategory within the

123
00:04:53,880 --> 00:04:55,170
area of deep

124
00:04:55,170 --> 00:04:57,759
learning. And there are many applications

125
00:04:57,759 --> 00:04:59,400
of machine learning in industry right

126
00:04:59,400 --> 00:05:01,759
now, so pick which particular industry

127
00:05:01,759 --> 00:05:03,680
are you involved in, and these are all the

128
00:05:03,680 --> 00:05:05,060
specific areas of

129
00:05:05,060 --> 00:05:09,960
applications, right? So probably, I'm

130
00:05:09,960 --> 00:05:11,680
going to guess the vast majority of you

131
00:05:11,680 --> 00:05:12,880
who are watching this video, you're

132
00:05:12,880 --> 00:05:14,360
probably coming from the manufacturing

133
00:05:14,360 --> 00:05:16,639
industry, and so in the manufacturing

134
00:05:16,639 --> 00:05:18,479
industry some of the standard use cases

135
00:05:18,479 --> 00:05:20,039
for machine learning and deep learning

136
00:05:20,039 --> 00:05:23,080
are predicting potential problems, okay?

137
00:05:23,080 --> 00:05:25,319
So sometimes you call this predictive

138
00:05:25,319 --> 00:05:27,160
maintenance where you want to predict

139
00:05:27,160 --> 00:05:28,800
when a problem is going to happen and

140
00:05:28,800 --> 00:05:30,400
then kind of address it before it

141
00:05:30,400 --> 00:05:32,759
happens. And then monitoring systems,

142
00:05:32,759 --> 00:05:35,199
automating your manufacturing assembly

143
00:05:35,199 --> 00:05:37,880
line or production line, okay, smart

144
00:05:37,880 --> 00:05:40,120
scheduling, and detecting anomaly on your

145
00:05:40,120 --> 00:05:41,480
production line.

146
00:05:42,390 --> 00:05:44,160
Okay, so let's talk about the use

147
00:05:44,160 --> 00:05:45,680
case here which is predictive

148
00:05:45,680 --> 00:05:49,280
maintenance, right? So what is predictive

149
00:05:49,280 --> 00:05:51,720
maintenance? Well predictive maintenance,

150
00:05:51,720 --> 00:05:53,199
here's the long definition, is a

151
00:05:53,199 --> 00:05:54,639
equipment maintenance strategy that

152
00:05:54,639 --> 00:05:56,280
relies on real-time monitoring of

153
00:05:56,280 --> 00:05:58,360
equipment conditions and data to predict

154
00:05:58,360 --> 00:06:00,280
equipment failures in advance.

155
00:06:00,280 --> 00:06:02,680
And this uses advanced data models,

156
00:06:02,680 --> 00:06:05,240
analytics, and machine learning whereby

157
00:06:05,240 --> 00:06:07,479
we can reliably assess when failures are

158
00:06:07,479 --> 00:06:09,199
more likely to occur, including which

159
00:06:09,199 --> 00:06:11,120
components are more likely to be

160
00:06:11,120 --> 00:06:13,560
affected on your production or assembly

161
00:06:13,560 --> 00:06:16,599
line. So where does predictive

162
00:06:16,599 --> 00:06:18,759
maintenance fit into the overall scheme

163
00:06:18,759 --> 00:06:20,759
of things, right? So let's talk about the

164
00:06:20,759 --> 00:06:23,039
kind of standard way that, you know,

165
00:06:23,039 --> 00:06:25,520
factories or production

166
00:06:25,520 --> 00:06:27,680
lines, assembly lines in factories tend

167
00:06:27,680 --> 00:06:31,080
to handle maintenance issues say

168
00:06:31,080 --> 00:06:33,120
10 or 20 years ago, right? So what you

169
00:06:33,120 --> 00:06:34,520
have is the, what you would probably

170
00:06:34,520 --> 00:06:36,400
start off is the most basic mode

171
00:06:36,400 --> 00:06:38,240
which is reactive maintenance. So you

172
00:06:38,240 --> 00:06:40,680
just wait until your machine breaks down

173
00:06:40,680 --> 00:06:43,039
and then you repair, right? The simplest,

174
00:06:43,039 --> 00:06:44,720
but, of course, I'm sure if you have worked on a

175
00:06:44,720 --> 00:06:46,720
production line for any period of time,

176
00:06:46,720 --> 00:06:48,880
you know that this reactive maintenance

177
00:06:48,880 --> 00:06:50,759
can give you a whole bunch of headaches

178
00:06:50,759 --> 00:06:52,160
especially if the machine breaks down

179
00:06:52,160 --> 00:06:54,120
just before a critical delivery deadline,

180
00:06:54,120 --> 00:06:55,520
right? Then you're going to have a

181
00:06:55,520 --> 00:06:56,800
backlog of orders and you're going to

182
00:06:56,800 --> 00:06:59,160
run to a lot of problems. Okay, so we move on

183
00:06:59,160 --> 00:07:00,879
to preventive maintenance which is

184
00:07:00,879 --> 00:07:03,840
you regularly schedule a maintenance of

185
00:07:03,840 --> 00:07:07,000
your production machines to reduce

186
00:07:07,000 --> 00:07:08,800
the failure rate. So you might do

187
00:07:08,800 --> 00:07:10,520
maintenance once every month, once every

188
00:07:10,520 --> 00:07:13,120
two weeks, whatever. Okay, this is great,

189
00:07:13,120 --> 00:07:15,240
but the problem, of course, then is well

190
00:07:15,240 --> 00:07:16,199
sometimes you're doing too much

191
00:07:16,199 --> 00:07:18,400
maintenance, it's not really necessary,

192
00:07:18,400 --> 00:07:20,639
and it still doesn't totally prevent

193
00:07:20,639 --> 00:07:23,240
this, you know, a failure of the

194
00:07:23,240 --> 00:07:25,639
machine that occurs outside of your planned

195
00:07:25,639 --> 00:07:28,680
maintenance, right? So a bit of an

196
00:07:28,680 --> 00:07:31,160
improvement, but not that much better.

197
00:07:31,160 --> 00:07:33,280
And then, these last two categories is

198
00:07:33,280 --> 00:07:34,680
where we bring in AI and machine

199
00:07:34,680 --> 00:07:36,759
learning. So with machine learning, we're

200
00:07:36,759 --> 00:07:39,280
going to use sensors to do real-time

201
00:07:39,280 --> 00:07:41,759
monitoring of the data, and then using

202
00:07:41,759 --> 00:07:43,319
that data we're going to build a machine

203
00:07:43,319 --> 00:07:46,479
learning model which helps us to predict,

204
00:07:46,479 --> 00:07:50,000
with a reasonable level of accuracy, when

205
00:07:50,000 --> 00:07:52,520
the next failure is going to happen on

206
00:07:52,520 --> 00:07:54,440
your assembly or production line on a

207
00:07:54,440 --> 00:07:57,440
specific component or specific machine,

208
00:07:57,440 --> 00:07:59,520
right? So you just want to be predict to

209
00:07:59,520 --> 00:08:01,960
a high level of accuracy like maybe

210
00:08:01,960 --> 00:08:04,440
to the specific day, even the specific

211
00:08:04,440 --> 00:08:06,400
hour, or even minute itself when you

212
00:08:06,400 --> 00:08:08,360
expect that particular product to fail

213
00:08:08,360 --> 00:08:10,960
or the particular machine to fail. All

214
00:08:10,960 --> 00:08:12,639
right, so these are the advantages of

215
00:08:12,639 --> 00:08:14,879
predictive maintenance. It minimizes

216
00:08:14,879 --> 00:08:16,720
the occurrence of unscheduled downtime, it

217
00:08:16,720 --> 00:08:18,080
gives you a real-time overview of your

218
00:08:18,080 --> 00:08:19,919
current condition of assets, ensures

219
00:08:19,919 --> 00:08:22,680
minimal disruptions to productivity,

220
00:08:22,680 --> 00:08:24,720
optimizes time you spend on maintenance work,

221
00:08:24,720 --> 00:08:26,639
optimizes the use of spare parts, and so

222
00:08:26,639 --> 00:08:28,280
on. And of course there are some

223
00:08:28,280 --> 00:08:30,639
disadvantages, which is the

224
00:08:30,639 --> 00:08:32,559
primary one, you need a specialized set

225
00:08:32,559 --> 00:08:35,519
of skills among your engineers to

226
00:08:35,519 --> 00:08:37,719
understand and create machine learning

227
00:08:37,719 --> 00:08:40,599
models that can work on the real-time

228
00:08:40,599 --> 00:08:43,559
data that you're getting. Okay, so we're

229
00:08:43,559 --> 00:08:45,000
going to take a look at some real life

230
00:08:45,000 --> 00:08:47,200
use cases. So these are a bunch of links

231
00:08:47,200 --> 00:08:48,720
here, so if you navigate to these links

232
00:08:48,720 --> 00:08:50,120
here, you'll be able to get a look at

233
00:08:50,120 --> 00:08:54,360
some real life use cases of machine

234
00:08:54,360 --> 00:08:57,640
learning in predictive maintenance. So

235
00:08:57,640 --> 00:09:00,959
the IBM website, okay, gives you a look at

236
00:09:00,959 --> 00:09:04,880
a bunch of five use cases, so you can

237
00:09:04,880 --> 00:09:06,519
click on these links and follow up with

238
00:09:06,519 --> 00:09:08,279
them if you want to read more. Okay, this

239
00:09:08,279 --> 00:09:11,480
is waste management, manufacturing, okay,

240
00:09:11,480 --> 00:09:14,760
building services, and renewable energy,

241
00:09:14,760 --> 00:09:16,880
and also mining, right? So these are all

242
00:09:16,880 --> 00:09:18,279
use cases, if you want to know more about

243
00:09:18,279 --> 00:09:20,480
them, you can read up and follow them

244
00:09:20,480 --> 00:09:23,600
from this website. And this website

245
00:09:23,600 --> 00:09:25,760
gives, this is a pretty good website. I

246
00:09:25,760 --> 00:09:27,720
would really encourage you to just look

247
00:09:27,720 --> 00:09:28,880
through this if you're interested in

248
00:09:28,880 --> 00:09:31,160
predictive maintenance. So here, it tells

249
00:09:31,160 --> 00:09:34,279
you about, you know, an industry survey of

250
00:09:34,279 --> 00:09:36,360
predictive maintenance. We can see that a

251
00:09:36,360 --> 00:09:38,200
large portion of the industry,

252
00:09:38,200 --> 00:09:39,680
manufacturing industry agreed that

253
00:09:39,680 --> 00:09:41,360
predictive maintenance is a real need to

254
00:09:41,360 --> 00:09:43,959
stay competitive and predictive

255
00:09:43,959 --> 00:09:45,240
maintenance is essential for

256
00:09:45,240 --> 00:09:46,720
manufacturing industry and will gain

257
00:09:46,720 --> 00:09:48,279
additional strength in the future. So

258
00:09:48,279 --> 00:09:50,200
this is a survey that was done quite

259
00:09:50,200 --> 00:09:52,040
some time ago and this was the results

260
00:09:52,040 --> 00:09:53,880
that we got back. So we can see the vast

261
00:09:53,880 --> 00:09:55,720
majority of key industry players in the

262
00:09:55,720 --> 00:09:57,640
manufacturing sector, they consider

263
00:09:57,640 --> 00:09:59,000
predictive maintenance to be a very

264
00:09:59,000 --> 00:09:59,839
important

265
00:09:59,839 --> 00:10:01,600
activity that they want to

266
00:10:01,600 --> 00:10:04,519
incorporate into their workflow, right?

267
00:10:04,519 --> 00:10:07,720
And we can see here the kind of ROI that

268
00:10:07,720 --> 00:10:10,680
we expect on investment in predictive

269
00:10:10,680 --> 00:10:13,399
maintenance, so 45% reduction in downtime,

270
00:10:13,399 --> 00:10:17,120
25% growth in productivity, 75% fault

271
00:10:17,120 --> 00:10:19,480
elimination, 30% reduction in maintenance

272
00:10:19,480 --> 00:10:22,640
cost, okay? And best of all, if you really

273
00:10:22,640 --> 00:10:25,040
want to kind of take a look at examples,

274
00:10:25,040 --> 00:10:26,680
all right, so there are all these

275
00:10:26,680 --> 00:10:28,120
different companies that have

276
00:10:28,120 --> 00:10:30,160
significantly invested in predictive

277
00:10:30,160 --> 00:10:31,640
maintenance technology in their

278
00:10:31,640 --> 00:10:34,240
manufacturing processes. So PepsiCo, we

279
00:10:34,240 --> 00:10:38,965
have got Frito-Lay, General Motors, Mondi, Ecoplant,

280
00:10:38,965 --> 00:10:40,959
all right? So you can jump over here

281
00:10:40,959 --> 00:10:42,959
and take a look at some of these

282
00:10:42,959 --> 00:10:46,040
use cases. Let me perhaps, let me try and

283
00:10:46,040 --> 00:10:48,079
open this up, for example, Mondi, right? You

284
00:10:48,079 --> 00:10:51,880
can see Mondi has impl- oops. Mondi has used

285
00:10:51,880 --> 00:10:53,720
this particular piece of software

286
00:10:53,720 --> 00:10:55,839
called MATLAB, all right, or MathWorks

287
00:10:55,839 --> 00:10:59,760
sorry, to do predictive maintenance

288
00:10:59,760 --> 00:11:01,920
for their manufacturing processes using

289
00:11:01,920 --> 00:11:05,079
machine learning. And we can talk, you can

290
00:11:05,079 --> 00:11:07,680
study how they have used it, all right,

291
00:11:07,680 --> 00:11:09,000
and how it works, what was their

292
00:11:09,000 --> 00:11:10,920
challenge, all right, the problems they

293
00:11:10,920 --> 00:11:12,639
were facing, the solution that they use

294
00:11:12,639 --> 00:11:14,560
using this MathWorks Consulting piece of

295
00:11:14,560 --> 00:11:17,160
software, and data that they collected in

296
00:11:17,160 --> 00:11:20,399
a MATLAB database, all right, sorry

297
00:11:20,399 --> 00:11:23,639
in a Oracle database.

298
00:11:23,639 --> 00:11:26,399
So using MathWorks from MATLAB, all

299
00:11:26,399 --> 00:11:27,959
right, they were able to create a deep

300
00:11:27,959 --> 00:11:30,560
learning model to, you know, to

301
00:11:30,560 --> 00:11:32,839
solve this particular issue for their

302
00:11:32,839 --> 00:11:35,720
domain. So if you're interested, please, I

303
00:11:35,720 --> 00:11:37,639
strongly encourage you to read up on all

304
00:11:37,639 --> 00:11:40,440
these real life customer stories with

305
00:11:40,440 --> 00:11:43,403
showcase use cases for predictive

306
00:11:43,403 --> 00:11:48,240
maintenance. Okay, so that's it for

307
00:11:48,240 --> 00:11:52,200
real life use cases for predictive maintenance.

308
00:11:53,819 --> 00:11:56,600
Now in this topic, I'm

309
00:11:56,600 --> 00:11:58,000
going to talk about machine learning

310
00:11:58,000 --> 00:12:00,040
basics, so what is actually involved

311
00:12:00,040 --> 00:12:01,480
in machine learning, and I'm going to

312
00:12:01,480 --> 00:12:03,839
give a very quick, fast, conceptual, high

313
00:12:03,839 --> 00:12:05,920
level overview of machine learning, all

314
00:12:05,920 --> 00:12:09,000
right? So there are several categories of

315
00:12:09,000 --> 00:12:10,959
machine learning, supervised, unsupervised,

316
00:12:10,959 --> 00:12:13,000
semi-supervised, reinforcement, and deep

317
00:12:13,000 --> 00:12:15,880
learning, okay? And let's talk about the

318
00:12:15,880 --> 00:12:19,360
most common and widely used category of

319
00:12:19,360 --> 00:12:20,560
machine learning which is called

320
00:12:20,560 --> 00:12:25,040
supervised learning. So the particular use

321
00:12:25,040 --> 00:12:26,279
case here that I'm going to be

322
00:12:26,279 --> 00:12:28,560
discussing, predictive maintenance, it's

323
00:12:28,560 --> 00:12:31,320
basically a form of supervised learning.

324
00:12:31,320 --> 00:12:33,480
So how does supervised learning work?

325
00:12:33,480 --> 00:12:35,199
Well in supervised learning, you're going

326
00:12:35,199 --> 00:12:37,240
to create a machine learning model by

327
00:12:37,240 --> 00:12:39,360
providing what is called a labelled data

328
00:12:39,360 --> 00:12:41,680
set as a input to a machine learning

329
00:12:41,680 --> 00:12:44,680
program or algorithm. And this dataset

330
00:12:44,680 --> 00:12:46,440
is going to contain what is called an

331
00:12:46,440 --> 00:12:48,760
independent or feature variables, all

332
00:12:48,760 --> 00:12:51,240
right, so this will be a set of variables.

333
00:12:51,240 --> 00:12:52,959
And there will be one dependent or

334
00:12:52,959 --> 00:12:54,959
target variable which we also call the

335
00:12:54,959 --> 00:12:57,720
label, and the idea is that the

336
00:12:57,720 --> 00:12:59,839
independent or the feature variables are

337
00:12:59,839 --> 00:13:01,600
the attributes or properties of your

338
00:13:01,600 --> 00:13:04,160
data set that influence the dependent or

339
00:13:04,160 --> 00:13:07,760
the target variable, okay? So this process

340
00:13:07,760 --> 00:13:09,120
that I've just described is called

341
00:13:09,120 --> 00:13:11,600
training the machine learning model, and

342
00:13:11,600 --> 00:13:14,279
the model is fundamentally a

343
00:13:14,279 --> 00:13:16,399
mathematical function that best

344
00:13:16,399 --> 00:13:18,399
approximates the relationship between

345
00:13:18,399 --> 00:13:20,639
the independent variables and the

346
00:13:20,639 --> 00:13:22,639
dependent variable. All right, so that's

347
00:13:22,639 --> 00:13:24,480
quite a bit of a mouthful, so let's jump

348
00:13:24,480 --> 00:13:26,320
into a diagram that maybe illustrates

349
00:13:26,320 --> 00:13:27,880
this more clearly. So let's say you have

350
00:13:27,880 --> 00:13:30,000
a dataset here, an Excel spreadsheet,

351
00:13:30,000 --> 00:13:32,160
right? And this Excel spreadsheet has a

352
00:13:32,160 --> 00:13:34,040
bunch of columns here and a bunch of

353
00:13:34,040 --> 00:13:36,800
rows, okay? So these rows here represent

354
00:13:36,800 --> 00:13:39,000
observations, or these rows are what

355
00:13:39,000 --> 00:13:40,959
we call observations or samples or data

356
00:13:40,959 --> 00:13:43,120
points in our data set, okay? So let's

357
00:13:43,120 --> 00:13:46,880
assume this data set is gathered by a

358
00:13:46,880 --> 00:13:49,959
marketing manager at a mall, at a retail

359
00:13:49,959 --> 00:13:52,279
mall, all right? So they've got all this

360
00:13:52,279 --> 00:13:54,920
information about the customers who

361
00:13:54,920 --> 00:13:56,800
purchase products at this mall, all right?

362
00:13:56,800 --> 00:13:58,519
So some of the information they've

363
00:13:58,519 --> 00:14:00,000
gotten about the customers are their

364
00:14:00,000 --> 00:14:01,839
gender, their age, their income, and the

365
00:14:01,839 --> 00:14:03,600
number of children. So all this

366
00:14:03,600 --> 00:14:05,680
information about the customers, we call

367
00:14:05,680 --> 00:14:07,360
this the independent or the feature

368
00:14:07,360 --> 00:14:10,079
variables, all right? And based on all

369
00:14:10,079 --> 00:14:12,759
this information about the customer, we

370
00:14:12,759 --> 00:14:16,199
also managed to get some or we record

371
00:14:16,199 --> 00:14:17,600
the information about how much the

372
00:14:17,600 --> 00:14:20,480
customer spends, all right? So this

373
00:14:20,480 --> 00:14:22,079
information or these numbers here, we call

374
00:14:22,079 --> 00:14:23,839
this the target variable or the

375
00:14:23,839 --> 00:14:26,600
dependent variable, right? So on the

376
00:14:26,600 --> 00:14:29,519
single row, the data point, one single sample, one

377
00:14:29,519 --> 00:14:32,560
single data point, contains all the data

378
00:14:32,560 --> 00:14:35,040
for the feature variables and one single

379
00:14:35,040 --> 00:14:37,800
value for the label or the target

380
00:14:37,800 --> 00:14:41,199
variable, okay? And the primary purpose of

381
00:14:41,199 --> 00:14:43,240
the machine learning model is to create

382
00:14:43,240 --> 00:14:45,519
a mapping from all your feature

383
00:14:45,519 --> 00:14:48,160
variables to your target variable, so

384
00:14:48,160 --> 00:14:50,920
somehow there's going to be a function,

385
00:14:50,920 --> 00:14:52,160
okay, this will be a mathematical

386
00:14:52,160 --> 00:14:54,800
function that maps all the values of

387
00:14:54,800 --> 00:14:57,040
your feature variable to the value of

388
00:14:57,040 --> 00:14:59,639
your target variable. In other words, this

389
00:14:59,639 --> 00:15:01,279
function represents the relationship

390
00:15:01,279 --> 00:15:03,360
between your feature variables and your

391
00:15:03,360 --> 00:15:07,079
target variable, okay? So this whole thing,

392
00:15:07,079 --> 00:15:08,560
this training process, we call this the

393
00:15:08,560 --> 00:15:11,320
fitting the model. And the target

394
00:15:11,320 --> 00:15:13,240
variable or the label, this thing here,

395
00:15:13,240 --> 00:15:15,120
this column here, or the values here,

396
00:15:15,120 --> 00:15:17,399
these are critical for providing a

397
00:15:17,399 --> 00:15:19,000
context to do the fitting or the

398
00:15:19,000 --> 00:15:21,160
training of the model. And once you've

399
00:15:21,160 --> 00:15:23,360
got a trained and fitted model, you can

400
00:15:23,360 --> 00:15:25,959
then use the model to make an accurate

401
00:15:25,959 --> 00:15:28,319
prediction of target values

402
00:15:28,319 --> 00:15:30,240
corresponding to new feature values that

403
00:15:30,240 --> 00:15:32,519
the model has yet to encounter or yet to

404
00:15:32,519 --> 00:15:34,759
see, and this, as I've already said

405
00:15:34,759 --> 00:15:36,240
earlier, this is called predictive

406
00:15:36,240 --> 00:15:38,480
analytics, okay? So let's see what's

407
00:15:38,480 --> 00:15:40,120
actually happening here, you take your

408
00:15:40,120 --> 00:15:43,079
training data, all right, so this is this

409
00:15:43,079 --> 00:15:44,880
whole bunch of data, this data set here

410
00:15:44,880 --> 00:15:47,440
consisting of a thousand rows of

411
00:15:47,440 --> 00:15:49,920
data, 10,000 rows of data, you take this

412
00:15:49,920 --> 00:15:52,040
entire data set, all right, this entire

413
00:15:52,040 --> 00:15:54,000
data set, you jam it into your machine

414
00:15:54,000 --> 00:15:56,519
learning algorithm, and a couple of hours

415
00:15:56,519 --> 00:15:58,079
later your machine learning algorithm

416
00:15:58,079 --> 00:16:01,360
comes up with a model. And the model is

417
00:16:01,360 --> 00:16:04,199
essentially a function that maps all

418
00:16:04,199 --> 00:16:05,959
your feature variables which is these

419
00:16:05,959 --> 00:16:08,199
four columns here, to your target

420
00:16:08,199 --> 00:16:10,440
variable which is this one single column

421
00:16:10,440 --> 00:16:14,279
here, okay? So once you have the model, you

422
00:16:14,279 --> 00:16:17,040
can put in a new data point. So basically

423
00:16:17,040 --> 00:16:19,079
the new data point represents data about a

424
00:16:19,079 --> 00:16:20,959
new customer, a new customer that you

425
00:16:20,959 --> 00:16:23,120
have never seen before. So let's say

426
00:16:23,120 --> 00:16:25,079
you've already got information about

427
00:16:25,079 --> 00:16:27,560
10,000 customers that have visited this

428
00:16:27,560 --> 00:16:29,920
mall and how much each of these 10,000

429
00:16:29,920 --> 00:16:31,519
customers have spent when they are at this

430
00:16:31,519 --> 00:16:34,040
mall. So now you have a totally new

431
00:16:34,040 --> 00:16:35,800
customer that comes in the mall, this

432
00:16:35,800 --> 00:16:37,800
customer has never come into this mall

433
00:16:37,800 --> 00:16:39,839
before, and what we know about this

434
00:16:39,839 --> 00:16:42,680
customer is that he is a male, the age is

435
00:16:42,680 --> 00:16:45,199
50, the income is 18, and they have nine

436
00:16:45,199 --> 00:16:48,160
children. So now when you take this data

437
00:16:48,160 --> 00:16:50,519
and you pump that into your model, your

438
00:16:50,519 --> 00:16:52,920
model is going to make a prediction, it's

439
00:16:52,920 --> 00:16:55,720
going to say, hey, you know what? Based on

440
00:16:55,720 --> 00:16:57,279
everything that I have been trained before

441
00:16:57,279 --> 00:16:59,360
and based on the model I've developed,

442
00:16:59,360 --> 00:17:01,959
I am going to predict that a customer

443
00:17:01,959 --> 00:17:04,880
that is of a male gender, of the age 50

444
00:17:04,880 --> 00:17:08,280
with the income of 18, and nine children,

445
00:17:08,280 --> 00:17:12,400
that customer is going to spend 25 ringgit

446
00:17:12,400 --> 00:17:15,839
at the mall. And this is it, this is what

447
00:17:15,839 --> 00:17:18,599
you want. Right there, right here,

448
00:17:18,599 --> 00:17:21,319
can you see here? That is the final

449
00:17:21,319 --> 00:17:23,480
output of your machine learning model.

450
00:17:23,480 --> 00:17:27,359
It's going to make a prediction about

451
00:17:27,359 --> 00:17:29,760
something that it has not ever seen

452
00:17:29,760 --> 00:17:32,919
before, okay? That is the core, this is

453
00:17:32,919 --> 00:17:35,520
essentially the core of machine learning.

454
00:17:35,520 --> 00:17:38,640
Predictive analytics, making prediction

455
00:17:38,640 --> 00:17:40,120
about the future

456
00:17:41,170 --> 00:17:43,799
based on a historical data set.

457
00:17:44,379 --> 00:17:47,440
Okay, so there are two areas of

458
00:17:47,440 --> 00:17:49,480
supervised learning, regression and

459
00:17:49,480 --> 00:17:51,400
classification. So regression is used to

460
00:17:51,400 --> 00:17:53,440
predict a numerical target variable, such

461
00:17:53,440 --> 00:17:55,320
as the price of a house or the salary of

462
00:17:55,320 --> 00:17:57,799
an employee, whereas classification is

463
00:17:57,799 --> 00:17:59,919
used to predict a categorical target

464
00:17:59,919 --> 00:18:03,559
variable or class label, okay? So for

465
00:18:03,559 --> 00:18:05,799
classification you can have either

466
00:18:05,799 --> 00:18:08,679
binary or multiclass, so, for example,

467
00:18:08,679 --> 00:18:11,559
binary will be just true or false, zero

468
00:18:11,559 --> 00:18:14,840
or one. So whether your machine is going

469
00:18:14,840 --> 00:18:17,360
to fail or is it not going to fail, right?

470
00:18:17,360 --> 00:18:19,000
So just two classes, two possible,

471
00:18:19,000 --> 00:18:21,640
outcomes, or is the customer going to

472
00:18:21,640 --> 00:18:23,679
make a purchase or is the customer not

473
00:18:23,679 --> 00:18:26,159
going to make a purchase. We call this

474
00:18:26,159 --> 00:18:28,120
binary classification. And then for

475
00:18:28,120 --> 00:18:29,679
multiclass, when there are more than two

476
00:18:29,679 --> 00:18:32,559
classes or types of values. So, for

477
00:18:32,559 --> 00:18:34,039
example, here this would be a

478
00:18:34,039 --> 00:18:35,760
classification problem. So if you have a

479
00:18:35,760 --> 00:18:37,960
data set here, you've got information

480
00:18:37,960 --> 00:18:39,360
about your customers, you've got your

481
00:18:39,360 --> 00:18:41,159
gender of the customer, the age of the

482
00:18:41,159 --> 00:18:42,919
customer, the salary of the customer, and

483
00:18:42,919 --> 00:18:44,640
you also have record about whether the

484
00:18:44,640 --> 00:18:47,679
customer made a purchase or not, okay? So

485
00:18:47,679 --> 00:18:50,080
you can take this data set to train a

486
00:18:50,080 --> 00:18:52,440
classification model, and then the

487
00:18:52,440 --> 00:18:54,120
classification model can then make a

488
00:18:54,120 --> 00:18:56,320
prediction about a new customer, and

489
00:18:56,320 --> 00:18:58,799
they're going to predict zero which

490
00:18:58,799 --> 00:19:00,480
means the customer didn't make a

491
00:19:00,480 --> 00:19:03,159
purchase or one which means the customer

492
00:19:03,159 --> 00:19:06,320
make a purchase, right? And regression,

493
00:19:06,320 --> 00:19:08,600
this is regression, so let's say you want

494
00:19:08,600 --> 00:19:11,280
to predict the wind speed, and you've got

495
00:19:11,280 --> 00:19:13,799
historical data about all these four

496
00:19:13,799 --> 00:19:16,559
other independent variables or feature

497
00:19:16,559 --> 00:19:18,039
variables, so you have recorded

498
00:19:18,039 --> 00:19:19,640
temperature, the pressure, the relative

499
00:19:19,640 --> 00:19:21,799
humidity, and the wind direction for the

500
00:19:21,799 --> 00:19:24,799
past 10 days, 15 days, or whatever, okay? So

501
00:19:24,799 --> 00:19:26,760
now you are going to train your machine

502
00:19:26,760 --> 00:19:28,720
learning model using this data set, and

503
00:19:28,720 --> 00:19:31,679
the target variable column, okay, this

504
00:19:31,679 --> 00:19:33,760
column here, the label is basically a

505
00:19:33,760 --> 00:19:37,080
number, right? So now with this number,

506
00:19:37,080 --> 00:19:39,600
this is a regression model, and so now

507
00:19:39,600 --> 00:19:41,760
you can put in a new data point, so a new

508
00:19:41,760 --> 00:19:45,080
data point means a new set of values for

509
00:19:45,080 --> 00:19:46,960
temperature, pressure, relative humidity,

510
00:19:46,960 --> 00:19:48,600
and wind direction, and your machine

511
00:19:48,600 --> 00:19:50,679
learning model will then predict the

512
00:19:50,679 --> 00:19:53,640
wind speed for that new data point, okay?

513
00:19:53,640 --> 00:19:57,480
So that's a regression model.

514
00:19:59,159 --> 00:20:02,280
All right. So in this particular topic

515
00:20:02,280 --> 00:20:04,919
I'm going to talk about the workflow of

516
00:20:04,919 --> 00:20:07,960
that's involved in machine learning. So

517
00:20:07,960 --> 00:20:12,640
in the previous slides, I talked about

518
00:20:12,640 --> 00:20:14,600
developing the model, all right? But

519
00:20:14,600 --> 00:20:16,360
that's just one part of the entire

520
00:20:16,360 --> 00:20:19,080
workflow. So in real life when you use

521
00:20:19,080 --> 00:20:20,480
machine learning, there's an end-to-end

522
00:20:20,480 --> 00:20:22,480
workflow that's involved. So the first

523
00:20:22,480 --> 00:20:24,159
thing, of course, is you need to get your

524
00:20:24,159 --> 00:20:26,880
data, and then you need to clean your

525
00:20:26,880 --> 00:20:29,000
data, and then you need to explore your

526
00:20:29,000 --> 00:20:30,799
data. You need to see what's going on in

527
00:20:30,799 --> 00:20:33,280
your data set, right? And your data set,

528
00:20:33,280 --> 00:20:35,720
real life data sets are not trivial, they

529
00:20:35,720 --> 00:20:38,760
are hundreds of rows, thousands of rows,

530
00:20:38,760 --> 00:20:40,640
sometimes millions of rows, billions of

531
00:20:40,640 --> 00:20:43,080
rows, we're talking about billions or

532
00:20:43,080 --> 00:20:45,120
millions of data points especially if

533
00:20:45,120 --> 00:20:47,120
you're using an IoT sensor to get data

534
00:20:47,120 --> 00:20:49,000
in real time. So you've got all these

535
00:20:49,000 --> 00:20:51,320
super large data sets, you need to clean

536
00:20:51,320 --> 00:20:53,400
them, and explore them, and then you need

537
00:20:53,400 --> 00:20:56,360
to prepare them into a right format so

538
00:20:56,360 --> 00:20:59,600
that you can put them into the training

539
00:20:59,600 --> 00:21:01,520
process to create your machine learning

540
00:21:01,520 --> 00:21:04,799
model, and then subsequently you check

541
00:21:04,799 --> 00:21:07,559
how good is the model, right? How accurate

542
00:21:07,559 --> 00:21:10,080
is the model in terms of its ability to

543
00:21:10,080 --> 00:21:12,559
generate predictions for the

544
00:21:12,559 --> 00:21:14,960
future, right? How accurate are the

545
00:21:14,960 --> 00:21:16,679
predictions that are coming up from your

546
00:21:16,679 --> 00:21:18,400
machine learning model. So that's

547
00:21:18,400 --> 00:21:20,760
validating or evaluating your model, and

548
00:21:20,760 --> 00:21:22,559
then subsequently if you determine that

549
00:21:22,559 --> 00:21:25,400
your model is of adequate accuracy to

550
00:21:25,400 --> 00:21:27,240
meet whatever your domain use case

551
00:21:27,240 --> 00:21:29,400
requirements are, right? So let's say the

552
00:21:29,400 --> 00:21:31,440
accuracy that's required for your domain

553
00:21:31,440 --> 00:21:32,440
use case is

554
00:21:32,440 --> 00:21:35,320
85%, okay? If my machine learning model

555
00:21:35,320 --> 00:21:38,520
can give an 85% accuracy rate, I think

556
00:21:38,520 --> 00:21:40,159
it's good enough, then I'm going to

557
00:21:40,159 --> 00:21:42,880
deploy it into real world use case. So

558
00:21:42,880 --> 00:21:45,000
here the machine learning model gets

559
00:21:45,000 --> 00:21:48,440
deployed on the server, and then other,

560
00:21:48,440 --> 00:21:50,760
you know, other data sources are going to

561
00:21:50,760 --> 00:21:52,559
be captured from somewhere. That data is

562
00:21:52,559 --> 00:21:54,200
pump into the machine learning model. The

563
00:21:54,200 --> 00:21:55,440
machine learning model generates

564
00:21:55,440 --> 00:21:57,760
predictions, and those predictions are

565
00:21:57,760 --> 00:21:59,600
then used to make decisions on the

566
00:21:59,600 --> 00:22:02,000
factory floor in real time or in any

567
00:22:02,000 --> 00:22:04,559
other particular scenario. And then you

568
00:22:04,559 --> 00:22:06,840
constantly monitor and update the model,

569
00:22:06,840 --> 00:22:09,360
you get more new data, and then the

570
00:22:09,360 --> 00:22:11,960
entire cycle repeats itself. So that's

571
00:22:11,960 --> 00:22:14,480
your machine learning workflow, okay, in a

572
00:22:14,480 --> 00:22:16,919
nutshell. Here's another example of

573
00:22:16,919 --> 00:22:18,520
the same thing maybe in a slightly

574
00:22:18,520 --> 00:22:20,039
different format, so, again, you have your

575
00:22:20,039 --> 00:22:22,159
data collection and preparation. Here we

576
00:22:22,159 --> 00:22:24,360
talk more about the different kinds of

577
00:22:24,360 --> 00:22:26,520
algorithms that available to create a

578
00:22:26,520 --> 00:22:28,120
model, and I'll talk about this more in

579
00:22:28,120 --> 00:22:30,000
detail when we look at the real world

580
00:22:30,000 --> 00:22:32,320
example of a end-to-end machine learning

581
00:22:32,320 --> 00:22:34,559
workflow for the predictive maintenance

582
00:22:34,559 --> 00:22:36,880
use case. So once you have chosen the

583
00:22:36,880 --> 00:22:38,840
appropriate algorithm, you then have

584
00:22:38,840 --> 00:22:41,240
trained your model, you then have

585
00:22:41,240 --> 00:22:44,080
selected the appropriate train model

586
00:22:44,080 --> 00:22:46,440
among the multiple models. You are

587
00:22:46,440 --> 00:22:47,799
probably going to develop multiple

588
00:22:47,799 --> 00:22:49,559
models from multiple algorithms, you're

589
00:22:49,559 --> 00:22:51,679
going to evaluate them all, and then

590
00:22:51,679 --> 00:22:53,200
you're going to say, hey, you know what?

591
00:22:53,200 --> 00:22:55,279
After I've evaluated and tested that,

592
00:22:55,279 --> 00:22:57,480
I've chosen the best model, I'm going to

593
00:22:57,480 --> 00:22:59,640
deploy the model, all right, so this is

594
00:22:59,640 --> 00:23:02,640
for real life production use, okay? Real

595
00:23:02,640 --> 00:23:04,279
life sensor data is going to be pumped

596
00:23:04,279 --> 00:23:06,039
into my model, my model is going to

597
00:23:06,039 --> 00:23:08,039
generate predictions, the predicted data

598
00:23:08,039 --> 00:23:10,120
is going to used immediately in real

599
00:23:10,120 --> 00:23:12,840
time for real life decision making, and

600
00:23:12,840 --> 00:23:15,000
then I'm going to monitor, right, the

601
00:23:15,000 --> 00:23:17,440
results. So somebody's using the

602
00:23:17,440 --> 00:23:19,279
predictions from my model, if the

603
00:23:19,279 --> 00:23:21,880
predictions are lousy, that goes into the

604
00:23:21,880 --> 00:23:23,440
monitoring, the monitoring system

605
00:23:23,440 --> 00:23:25,279
captures that. If the predictions are

606
00:23:25,279 --> 00:23:27,720
fantastic, well that is also captured by the

607
00:23:27,720 --> 00:23:29,799
monitoring system, and that gets

608
00:23:29,799 --> 00:23:32,360
feedback again to the next cycle of my

609
00:23:32,360 --> 00:23:33,679
machine learning

610
00:23:33,679 --> 00:23:35,960
pipeline. Okay, so that's the kind of

611
00:23:35,960 --> 00:23:38,360
overall view, and here are the kind of

612
00:23:38,360 --> 00:23:41,559
key phases of your workflow. So one of

613
00:23:41,559 --> 00:23:43,960
the important phases is called EDA,

614
00:23:43,960 --> 00:23:47,520
exploratory data analysis and in this

615
00:23:47,520 --> 00:23:49,880
particular phase, you're going to

616
00:23:49,880 --> 00:23:53,120
do a lot of stuff, primarily just to

617
00:23:53,120 --> 00:23:54,880
understand your data set. So like I said,

618
00:23:54,880 --> 00:23:56,559
real life data sets, they tend to be very

619
00:23:56,559 --> 00:23:59,320
complex, and they tend to have various

620
00:23:59,320 --> 00:24:01,039
statistical properties, all right,

621
00:24:01,039 --> 00:24:02,679
statistics is a very important component

622
00:24:02,679 --> 00:24:05,600
of machine learning. So an EDA helps you

623
00:24:05,600 --> 00:24:07,480
to kind of get an overview of your data

624
00:24:07,480 --> 00:24:09,679
set, get an overview of any problems in

625
00:24:09,679 --> 00:24:11,520
your data set like any data that's

626
00:24:11,520 --> 00:24:13,440
missing, the statistical properties of your

627
00:24:13,440 --> 00:24:15,159
data set, the distribution of your data

628
00:24:15,159 --> 00:24:17,279
set, the statistical correlation of

629
00:24:17,279 --> 00:24:19,190
variables in your data set, etc,

630
00:24:19,190 --> 00:24:23,400
etc. Okay, then we have data cleaning or

631
00:24:23,400 --> 00:24:25,279
sometimes you call it data cleansing, and

632
00:24:25,279 --> 00:24:27,600
in this phase what you want to do is

633
00:24:27,600 --> 00:24:29,440
primarily, you want to kind of do things

634
00:24:29,440 --> 00:24:31,960
like remove duplicate records or rows in

635
00:24:31,960 --> 00:24:33,679
your table, you want to make sure that

636
00:24:33,679 --> 00:24:36,799
your data or your data

637
00:24:36,799 --> 00:24:39,399
points or your samples have appropriate IDs,

638
00:24:39,399 --> 00:24:41,080
and most importantly, you want to make

639
00:24:41,080 --> 00:24:43,039
sure there's not too many missing values

640
00:24:43,039 --> 00:24:44,880
in your data set. So what I mean by

641
00:24:44,880 --> 00:24:46,320
missing values are things like that,

642
00:24:46,320 --> 00:24:48,200
right? You have got a data set, and for

643
00:24:48,200 --> 00:24:51,640
some reason there are some cells or

644
00:24:51,640 --> 00:24:54,559
locations in your data set which are

645
00:24:54,559 --> 00:24:56,520
missing values, right? And if you have a

646
00:24:56,520 --> 00:24:58,679
lot of these missing values, then you've

647
00:24:58,679 --> 00:25:00,440
got a poor quality data set, and you're

648
00:25:00,440 --> 00:25:02,200
not going to be able to build a good

649
00:25:02,200 --> 00:25:04,159
model from this data set. You're not

650
00:25:04,159 --> 00:25:06,000
going to be able to train a good machine

651
00:25:06,000 --> 00:25:08,120
learning model from a data set with a

652
00:25:08,120 --> 00:25:10,200
lot of missing values like this. So you

653
00:25:10,200 --> 00:25:11,880
have to figure out whether there are a

654
00:25:11,880 --> 00:25:13,399
lot of missing values in your data set,

655
00:25:13,399 --> 00:25:15,399
how do you handle them. Another thing

656
00:25:15,399 --> 00:25:16,919
that's important in data cleansing is

657
00:25:16,919 --> 00:25:18,799
figuring out the outliers in your data

658
00:25:18,799 --> 00:25:21,919
set. So outliers are things like this,

659
00:25:21,919 --> 00:25:24,039
you know, data points that are very far from

660
00:25:24,039 --> 00:25:26,440
the general trend of data points in your

661
00:25:26,440 --> 00:25:29,559
data set, right? And so there are also

662
00:25:29,559 --> 00:25:31,919
several ways to detect outliers in your

663
00:25:31,919 --> 00:25:34,200
data set, and there are several ways to

664
00:25:34,200 --> 00:25:36,640
handle outliers in your data set.

665
00:25:36,640 --> 00:25:38,200
Similarly as well, there are several ways

666
00:25:38,200 --> 00:25:39,960
to handle missing values in your data

667
00:25:39,960 --> 00:25:42,880
set. So handling missing values, handling

668
00:25:42,880 --> 00:25:45,679
outliers, those are really two very key

669
00:25:45,679 --> 00:25:47,279
importance of data

670
00:25:47,279 --> 00:25:49,120
cleansing, and there are many, many

671
00:25:49,120 --> 00:25:50,760
techniques to handle this, so a data

672
00:25:50,760 --> 00:25:52,000
scientist needs to be acquainted with

673
00:25:52,000 --> 00:25:55,360
all of this. All right, why do I need to

674
00:25:55,360 --> 00:25:58,000
do data cleansing? Well, here is the key

675
00:25:58,000 --> 00:25:59,360
point.

676
00:25:59,360 --> 00:26:02,799
If you have a very poor quality data set,

677
00:26:02,799 --> 00:26:04,880
which means you've got a lot of outliers

678
00:26:04,880 --> 00:26:06,720
which are errors in your data set, or you

679
00:26:06,720 --> 00:26:08,159
got a lot of missing values in your data

680
00:26:08,159 --> 00:26:10,840
set, even though you've got a fantastic

681
00:26:10,840 --> 00:26:13,039
algorithm, you've got a fantastic model,

682
00:26:13,039 --> 00:26:15,720
the predictions that your model is going

683
00:26:15,720 --> 00:26:18,960
to give is absolutely rubbish. It's kind

684
00:26:18,960 --> 00:26:22,080
of like taking water and putting water

685
00:26:22,080 --> 00:26:26,000
into the tank of a Mercedes-Benz. So

686
00:26:26,000 --> 00:26:28,440
Mercedes-Benz is a great car, but if you

687
00:26:28,440 --> 00:26:30,080
take water and put it into your

688
00:26:30,080 --> 00:26:33,399
Mercedes-Benz, it will just die, right? Your

689
00:26:33,399 --> 00:26:36,520
car will just die, it can't run on water,

690
00:26:36,520 --> 00:26:38,279
right? On the other hand, if you have a

691
00:26:38,279 --> 00:26:41,559
Myvi, Myvi is just a lousy, shit car, but if

692
00:26:41,559 --> 00:26:44,840
you take a high octane, good petrol and

693
00:26:44,840 --> 00:26:47,240
you put into a Myvi, the Myvi will just go at,

694
00:26:47,240 --> 00:26:49,480
you know, 100 miles an hour. It would just

695
00:26:49,480 --> 00:26:51,159
completely destroy the Mercedes-Benz in

696
00:26:51,159 --> 00:26:53,360
terms of performance, so it

697
00:26:53,360 --> 00:26:54,799
doesn't really matter what model you're

698
00:26:54,799 --> 00:26:57,080
using here, right? So you can be using the most

699
00:26:57,080 --> 00:26:58,679
fantastic model like the

700
00:26:58,679 --> 00:27:01,200
Mercedes-Benz or machine learning, but if

701
00:27:01,200 --> 00:27:03,080
your data is lousy quality, your

702
00:27:03,080 --> 00:27:06,480
predictions is also going to be rubbish,

703
00:27:06,480 --> 00:27:10,000
okay? So cleansing data set is, in fact,

704
00:27:10,000 --> 00:27:11,880
probably the most important thing that

705
00:27:11,880 --> 00:27:13,640
data scientists need to do and that's

706
00:27:13,640 --> 00:27:15,520
what they spend most of the time doing,

707
00:27:15,520 --> 00:27:17,600
right, building the model, training the

708
00:27:17,600 --> 00:27:20,240
model, getting the right algorithms, and

709
00:27:20,240 --> 00:27:23,240
so on, that's really a small portion of

710
00:27:23,240 --> 00:27:25,200
the actual machine learning workflow,

711
00:27:25,200 --> 00:27:27,360
right? The actual machine learning

712
00:27:27,360 --> 00:27:29,679
workflow, the vast majority of time is on

713
00:27:29,679 --> 00:27:31,559
cleaning and organizing your

714
00:27:31,559 --> 00:27:33,360
data. Then you have something called

715
00:27:33,360 --> 00:27:35,080
feature engineering which is you

716
00:27:35,080 --> 00:27:37,000
preprocess the feature variables of

717
00:27:37,000 --> 00:27:38,919
your original data set prior to using

718
00:27:38,919 --> 00:27:40,600
them to train the model, and this is

719
00:27:40,600 --> 00:27:41,960
either through addition, deletion,

720
00:27:41,960 --> 00:27:43,600
combination, or transformation of these

721
00:27:43,600 --> 00:27:45,399
variables. And then the idea is you want

722
00:27:45,399 --> 00:27:47,000
to improve the predictive accuracy of

723
00:27:47,000 --> 00:27:49,320
the model, and also because some models

724
00:27:49,320 --> 00:27:51,080
can only work with numeric data, so you

725
00:27:51,080 --> 00:27:53,720
need to transform categorical data into

726
00:27:53,720 --> 00:27:57,039
numeric data. All right, so just now, in

727
00:27:57,039 --> 00:27:58,799
the earlier slides, I showed you that you

728
00:27:58,799 --> 00:28:00,760
take your original data set, you pump it

729
00:28:00,760 --> 00:28:03,200
into algorithm, and then a couple of hours

730
00:28:03,200 --> 00:28:05,200
later, you get a machine learning model,

731
00:28:05,200 --> 00:28:08,640
right? So you didn't do anything to your

732
00:28:08,640 --> 00:28:10,159
data set, to the feature variables in

733
00:28:10,159 --> 00:28:12,159
your data set before you pump it into a

734
00:28:12,159 --> 00:28:14,399
machine learning algorithm. So

735
00:28:14,399 --> 00:28:15,840
what I showed you earlier is you just

736
00:28:15,840 --> 00:28:18,919
take the data set exactly as it is and

737
00:28:18,919 --> 00:28:20,799
you just pump it into the algorithm,

738
00:28:20,799 --> 00:28:23,120
couple of hours later, you get a model,

739
00:28:23,120 --> 00:28:27,640
right? But that's not what generally

740
00:28:27,640 --> 00:28:29,600
happens in in real life. In real life,

741
00:28:29,600 --> 00:28:31,559
you're going to take all the original

742
00:28:31,559 --> 00:28:34,320
feature variables from your data set and

743
00:28:34,320 --> 00:28:36,720
you're going to transform them in some

744
00:28:36,720 --> 00:28:38,960
way. So you can see here these are the

745
00:28:38,960 --> 00:28:42,120
columns of data from my original data set,

746
00:28:42,120 --> 00:28:46,039
and before I actually put all these data

747
00:28:46,039 --> 00:28:48,240
points from my original data set into my

748
00:28:48,240 --> 00:28:50,720
algorithm to train and get my model, I

749
00:28:50,720 --> 00:28:54,960
will actually transform them, okay? So the

750
00:28:54,960 --> 00:28:57,600
transformation of these feature variable

751
00:28:57,600 --> 00:29:00,600
values, we call this feature engineering.

752
00:29:00,600 --> 00:29:02,440
And there are many, many techniques to do

753
00:29:02,440 --> 00:29:04,960
feature engineering, so one-hot encoding,

754
00:29:04,960 --> 00:29:08,279
scaling, log transformation,

755
00:29:08,279 --> 00:29:10,480
discretization, date extraction, boolean

756
00:29:10,480 --> 00:29:12,039
logic, etc, etc.

757
00:29:12,039 --> 00:29:14,880
Okay, then finally we do something

758
00:29:14,880 --> 00:29:16,799
called a train-test split, so where we

759
00:29:16,799 --> 00:29:19,440
take our original dataset, right? So this

760
00:29:19,440 --> 00:29:21,360
was the original dataset, and we break

761
00:29:21,360 --> 00:29:23,720
it into two parts, so one is called the

762
00:29:23,720 --> 00:29:25,760
training dataset and the other is

763
00:29:25,760 --> 00:29:28,120
called the test dataset. And the primary

764
00:29:28,120 --> 00:29:30,000
purpose for this is when we feed and

765
00:29:30,000 --> 00:29:31,399
train the machine learning model, we're

766
00:29:31,399 --> 00:29:32,640
going to use what is called the training

767
00:29:32,640 --> 00:29:35,559
dataset, and when we want to evaluate

768
00:29:35,559 --> 00:29:37,399
the accuracy of the model, right? So this

769
00:29:37,399 --> 00:29:40,960
is the key part of your machine learning

770
00:29:40,960 --> 00:29:43,640
life cycle because you are not only just

771
00:29:43,640 --> 00:29:45,440
going to have one possible models

772
00:29:45,440 --> 00:29:47,720
because there are a vast range of

773
00:29:47,720 --> 00:29:50,080
algorithms that you can use to create a

774
00:29:50,080 --> 00:29:53,000
model. So fundamentally you have a wide

775
00:29:53,000 --> 00:29:55,679
range of choices, right, like wide range

776
00:29:55,679 --> 00:29:57,640
of cars, right? You want to buy a car, you

777
00:29:57,640 --> 00:30:00,559
can buy a Myvi, you can buy a Perodua,

778
00:30:00,559 --> 00:30:02,640
you can buy a Honda, you can buy a

779
00:30:02,640 --> 00:30:05,039
Mercedes-Benz, you can buy a Audi, you can

780
00:30:05,039 --> 00:30:07,760
buy a beamer, many, many different cars

781
00:30:07,760 --> 00:30:09,240
that available for you if you want

782
00:30:09,240 --> 00:30:11,679
to buy a car, right? Same thing. With a

783
00:30:11,679 --> 00:30:14,360
machine learning model there are a vast

784
00:30:14,360 --> 00:30:16,720
variety of algorithms that you can

785
00:30:16,720 --> 00:30:19,480
choose from in order to create a model,

786
00:30:19,480 --> 00:30:21,519
and so once you create a model from a

787
00:30:21,519 --> 00:30:24,480
given algorithm you need to say, hey, how

788
00:30:24,480 --> 00:30:26,440
accurate is this model that I've created

789
00:30:26,440 --> 00:30:28,640
from this algorithm. And different

790
00:30:28,640 --> 00:30:30,399
algorithms are going to create different

791
00:30:30,399 --> 00:30:33,720
models with different rates of accuracy.

792
00:30:33,720 --> 00:30:35,679
And so the primary purpose of the test

793
00:30:35,679 --> 00:30:38,200
dataset is to evaluate the accuracy

794
00:30:38,200 --> 00:30:41,480
of the model to see hey, is this model

795
00:30:41,480 --> 00:30:43,360
that I've created using this algorithm,

796
00:30:43,360 --> 00:30:45,880
is it adequate for me to use in a real

797
00:30:45,880 --> 00:30:48,600
life production use case? Okay? So that's

798
00:30:48,600 --> 00:30:52,320
what it's all about. Okay, so this is my

799
00:30:52,320 --> 00:30:54,279
original dataset, I break it into my

800
00:30:54,279 --> 00:30:56,559
feature dataset and

801
00:30:56,559 --> 00:30:58,519
also my target variable column, so my

802
00:30:58,519 --> 00:31:00,639
feature variable columns, the target

803
00:31:00,639 --> 00:31:02,200
variable columns, and then I further break

804
00:31:02,200 --> 00:31:04,240
it into a training dataset and a test

805
00:31:04,240 --> 00:31:06,600
dataset. The training dataset is to use

806
00:31:06,600 --> 00:31:08,320
to train, to create the machine learning

807
00:31:08,320 --> 00:31:10,480
model. And then once the machine learning

808
00:31:10,480 --> 00:31:12,200
model is created, I then use the test

809
00:31:12,200 --> 00:31:15,080
dataset to evaluate the accuracy of the

810
00:31:15,080 --> 00:31:17,259
machine learning model.

811
00:31:17,259 --> 00:31:21,000
All right. And then finally we can

812
00:31:21,000 --> 00:31:23,200
see what are the different parts or

813
00:31:23,200 --> 00:31:26,080
aspects that go into a successful model,

814
00:31:26,080 --> 00:31:29,519
so EDA about 10%, data cleansing about

815
00:31:29,519 --> 00:31:32,360
20%, feature engineering about

816
00:31:32,360 --> 00:31:36,320
25%, selecting a specific algorithm about

817
00:31:36,320 --> 00:31:39,120
10%, and then training the model from

818
00:31:39,120 --> 00:31:41,639
that algorithm about 15%, and then

819
00:31:41,639 --> 00:31:43,679
finally evaluating the model, deciding

820
00:31:43,679 --> 00:31:45,960
which is the best model with the highest

821
00:31:45,960 --> 00:31:51,819
accuracy rate, that's about 20%.

822
00:31:54,080 --> 00:31:56,919
All right, so we have reached the

823
00:31:56,919 --> 00:31:58,880
most interesting part of this

824
00:31:58,880 --> 00:32:01,039
presentation which is the demonstration

825
00:32:01,039 --> 00:32:03,760
of an end-to-end machine learning workflow

826
00:32:03,760 --> 00:32:06,080
on a real life dataset that

827
00:32:06,080 --> 00:32:10,080
demonstrates the use case of predictive

828
00:32:10,080 --> 00:32:13,519
maintenance. So for the data set for

829
00:32:13,519 --> 00:32:16,240
this particular use case, I've used a

830
00:32:16,240 --> 00:32:19,200
data set from Kaggle. So for those of you

831
00:32:19,200 --> 00:32:21,399
are not aware of this, Kaggle is the

832
00:32:21,399 --> 00:32:24,880
world's largest open-source community

833
00:32:24,880 --> 00:32:28,080
for data science and AI, and they have a

834
00:32:28,080 --> 00:32:31,159
large collection of datasets from all

835
00:32:31,159 --> 00:32:34,440
various areas of industry and human

836
00:32:34,440 --> 00:32:37,039
endeavor, and they also have a large

837
00:32:37,039 --> 00:32:38,840
collection of models that have been

838
00:32:38,840 --> 00:32:42,880
developed using these data sets. So here

839
00:32:42,880 --> 00:32:47,039
we have a data set for the particular

840
00:32:47,039 --> 00:32:50,519
use case, predictive maintenance, okay? So

841
00:32:50,519 --> 00:32:52,919
this is some information about the data

842
00:32:52,919 --> 00:32:56,440
set, so in case you do not know how

843
00:32:56,440 --> 00:32:59,200
to get to there, this is the URL to click

844
00:32:59,200 --> 00:33:02,240
on, okay, to get to that dataset. So once

845
00:33:02,240 --> 00:33:05,120
your at the data set here, you can- or the

846
00:33:05,120 --> 00:33:07,399
page for about this dataset, you can see

847
00:33:07,399 --> 00:33:09,960
all the information about this data set,

848
00:33:09,960 --> 00:33:12,959
and you can download the data set in a

849
00:33:12,959 --> 00:33:14,159
CSV format.

850
00:33:14,159 --> 00:33:16,360
Okay, so let's take a look at the

851
00:33:16,360 --> 00:33:19,559
dataset. So this dataset has a total of

852
00:33:19,559 --> 00:33:23,440
10,000 samples, okay? And these are the

853
00:33:23,440 --> 00:33:26,279
feature variables, the type, the product

854
00:33:26,279 --> 00:33:28,440
ID, the air temperature, process

855
00:33:28,440 --> 00:33:30,899
temperature, rotational speed, torque, tool

856
00:33:30,899 --> 00:33:34,799
wear, and this is the target variable,

857
00:33:34,799 --> 00:33:36,720
all right? So the target variable is what

858
00:33:36,720 --> 00:33:38,159
we are interested in, what we are

859
00:33:38,159 --> 00:33:40,960
interested in using to train the machine

860
00:33:40,960 --> 00:33:42,600
learning model, and also what we are

861
00:33:42,600 --> 00:33:45,279
interested to predict, okay? So these are

862
00:33:45,279 --> 00:33:47,960
the feature variables, they describe or

863
00:33:47,960 --> 00:33:49,960
they provide information about this

864
00:33:49,960 --> 00:33:52,880
particular machine on the production

865
00:33:52,880 --> 00:33:55,080
line, on the assembly line, so you might

866
00:33:55,080 --> 00:33:56,799
know the product ID, the type, the air

867
00:33:56,799 --> 00:33:58,120
temperature, process temperature,

868
00:33:58,120 --> 00:34:00,480
rotational speed, torque, tool wear, right? So

869
00:34:00,480 --> 00:34:03,159
let's say you've got a IoT sensor system

870
00:34:03,159 --> 00:34:06,120
that's basically capturing all this data

871
00:34:06,120 --> 00:34:08,359
about a product or a machine on your

872
00:34:08,359 --> 00:34:10,679
production or assembly line, okay? And

873
00:34:10,679 --> 00:34:13,918
you've also captured information about

874
00:34:13,918 --> 00:34:17,199
whether is for a specific sample,

875
00:34:17,199 --> 00:34:19,839
whether that sample experience a

876
00:34:19,839 --> 00:34:23,040
failure or not, okay? So the target value

877
00:34:23,040 --> 00:34:25,520
of zero, okay, indicates that there's no

878
00:34:25,520 --> 00:34:28,000
failure. So zero means no failure, and we

879
00:34:28,000 --> 00:34:30,199
can see that the vast majority of data

880
00:34:30,199 --> 00:34:32,520
points in this data set are no failure.

881
00:34:32,520 --> 00:34:34,000
And here we can see an example here

882
00:34:34,000 --> 00:34:36,719
where you have a case of a failure, so a

883
00:34:36,719 --> 00:34:40,159
failure is marked as a one, positive, and

884
00:34:40,159 --> 00:34:42,639
no failure is marked as zero, negative,

885
00:34:42,639 --> 00:34:44,879
all right? So here we have one type of a

886
00:34:44,879 --> 00:34:47,040
failure, it's called a power failure. And

887
00:34:47,040 --> 00:34:49,000
if you scroll down the data set, you see

888
00:34:49,000 --> 00:34:50,399
there are also other kinds of failures

889
00:34:50,399 --> 00:34:52,839
like a tool wear

890
00:34:52,839 --> 00:34:56,960
failure, we have a overstrain failure

891
00:34:56,960 --> 00:34:58,680
here, for example,

892
00:34:58,680 --> 00:35:00,760
we also have a power failure again,

893
00:35:00,760 --> 00:35:02,200
and so on. So if you scroll down through

894
00:35:02,200 --> 00:35:04,160
these 10,000 data points, or if

895
00:35:04,160 --> 00:35:06,040
you're familiar with using Excel to

896
00:35:06,040 --> 00:35:08,839
filter out values in a column, you can

897
00:35:08,839 --> 00:35:12,280
see that in this particular column here

898
00:35:12,280 --> 00:35:14,480
which is the so-called target variable

899
00:35:14,480 --> 00:35:16,960
column, you are going to have the vast

900
00:35:16,960 --> 00:35:18,920
majority of values as zero which means

901
00:35:18,920 --> 00:35:22,760
no failure, and some of the rows or the

902
00:35:22,760 --> 00:35:24,040
data points you are going to have a

903
00:35:24,040 --> 00:35:26,359
value of one, and for those rows that you

904
00:35:26,359 --> 00:35:28,119
have a value of one, for example,

905
00:35:28,119 --> 00:35:31,280
here you are- Sorry, for example, here you

906
00:35:31,280 --> 00:35:32,839
are going to have different types of

907
00:35:32,839 --> 00:35:34,640
failures, so like I said just now power

908
00:35:34,640 --> 00:35:38,960
failure, tool set failure, etc, etc. So we are

909
00:35:38,960 --> 00:35:40,640
going to go through the entire machine

910
00:35:40,640 --> 00:35:43,759
learning workflow process with this dataset.

911
00:35:43,759 --> 00:35:46,640
So to see an example of that, we are

912
00:35:46,640 --> 00:35:50,400
going to use a- we're going to go to the

913
00:35:50,400 --> 00:35:52,280
code section here, all right, so if I

914
00:35:52,280 --> 00:35:54,280
click on the code section here. And right

915
00:35:54,280 --> 00:35:56,400
down here we have see what is called a

916
00:35:56,400 --> 00:35:59,359
dataset notebook. So this is basically a

917
00:35:59,359 --> 00:36:02,319
Jupyter notebook. Jupyter is basically an

918
00:36:02,319 --> 00:36:05,280
Python application which allows you to

919
00:36:05,280 --> 00:36:09,240
create a Python machine learning

920
00:36:09,240 --> 00:36:11,680
program that basically builds your

921
00:36:11,680 --> 00:36:14,520
machine learning model, assesses or

922
00:36:14,520 --> 00:36:16,480
evaluates its accuracy, and generates

923
00:36:16,480 --> 00:36:19,040
predictions from it, okay? So here we have

924
00:36:19,040 --> 00:36:21,680
a whole bunch of Jupyter notebooks that

925
00:36:21,680 --> 00:36:24,560
are available, and you can select any one

926
00:36:24,560 --> 00:36:26,000
of them. All these notebooks are

927
00:36:26,000 --> 00:36:28,720
essentially going to process the data

928
00:36:28,720 --> 00:36:31,720
from this particular dataset. So if I go

929
00:36:31,720 --> 00:36:34,720
to this code page here, I've actually

930
00:36:34,720 --> 00:36:37,319
selected a specific notebook that I'm

931
00:36:37,319 --> 00:36:39,960
going to run through to demonstrate an

932
00:36:39,960 --> 00:36:42,839
end-to-end machine learning workflow using

933
00:36:42,839 --> 00:36:45,560
various machine learning libraries from

934
00:36:45,560 --> 00:36:49,800
the Python programming language, okay? So

935
00:36:49,800 --> 00:36:52,440
the particular notebook I'm going to

936
00:36:52,440 --> 00:36:55,160
use is this particular notebook here, and

937
00:36:55,160 --> 00:36:57,160
you can also get the URL for that

938
00:36:57,160 --> 00:37:00,440
particular notebook from here.

939
00:37:00,440 --> 00:37:03,760
Okay, so let's quickly do a quick

940
00:37:03,760 --> 00:37:05,974
revision again. What are we trying to do

941
00:37:05,974 --> 00:37:08,000
here? We're trying to build a machine

942
00:37:08,000 --> 00:37:11,359
learning classification model, right? So

943
00:37:11,359 --> 00:37:12,960
we said there are two primary areas of

944
00:37:12,960 --> 00:37:14,560
supervised learning, one is regression

945
00:37:14,560 --> 00:37:16,200
which is used to predict a numerical

946
00:37:16,200 --> 00:37:18,640
target variable, and the second kind of

947
00:37:18,640 --> 00:37:21,359
supervised learning is classification

948
00:37:21,359 --> 00:37:23,079
which is what we're doing here. We're

949
00:37:23,079 --> 00:37:25,839
trying to predict a categorical target

950
00:37:25,839 --> 00:37:29,680
variable, okay? So in this particular

951
00:37:29,680 --> 00:37:32,119
example, we actually have two kinds of

952
00:37:32,119 --> 00:37:34,480
ways we can classify, either a binary

953
00:37:34,480 --> 00:37:36,679
classification or a multiclass

954
00:37:36,679 --> 00:37:39,520
classification. So for binary

955
00:37:39,520 --> 00:37:41,440
classification, we are only going to

956
00:37:41,440 --> 00:37:43,400
classify the product or machine as

957
00:37:43,400 --> 00:37:47,160
either it failed or it did not fail, okay?

958
00:37:47,160 --> 00:37:48,880
So if we go back to the dataset that I

959
00:37:48,880 --> 00:37:50,839
showed you just now, if you look at this

960
00:37:50,839 --> 00:37:52,680
target variable column, there are only

961
00:37:52,680 --> 00:37:54,520
two possible values here. They are either

962
00:37:54,520 --> 00:37:58,280
zero or one. Zero means there's no failure.

963
00:37:58,280 --> 00:38:01,240
One means there's a failure, okay? So this

964
00:38:01,240 --> 00:38:03,440
is an example of a binary classification.

965
00:38:03,440 --> 00:38:07,240
Only two possible outcomes, zero or one,

966
00:38:07,240 --> 00:38:10,119
didn't fail or fail, all right? Two

967
00:38:10,119 --> 00:38:13,079
possible outcomes. And then we can also,

968
00:38:13,079 --> 00:38:15,480
for the same dataset, we can extend it

969
00:38:15,480 --> 00:38:18,079
and make it a multiclass classification

970
00:38:18,079 --> 00:38:20,880
problem, all right? So if we kind of want

971
00:38:20,880 --> 00:38:23,720
to drill down further, we can say that

972
00:38:23,720 --> 00:38:26,800
not only is there a failure, we can

973
00:38:26,800 --> 00:38:29,200
actually say there are different types of

974
00:38:29,200 --> 00:38:32,440
failures, okay? So we have one category of

975
00:38:32,440 --> 00:38:35,599
class that is basically no failure, okay?

976
00:38:35,599 --> 00:38:37,400
Then we have a category for the

977
00:38:37,400 --> 00:38:40,400
different types of failures, right? So you

978
00:38:40,400 --> 00:38:43,920
can have a power failure, you could have

979
00:38:43,920 --> 00:38:46,400
a tool wear failure,

980
00:38:46,400 --> 00:38:48,920
you could have- let's go down

981
00:38:48,920 --> 00:38:50,880
here, you could have a overstrain

982
00:38:50,880 --> 00:38:53,760
failure, and etc, etc. So you can have

983
00:38:53,760 --> 00:38:57,160
multiple classes of failure in addition

984
00:38:57,160 --> 00:39:00,520
to the general overall or the majority

985
00:39:00,520 --> 00:39:04,319
class of no failure, and that would be a

986
00:39:04,319 --> 00:39:06,680
multiclass classification problem. So

987
00:39:06,680 --> 00:39:08,400
with this data set, we are going to see

988
00:39:08,400 --> 00:39:11,040
how to make it a binary classification

989
00:39:11,040 --> 00:39:12,800
problem and also a multiclass

990
00:39:12,800 --> 00:39:15,079
classification problem. Okay, so let's

991
00:39:15,079 --> 00:39:16,880
look at the workflow. So let's say we've

992
00:39:16,880 --> 00:39:18,880
already got the data, so right now we do

993
00:39:18,880 --> 00:39:20,839
have the dataset. This is the dataset

994
00:39:20,839 --> 00:39:22,720
that we have, so let's assume we've

995
00:39:22,720 --> 00:39:24,560
somehow managed to get this dataset

996
00:39:24,560 --> 00:39:26,880
from some IoT sensors that are

997
00:39:26,880 --> 00:39:29,119
monitoring real-time data in our

998
00:39:29,119 --> 00:39:31,079
production environment. On the assembly

999
00:39:31,079 --> 00:39:32,800
line, on the production line we've got

1000
00:39:32,800 --> 00:39:34,680
sensors reading data that gives us all

1001
00:39:34,680 --> 00:39:37,960
these data that we have in this CSV file.

1002
00:39:37,960 --> 00:39:40,079
Okay, so we've already got the data, we've

1003
00:39:40,079 --> 00:39:41,599
retrieved the data, now we're going to go

1004
00:39:41,599 --> 00:39:45,000
on to the cleaning and exploration part

1005
00:39:45,000 --> 00:39:47,520
of your machine learning life cycle. All

1006
00:39:47,520 --> 00:39:49,800
right, so let's look at the data cleaning

1007
00:39:49,800 --> 00:39:51,400
part. So the data cleaning part, we're

1008
00:39:51,400 --> 00:39:53,720
interested in checking for missing

1009
00:39:53,720 --> 00:39:56,200
values and maybe removing the rows you

1010
00:39:56,200 --> 00:39:58,079
missing values, okay?

1011
00:39:58,079 --> 00:39:59,760
So the kind of things we can- sorry,

1012
00:39:59,760 --> 00:40:01,000
the kind of things we can do in missing

1013
00:40:01,000 --> 00:40:02,880
values, we can remove the rows missing

1014
00:40:02,880 --> 00:40:05,839
values, we can put in some new values,

1015
00:40:05,839 --> 00:40:08,000
some replacement values which could be a

1016
00:40:08,000 --> 00:40:09,880
average of all the values in that that

1017
00:40:09,880 --> 00:40:12,880
particular column, etc, etc, we could also try to

1018
00:40:12,880 --> 00:40:15,480
identify outliers in our data set and

1019
00:40:15,480 --> 00:40:17,480
also there are a variety of ways to deal

1020
00:40:17,480 --> 00:40:19,480
with that. So this is called data

1021
00:40:19,480 --> 00:40:21,359
cleansing which is a really important

1022
00:40:21,359 --> 00:40:23,319
part of your machine learning workflow,

1023
00:40:23,319 --> 00:40:25,520
right? So that's where we are now at,

1024
00:40:25,520 --> 00:40:26,839
we're doing cleansing, and then we're

1025
00:40:26,839 --> 00:40:27,939
going to follow up with

1026
00:40:27,939 --> 00:40:31,160
exploration. So let's look at the actual

1027
00:40:31,160 --> 00:40:33,160
code that does the cleansing here. So

1028
00:40:33,160 --> 00:40:35,800
here we are right at the start of the

1029
00:40:35,800 --> 00:40:38,248
machine learning life cycle here, so

1030
00:40:38,248 --> 00:40:40,839
this is a Jupyter notebook. So here we

1031
00:40:40,839 --> 00:40:43,359
have a brief description of the problem

1032
00:40:43,359 --> 00:40:45,920
statement, all right? So this dataset

1033
00:40:45,920 --> 00:40:47,640
reflects real life predictive

1034
00:40:47,640 --> 00:40:49,240
maintenance encountered industry with

1035
00:40:49,240 --> 00:40:50,480
measurements from real equipment. The

1036
00:40:50,480 --> 00:40:52,400
features description is taken directly

1037
00:40:52,400 --> 00:40:54,520
from the data source set. So here we have

1038
00:40:54,520 --> 00:40:57,400
a description of the six key features in

1039
00:40:57,400 --> 00:40:59,599
our dataset type which is the quality

1040
00:40:59,599 --> 00:41:02,520
of the product, the air temperature, the

1041
00:41:02,520 --> 00:41:04,680
process temperature, the rotational speed,

1042
00:41:04,680 --> 00:41:06,599
the torque, and the tool wear, all right? So

1043
00:41:06,599 --> 00:41:08,880
these are the six feature variables, and

1044
00:41:08,880 --> 00:41:11,319
there are the two target variables, so

1045
00:41:11,319 --> 00:41:13,119
just now- I showed you just now there's

1046
00:41:13,119 --> 00:41:15,119
one target variable which only has two

1047
00:41:15,119 --> 00:41:17,440
possible values, either zero or one, okay?

1048
00:41:17,440 --> 00:41:20,079
Zero or one means failure or no failure,

1049
00:41:20,079 --> 00:41:23,079
so that will be this column here, right?

1050
00:41:23,079 --> 00:41:24,880
So let me go all the way back up to here.

1051
00:41:24,880 --> 00:41:26,640
So this column here, we already saw it

1052
00:41:26,640 --> 00:41:29,440
only has two possible values, it's either zero or

1053
00:41:29,440 --> 00:41:32,680
one. And then we also have this column

1054
00:41:32,680 --> 00:41:35,040
here, and this column here is basically

1055
00:41:35,040 --> 00:41:38,079
the failure type. And so the- we have- as I

1056
00:41:38,079 --> 00:41:40,800
already demonstrated just now, we do have

1057
00:41:40,800 --> 00:41:43,440
several categories of types of

1058
00:41:43,440 --> 00:41:45,560
failure, and so here we call this

1059
00:41:45,560 --> 00:41:46,235
multiclass

1060
00:41:46,235 --> 00:41:50,000
classification. So we can either build a

1061
00:41:50,000 --> 00:41:51,839
binary classification model for this

1062
00:41:51,839 --> 00:41:53,520
problem domain, or we can build a

1063
00:41:53,520 --> 00:41:54,491
multiclass

1064
00:41:54,491 --> 00:41:58,119
classification problem, all right. So this

1065
00:41:58,119 --> 00:41:59,839
Jupyter notebook is going to demonstrate

1066
00:41:59,839 --> 00:42:02,319
both approaches to us. So first step, we

1067
00:42:02,319 --> 00:42:04,800
are going to write all this Python code

1068
00:42:04,800 --> 00:42:06,880
that's going to import all the libraries

1069
00:42:06,880 --> 00:42:09,079
that we need to use, okay? So this is

1070
00:42:09,079 --> 00:42:12,319
basically Python code, okay, and it's

1071
00:42:12,319 --> 00:42:15,119
importing the relevant machine learn-

1072
00:42:15,119 --> 00:42:17,960
oops. We are importing the relevant

1073
00:42:17,960 --> 00:42:20,599
machine learning libraries related to

1074
00:42:20,599 --> 00:42:23,520
our domain use case, okay? Then we load in

1075
00:42:23,520 --> 00:42:26,440
our dataset, okay, so this our dataset.

1076
00:42:26,440 --> 00:42:28,319
We describe it, we have some quick

1077
00:42:28,319 --> 00:42:30,920
insights into the dataset. And then

1078
00:42:30,920 --> 00:42:32,839
we just take a look at all the variables

1079
00:42:32,839 --> 00:42:36,000
of the feature variables, etc, and so on.

1080
00:42:36,000 --> 00:42:38,000
What we're doing now is just

1081
00:42:38,000 --> 00:42:39,800
doing a quick overview of the dataset,

1082
00:42:39,800 --> 00:42:41,559
so this all this Python code here that

1083
00:42:41,559 --> 00:42:43,760
we're writing is allowing us, the data

1084
00:42:43,760 --> 00:42:45,359
scientist, to get a quick overview of our

1085
00:42:45,359 --> 00:42:48,209
dataset, right, okay, like how many varia-

1086
00:42:48,209 --> 00:42:50,240
how many rows are there, how many columns

1087
00:42:50,240 --> 00:42:51,760
are there, what are the data types of the

1088
00:42:51,760 --> 00:42:53,440
columns, what are the name of the columns,

1089
00:42:53,440 --> 00:42:57,359
etc, etc. Okay, then we zoom in on to the

1090
00:42:57,359 --> 00:42:58,839
target variables. So we look at the

1091
00:42:58,839 --> 00:43:02,000
target variables, how many counts

1092
00:43:02,000 --> 00:43:04,520
there are of this target variable, and

1093
00:43:04,520 --> 00:43:06,440
so on. How many different types of

1094
00:43:06,440 --> 00:43:08,240
failures there are. Then you want to

1095
00:43:08,240 --> 00:43:09,000
check whether there are any

1096
00:43:09,000 --> 00:43:10,760
inconsistencies between the target and

1097
00:43:10,760 --> 00:43:13,559
the failure type, etc. Okay, so when you do

1098
00:43:13,559 --> 00:43:15,119
all this checking, you're going to

1099
00:43:15,119 --> 00:43:16,960
discover there are some discrepancies in

1100
00:43:16,960 --> 00:43:20,280
your dataset, so using a specific Python

1101
00:43:20,280 --> 00:43:21,839
code to do checking, you're going to say

1102
00:43:21,839 --> 00:43:23,480
hey, you know what? There's some errors

1103
00:43:23,480 --> 00:43:25,000
here, right? There are nine values that

1104
00:43:25,000 --> 00:43:26,599
classify as failure in target variable,

1105
00:43:26,599 --> 00:43:28,200
but as no failure in the failure type

1106
00:43:28,200 --> 00:43:29,720
variable, so that means there's a

1107
00:43:29,720 --> 00:43:33,200
discrepancy in your data point, right?

1108
00:43:33,200 --> 00:43:34,760
So these are all the ones that

1109
00:43:34,760 --> 00:43:36,359
are discrepancies because the target

1110
00:43:36,359 --> 00:43:39,000
variable says one, and we already know

1111
00:43:39,000 --> 00:43:41,240
that target variable one is supposed to

1112
00:43:41,240 --> 00:43:43,099
mean there is a failure, right? Target

1113
00:43:43,099 --> 00:43:44,880
variable one is supposed to mean there is

1114
00:43:44,880 --> 00:43:47,119
a failure, so we are kind of expecting to

1115
00:43:47,119 --> 00:43:49,680
see the failure classification, but some

1116
00:43:49,680 --> 00:43:51,400
rows actually say there's no failure

1117
00:43:51,400 --> 00:43:53,800
although the target type is one. Well here

1118
00:43:53,800 --> 00:43:55,920
is a classic example of an error that

1119
00:43:55,920 --> 00:43:58,640
can very well occur in a dataset, so now

1120
00:43:58,640 --> 00:44:00,559
the question is what do you do with

1121
00:44:00,559 --> 00:44:04,720
these errors in your dataset, right? So

1122
00:44:04,720 --> 00:44:06,240
here the data scientist says, I think it

1123
00:44:06,240 --> 00:44:07,520
would make sense to remove those

1124
00:44:07,520 --> 00:44:09,920
instances, and so they write some code

1125
00:44:09,920 --> 00:44:12,680
then to remove those instances or those

1126
00:44:12,680 --> 00:44:14,920
rows or data points from the overall

1127
00:44:14,920 --> 00:44:17,280
data set, and same thing we can, again,

1128
00:44:17,280 --> 00:44:19,240
check for other issues. So we find there's

1129
00:44:19,240 --> 00:44:21,160
another issue here with our data set which

1130
00:44:21,160 --> 00:44:24,079
is another warning, so, again, we can

1131
00:44:24,079 --> 00:44:26,240
possibly remove them. So you're going to

1132
00:44:26,240 --> 00:44:31,280
remove 27 instances or rows from your

1133
00:44:31,280 --> 00:44:34,440
overall data set. So your data set has

1134
00:44:34,440 --> 00:44:37,079
10,000 rows or data points. You're

1135
00:44:37,079 --> 00:44:40,160
removing 27 which is only 0.27 of the

1136
00:44:40,160 --> 00:44:42,240
entire dataset. And these were the

1137
00:44:42,240 --> 00:44:45,720
reasons why you removed them, okay? So if

1138
00:44:45,720 --> 00:44:48,160
you're just removing 0.27% of the

1139
00:44:48,160 --> 00:44:50,800
entire dataset, no big deal, right? Still

1140
00:44:50,800 --> 00:44:53,079
okay, but you needed to remove them

1141
00:44:53,079 --> 00:44:55,000
because these errors right, these

1142
00:44:55,000 --> 00:44:58,040
27

1143
00:44:58,040 --> 00:45:00,559
errors, okay, data points with errors in

1144
00:45:00,559 --> 00:45:02,960
your dataset could really affect the

1145
00:45:02,960 --> 00:45:05,000
training of your machine learning model.

1146
00:45:05,000 --> 00:45:08,640
So we need to do your data cleansing,

1147
00:45:08,640 --> 00:45:11,720
right? So we are actually cleansing now

1148
00:45:11,720 --> 00:45:15,200
some kind of data that is

1149
00:45:15,200 --> 00:45:17,520
incorrect or erroneous in your original

1150
00:45:17,520 --> 00:45:21,440
dataset. Okay, so then we go on to the

1151
00:45:21,440 --> 00:45:23,839
next part which is called EDA, right? So

1152
00:45:23,839 --> 00:45:28,880
EDA is where we kind of explore our data,

1153
00:45:28,880 --> 00:45:31,720
and we want to, kind of, get a visual

1154
00:45:31,720 --> 00:45:34,240
overview of our data as a whole, and also

1155
00:45:34,240 --> 00:45:35,880
take a look at the statistical

1156
00:45:35,880 --> 00:45:38,160
properties of our data. The statistical

1157
00:45:38,160 --> 00:45:40,480
distribution of the data in all the

1158
00:45:40,480 --> 00:45:43,079
various columns, the correlation between

1159
00:45:43,079 --> 00:45:44,640
the variables, between the feature

1160
00:45:44,640 --> 00:45:46,680
variables different columns, and also the

1161
00:45:46,680 --> 00:45:48,599
feature variable and the target variable.

1162
00:45:48,599 --> 00:45:52,040
So all of this is called EDA, and EDA in

1163
00:45:52,040 --> 00:45:54,079
a machine learning workflow is typically

1164
00:45:54,079 --> 00:45:57,160
done through visualization,

1165
00:45:57,160 --> 00:45:58,839
all right? So let's go back here and take

1166
00:45:58,839 --> 00:46:00,599
a look, right? So, for example, here we are

1167
00:46:00,599 --> 00:46:03,400
looking at correlation, so we plot the

1168
00:46:03,400 --> 00:46:05,680
values of all the various feature

1169
00:46:05,680 --> 00:46:07,599
variables against each other and look

1170
00:46:07,599 --> 00:46:10,800
for potential correlations and patterns

1171
00:46:10,800 --> 00:46:13,359
and so on. And all the different shapes

1172
00:46:13,359 --> 00:46:17,280
that you see here in this pair plot, okay,

1173
00:46:17,280 --> 00:46:18,400
will have different meaning,

1174
00:46:18,400 --> 00:46:20,000
statistical meaning, and so the data

1175
00:46:20,000 --> 00:46:21,800
scientist has to, kind of, visually

1176
00:46:21,800 --> 00:46:23,760
inspect this pair plot, make some

1177
00:46:23,760 --> 00:46:25,559
interpretations of these different

1178
00:46:25,559 --> 00:46:27,680
patterns that he sees here, all right. So

1179
00:46:27,680 --> 00:46:30,480
these are some of the insights that

1180
00:46:30,480 --> 00:46:32,839
can be deduced from looking at these

1181
00:46:32,839 --> 00:46:34,319
patterns, so, for example, the torque and

1182
00:46:34,319 --> 00:46:36,280
rotational speed are highly correlated,

1183
00:46:36,280 --> 00:46:38,040
the process temperature and air

1184
00:46:38,040 --> 00:46:39,920
temperature also highly correlated, that

1185
00:46:39,920 --> 00:46:41,559
failures occur for extreme values of

1186
00:46:41,559 --> 00:46:44,520
some features, etc, etc. Then you can plot

1187
00:46:44,520 --> 00:46:45,960
certain kinds of charts. This called a

1188
00:46:45,960 --> 00:46:48,480
violin chart to, again, get new insights.

1189
00:46:48,480 --> 00:46:49,839
For example, regarding the torque and

1190
00:46:49,839 --> 00:46:51,480
rotational speed, it can see, again, that

1191
00:46:51,480 --> 00:46:53,119
most failures are triggered for much

1192
00:46:53,119 --> 00:46:55,119
lower or much higher values than the

1193
00:46:55,119 --> 00:46:57,400
mean when they're not failing. So all

1194
00:46:57,400 --> 00:47:00,720
these visualizations, they are there, and

1195
00:47:00,720 --> 00:47:02,480
a trained data scientist can look at

1196
00:47:02,480 --> 00:47:05,079
them, inspect them, and make some kind of

1197
00:47:05,079 --> 00:47:08,400
insightful deductions from them, okay?

1198
00:47:08,400 --> 00:47:11,079
Percentage of failure, right? The

1199
00:47:11,079 --> 00:47:13,640
correlation heat map, okay, between all

1200
00:47:13,640 --> 00:47:15,559
these different feature variables, and

1201
00:47:15,559 --> 00:47:16,430
also the target

1202
00:47:16,430 --> 00:47:19,599
variable, okay? The product types,

1203
00:47:19,599 --> 00:47:21,079
percentage of product types, percentage

1204
00:47:21,079 --> 00:47:23,160
of failure with respect to the product

1205
00:47:23,160 --> 00:47:25,720
type, so we can also kind of visualize

1206
00:47:25,720 --> 00:47:27,800
that as well. So certain products have a

1207
00:47:27,800 --> 00:47:29,839
higher ratio of failure compared to other

1208
00:47:29,839 --> 00:47:33,240
product types, etc. Or, for example, M

1209
00:47:33,240 --> 00:47:35,800
tends to fail more than H products, etc,

1210
00:47:35,800 --> 00:47:38,880
etc. So we can create a vast variety of

1211
00:47:38,880 --> 00:47:41,319
visualizations in the EDA stage, so you

1212
00:47:41,319 --> 00:47:43,960
can see here. And, again, the idea of this

1213
00:47:43,960 --> 00:47:46,359
visualization is just to give us some

1214
00:47:46,359 --> 00:47:49,680
insight, some preliminary insight into

1215
00:47:49,680 --> 00:47:52,520
our dataset that helps us to model it

1216
00:47:52,520 --> 00:47:54,119
more correctly. So some more insights

1217
00:47:54,119 --> 00:47:56,200
that we get into our data set from all

1218
00:47:56,200 --> 00:47:57,599
this visualization.

1219
00:47:57,599 --> 00:47:59,559
Then we can plot the distribution so we

1220
00:47:59,559 --> 00:48:00,720
can see whether it's a normal

1221
00:48:00,720 --> 00:48:02,789
distribution or some other kind of

1222
00:48:02,789 --> 00:48:05,640
distribution. We can have a box plot

1223
00:48:05,640 --> 00:48:07,760
to see whether there are any outliers in

1224
00:48:07,760 --> 00:48:10,400
your data set and so on, right? So we can

1225
00:48:10,400 --> 00:48:11,640
see from the box plots, we can see

1226
00:48:11,640 --> 00:48:14,599
rotational speed and have outliers. So we

1227
00:48:14,599 --> 00:48:16,880
already saw outliers are basically a

1228
00:48:16,880 --> 00:48:18,800
problem that you may need to kind of

1229
00:48:18,800 --> 00:48:22,520
tackle, right? So outliers are an issue,

1230
00:48:22,520 --> 00:48:24,800
it's a part of data cleansing. And

1231
00:48:24,800 --> 00:48:26,960
so you may need to tackle this, so we may

1232
00:48:26,960 --> 00:48:28,880
have to check okay, well where are the

1233
00:48:28,880 --> 00:48:31,319
potential outliers so we can analyze

1234
00:48:31,319 --> 00:48:35,319
them from the box plot, okay? But then

1235
00:48:35,319 --> 00:48:37,079
we can say well they are outliers, but

1236
00:48:37,079 --> 00:48:38,800
maybe they're not really horrible

1237
00:48:38,800 --> 00:48:40,760
outliers so we can tolerate them or

1238
00:48:40,760 --> 00:48:42,880
maybe we want to remove them. So we can

1239
00:48:42,880 --> 00:48:44,920
see what our mean and maximum values for

1240
00:48:44,920 --> 00:48:46,720
all these with respect to product type,

1241
00:48:46,720 --> 00:48:49,680
how many of them are above or highly

1242
00:48:49,680 --> 00:48:51,440
correlated with the product type in

1243
00:48:51,440 --> 00:48:54,240
terms of the maximum and minimum, okay,

1244
00:48:54,240 --> 00:48:56,960
and then so on. So the insight is well we

1245
00:48:56,960 --> 00:48:59,599
got 4.8% of the instances are outliers,

1246
00:48:59,599 --> 00:49:02,559
so maybe 4.87% is not really that much,

1247
00:49:02,559 --> 00:49:04,920
the outliers are not horrible, so we just

1248
00:49:04,920 --> 00:49:06,960
leave them in the dataset. Now for a

1249
00:49:06,960 --> 00:49:08,520
different dataset, the data scientist

1250
00:49:08,520 --> 00:49:10,280
could come to a different conclusion, so

1251
00:49:10,280 --> 00:49:12,280
then they would do whatever they've

1252
00:49:12,280 --> 00:49:15,400
deemed is appropriate to, kind of, cleanse

1253
00:49:15,400 --> 00:49:18,079
the dataset. Okay, so now that we have

1254
00:49:18,079 --> 00:49:20,000
done all the EDA, the next thing we're

1255
00:49:20,000 --> 00:49:23,160
going to do is we are going to do what

1256
00:49:23,160 --> 00:49:26,200
is called feature engineering. So we are

1257
00:49:26,200 --> 00:49:28,760
going to transform our original feature

1258
00:49:28,760 --> 00:49:31,280
variables and these are our original

1259
00:49:31,280 --> 00:49:32,960
feature variables, right? These are our

1260
00:49:32,960 --> 00:49:35,040
original feature variables, and we are

1261
00:49:35,040 --> 00:49:37,760
going to transform them, all right? We're

1262
00:49:37,760 --> 00:49:40,319
going to transform them in some sense

1263
00:49:40,319 --> 00:49:43,760
into some other form before we fit this

1264
00:49:43,760 --> 00:49:45,640
for training into our machine learning

1265
00:49:45,640 --> 00:49:48,599
algorithm, all right? So these are

1266
00:49:48,599 --> 00:49:51,599
examples of- let's say these are examples of a

1267
00:49:51,599 --> 00:49:55,200
original data set, right? And this is

1268
00:49:55,200 --> 00:49:56,839
examples, these are some of the examples,

1269
00:49:56,839 --> 00:49:58,040
you don't have to use all of them, but

1270
00:49:58,040 --> 00:49:59,440
these are some of the examples of what we

1271
00:49:59,440 --> 00:50:00,839
call feature engineering which you can

1272
00:50:00,839 --> 00:50:03,559
then transform your original values in

1273
00:50:03,559 --> 00:50:05,280
your feature variables to all these

1274
00:50:05,280 --> 00:50:07,920
transform values here. So we're going to

1275
00:50:07,920 --> 00:50:09,680
pretty much do that here, so we have a

1276
00:50:09,680 --> 00:50:12,599
ordinal encoding, we do scaling of the

1277
00:50:12,599 --> 00:50:14,839
data so the dataset is scaled, we use a

1278
00:50:14,839 --> 00:50:18,240
MinMax scaling, and then finally, we come

1279
00:50:18,240 --> 00:50:21,720
to do a modeling. So we have to split our

1280
00:50:21,720 --> 00:50:24,359
dataset into a training dataset and a

1281
00:50:24,359 --> 00:50:28,640
test dataset. So coming back to here again,

1282
00:50:28,640 --> 00:50:32,160
we said that before you train your

1283
00:50:32,160 --> 00:50:33,799
model, sorry, before you train your model,

1284
00:50:33,799 --> 00:50:35,599
you have to take your original dataset,

1285
00:50:35,599 --> 00:50:37,319
now this is a featured engineered dataset.

1286
00:50:37,319 --> 00:50:38,839
We're going to break it into two or

1287
00:50:38,839 --> 00:50:40,839
more subsets, okay. So one is called the

1288
00:50:40,839 --> 00:50:42,400
training dataset that we use to feed

1289
00:50:42,400 --> 00:50:44,000
and train a machine learning model. The

1290
00:50:44,000 --> 00:50:45,920
second is test dataset to evaluate the

1291
00:50:45,920 --> 00:50:47,960
accuracy of the model, okay? So we got

1292
00:50:47,960 --> 00:50:50,939
this training dataset, your test dataset,

1293
00:50:50,939 --> 00:50:52,720
and we also need

1294
00:50:52,720 --> 00:50:56,160
to sample. So from our original data set

1295
00:50:56,160 --> 00:50:57,400
we need to sample some points

1296
00:50:57,400 --> 00:50:58,839
that go into your training dataset, some

1297
00:50:58,839 --> 00:51:00,559
points that go in your test dataset. So

1298
00:51:00,559 --> 00:51:02,720
there are many ways to do sampling. One

1299
00:51:02,720 --> 00:51:04,920
way is to do stratified sampling where

1300
00:51:04,920 --> 00:51:06,720
we ensure the same proportion of data

1301
00:51:06,720 --> 00:51:09,000
from each stata or class because right

1302
00:51:09,000 --> 00:51:10,960
now we have a multiclass classification

1303
00:51:10,960 --> 00:51:12,319
problem, so you want to make sure the

1304
00:51:12,319 --> 00:51:13,960
same proportion of data from each strata or

1305
00:51:13,960 --> 00:51:15,839
class is equally proportional in the

1306
00:51:15,839 --> 00:51:17,920
training and test dataset as the

1307
00:51:17,920 --> 00:51:20,119
original dataset which is very useful

1308
00:51:20,119 --> 00:51:21,640
for dealing with what is called an

1309
00:51:21,640 --> 00:51:24,319
imbalanced dataset. So here we have an

1310
00:51:24,319 --> 00:51:25,839
example of what is called an imbalanced

1311
00:51:25,839 --> 00:51:29,520
dataset in the sense that you have the

1312
00:51:29,520 --> 00:51:32,760
vast majority of data points in your

1313
00:51:32,760 --> 00:51:34,960
data set, they are going to have the

1314
00:51:34,960 --> 00:51:37,480
value of zero for their target variable

1315
00:51:37,480 --> 00:51:40,200
column. So only a extremely small

1316
00:51:40,200 --> 00:51:43,443
minority of the data points in your dataset

1317
00:51:43,443 --> 00:51:45,319
will actually have the value of one

1318
00:51:45,319 --> 00:51:48,720
for their target variable column, okay? So

1319
00:51:48,720 --> 00:51:51,040
a situation where you have your class or

1320
00:51:51,040 --> 00:51:52,520
your target variable column where the

1321
00:51:52,520 --> 00:51:54,480
vast majority of values are from one

1322
00:51:54,480 --> 00:51:58,119
class and a tiny small minority are from

1323
00:51:58,119 --> 00:52:00,520
another class, we call this an imbalanced

1324
00:52:00,520 --> 00:52:02,720
dataset. And for an imbalanced dataset,

1325
00:52:02,720 --> 00:52:04,319
typically we will have a specific

1326
00:52:04,319 --> 00:52:05,920
technique to do the train test split

1327
00:52:05,920 --> 00:52:08,119
which is called stratified sampling, and

1328
00:52:08,119 --> 00:52:09,599
so that's what's exactly happening here.

1329
00:52:09,599 --> 00:52:12,000
We're doing a stratified split here, so

1330
00:52:12,000 --> 00:52:14,839
we are doing a train test split here,

1331
00:52:14,839 --> 00:52:17,520
and we are doing a stratified split.

1332
00:52:17,520 --> 00:52:20,359
And then now we actually develop the

1333
00:52:20,359 --> 00:52:23,359
models. So now we've got the train test

1334
00:52:23,359 --> 00:52:25,480
split, now here is where we actually

1335
00:52:25,480 --> 00:52:27,079
train the models.

1336
00:52:27,079 --> 00:52:29,920
Now in terms of classification there are

1337
00:52:29,920 --> 00:52:31,299
a whole bunch of

1338
00:52:31,299 --> 00:52:35,400
possibilities, right, that you can use.

1339
00:52:35,400 --> 00:52:38,480
There are many, many different algorithms

1340
00:52:38,480 --> 00:52:41,000
that we can use to create a

1341
00:52:41,000 --> 00:52:42,839
classification model. So these are an

1342
00:52:42,839 --> 00:52:45,079
example of some of the more common ones.

1343
00:52:45,079 --> 00:52:47,480
Logistic, support vector machine, decision

1344
00:52:47,480 --> 00:52:49,520
trees, random forest, bagging, balanced

1345
00:52:49,520 --> 00:52:52,720
bagging, boost, ensemble. So all

1346
00:52:52,720 --> 00:52:55,040
these are different algorithms which

1347
00:52:55,040 --> 00:52:57,760
will create different kinds of models

1348
00:52:57,760 --> 00:53:01,599
which will result in different accuracy

1349
00:53:01,599 --> 00:53:05,400
measures, okay? So it's the goal of the

1350
00:53:05,400 --> 00:53:08,920
data scientist to find the best model

1351
00:53:08,920 --> 00:53:11,520
that gives the best accuracy for the

1352
00:53:11,520 --> 00:53:14,119
given dataset, for training on that

1353
00:53:14,119 --> 00:53:16,880
given dataset. So let's head back, again,

1354
00:53:16,880 --> 00:53:19,760
to our machine learning workflow. So

1355
00:53:19,760 --> 00:53:21,520
here basically what I'm doing is I'm

1356
00:53:21,520 --> 00:53:23,520
creating a whole bunch of models here,

1357
00:53:23,520 --> 00:53:25,520
all right? So one is a random forest, one

1358
00:53:25,520 --> 00:53:27,160
is balanced bagging, one is a boost

1359
00:53:27,160 --> 00:53:29,520
classifier, one's a ensemble classifier,

1360
00:53:29,520 --> 00:53:32,760
and using all of these, I am going to

1361
00:53:32,760 --> 00:53:35,319
basically feed or train my model using

1362
00:53:35,319 --> 00:53:37,440
all these algorithms. And then I'm going

1363
00:53:37,440 --> 00:53:39,799
to evaluate them, okay? I'm going to

1364
00:53:39,799 --> 00:53:42,480
evaluate how good each of these models

1365
00:53:42,480 --> 00:53:45,760
are. And here you can see your

1366
00:53:45,760 --> 00:53:48,839
evaluation data, right? Okay and this is

1367
00:53:48,839 --> 00:53:50,839
the confusion matrix which is another

1368
00:53:50,839 --> 00:53:54,280
way of evaluating. So now we come to the,

1369
00:53:54,280 --> 00:53:56,319
kind of, the key part here which

1370
00:53:56,319 --> 00:53:58,520
is how do I distinguish between

1371
00:53:58,520 --> 00:54:00,079
all these models, right? I've got all

1372
00:54:00,079 --> 00:54:01,400
these different models which are built

1373
00:54:01,400 --> 00:54:03,040
with different algorithms which I'm

1374
00:54:03,040 --> 00:54:05,359
using to train on the same dataset, how

1375
00:54:05,359 --> 00:54:07,359
do I distinguish between all these

1376
00:54:07,359 --> 00:54:10,359
models, okay? And so for that sense, for

1377
00:54:10,359 --> 00:54:13,880
that we actually have a whole bunch of

1378
00:54:13,880 --> 00:54:16,200
common evaluation metrics for

1379
00:54:16,200 --> 00:54:18,319
classification, right? So this evaluation

1380
00:54:18,319 --> 00:54:22,240
metrics tell us how good a model is in

1381
00:54:22,240 --> 00:54:24,319
terms of its accuracy in

1382
00:54:24,319 --> 00:54:27,000
classification. So in terms of

1383
00:54:27,000 --> 00:54:29,440
accuracy, we actually have many different

1384
00:54:29,440 --> 00:54:31,680
models, sorry, many different measures,

1385
00:54:31,680 --> 00:54:33,440
right? You might think well, accuracy is

1386
00:54:33,440 --> 00:54:35,400
just accuracy, well that's all right, it's

1387
00:54:35,400 --> 00:54:36,880
just either it's accurate or it's not

1388
00:54:36,880 --> 00:54:39,319
accurate, right? But actually it's not

1389
00:54:39,319 --> 00:54:41,359
that simple. There are many different

1390
00:54:41,359 --> 00:54:43,839
ways to measure the accuracy of a

1391
00:54:43,839 --> 00:54:45,480
classification model, and these are some

1392
00:54:45,480 --> 00:54:48,280
of the more common ones. So, for example,

1393
00:54:48,280 --> 00:54:51,000
the confusion matrix tells us how many

1394
00:54:51,000 --> 00:54:54,000
true positives, that means the value is

1395
00:54:54,000 --> 00:54:55,880
positive, the prediction is positive, how

1396
00:54:55,880 --> 00:54:57,520
many false positives which means the

1397
00:54:57,520 --> 00:54:59,040
value is negative the machine learning

1398
00:54:59,040 --> 00:55:01,839
model predicts positive. How many false

1399
00:55:01,839 --> 00:55:03,839
negatives which means that the machine

1400
00:55:03,839 --> 00:55:05,559
learning model predicts negative, but

1401
00:55:05,559 --> 00:55:07,480
it's actually positive. And how many true

1402
00:55:07,480 --> 00:55:09,359
negatives there are which means that the

1403
00:55:09,359 --> 00:55:11,240
the machine learning model

1404
00:55:11,240 --> 00:55:12,880
predicts negative and the true value is

1405
00:55:12,880 --> 00:55:14,760
also negative. So this is called a

1406
00:55:14,760 --> 00:55:16,920
confusion matrix. This is one way we

1407
00:55:16,920 --> 00:55:19,480
assess or evaluate the performance of a

1408
00:55:19,480 --> 00:55:20,520
classification model,

1409
00:55:20,520 --> 00:55:23,319
okay? This is for binary

1410
00:55:23,319 --> 00:55:24,680
classification, we can also have

1411
00:55:24,680 --> 00:55:26,880
multiclass confusion matrix,

1412
00:55:26,880 --> 00:55:29,000
and then we can also measure things like

1413
00:55:29,000 --> 00:55:31,720
accuracy. So accuracy is the true

1414
00:55:31,720 --> 00:55:34,079
positives plus the true negatives which

1415
00:55:34,079 --> 00:55:35,440
is the total number of correct

1416
00:55:35,440 --> 00:55:37,839
predictions made by the model divided by

1417
00:55:37,839 --> 00:55:39,839
the total number of data points in your

1418
00:55:39,839 --> 00:55:42,599
dataset. And then you have also other

1419
00:55:42,599 --> 00:55:43,150
kinds of

1420
00:55:43,150 --> 00:55:46,599
measures such as recall. And this a

1421
00:55:46,599 --> 00:55:49,160
formula for recall, this is a formula for

1422
00:55:49,160 --> 00:55:51,480
the F1 score, okay? And then there's

1423
00:55:51,480 --> 00:55:55,559
something called the ROC curve, right? So

1424
00:55:55,559 --> 00:55:57,039
without going too much in the detail of

1425
00:55:57,039 --> 00:55:59,000
what each of these entails, essentially

1426
00:55:59,000 --> 00:56:00,640
these are all different ways, these are

1427
00:56:00,640 --> 00:56:03,280
different KPI, right? Just like if you

1428
00:56:03,280 --> 00:56:06,119
work in a company, you have different KPI,

1429
00:56:06,119 --> 00:56:08,079
right? Certain employees have certain KPI

1430
00:56:08,079 --> 00:56:11,280
that measures how good or how, you

1431
00:56:11,280 --> 00:56:13,200
know, efficient or how effective a

1432
00:56:13,200 --> 00:56:15,502
particular employee is, right? So the

1433
00:56:15,502 --> 00:56:19,880
KPI for your machine learning models

1434
00:56:19,880 --> 00:56:24,240
are ROC curve, F1 score, recall, accuracy,

1435
00:56:24,240 --> 00:56:26,599
okay, and your confusion matrix. So

1436
00:56:26,599 --> 00:56:29,839
fundamentally after I have built, right,

1437
00:56:29,839 --> 00:56:33,359
so here I've built my four different

1438
00:56:33,359 --> 00:56:35,240
models. So after I built these four

1439
00:56:35,240 --> 00:56:37,640
different models, I'm going to check and

1440
00:56:37,640 --> 00:56:39,680
evaluate them using all those different

1441
00:56:39,680 --> 00:56:42,440
metrics like, for example, the F1 score,

1442
00:56:42,440 --> 00:56:44,839
the precision score, the recall score, all

1443
00:56:44,839 --> 00:56:47,319
right. So for this model, I can check out

1444
00:56:47,319 --> 00:56:50,039
the ROC score, the F1 score, the precision

1445
00:56:50,039 --> 00:56:52,119
score, the recall score. Then for this

1446
00:56:52,119 --> 00:56:54,799
model, this is the ROC score, the F1 score,

1447
00:56:54,799 --> 00:56:56,839
the precision score, the recall score.

1448
00:56:56,839 --> 00:56:59,680
Then for this model and so on. So for

1449
00:56:59,680 --> 00:57:03,240
every single model I've created using my

1450
00:57:03,240 --> 00:57:05,839
training data set, I will have all my set

1451
00:57:05,839 --> 00:57:08,000
of evaluation metrics that I can use to

1452
00:57:08,000 --> 00:57:11,839
evaluate how good this model is, okay?

1453
00:57:11,839 --> 00:57:13,119
Same thing here, I've got a confusion

1454
00:57:13,119 --> 00:57:15,079
matrix here, right, so I can use that,

1455
00:57:15,079 --> 00:57:18,119
again, to evaluate between all these four

1456
00:57:18,119 --> 00:57:20,200
different models, and then I, kind of,

1457
00:57:20,200 --> 00:57:22,240
summarize it up here. So we can see from

1458
00:57:22,240 --> 00:57:25,440
this summary here that actually the top

1459
00:57:25,440 --> 00:57:27,599
two models, right, which are I'm going to

1460
00:57:27,599 --> 00:57:29,440
give a lot, as a data scientist, I'm now

1461
00:57:29,440 --> 00:57:31,119
going to just focus on these two models.

1462
00:57:31,119 --> 00:57:33,440
So these two models are bagging

1463
00:57:33,440 --> 00:57:36,000
classifier and random forest classifier.

1464
00:57:36,000 --> 00:57:38,480
They have the highest values of F1 score,

1465
00:57:38,480 --> 00:57:40,480
and the highest values of the ROC curve

1466
00:57:40,480 --> 00:57:42,640
score, okay? So we can say these are the

1467
00:57:42,640 --> 00:57:45,839
top two models in terms of accuracy, okay,

1468
00:57:45,839 --> 00:57:48,920
using the F1 evaluation metric and the

1469
00:57:48,920 --> 00:57:53,720
ROC AUC evaluation metric, okay? So these

1470
00:57:53,720 --> 00:57:57,480
results, kind of, summarize here, and

1471
00:57:57,480 --> 00:57:59,079
then we use different sampling

1472
00:57:59,079 --> 00:58:00,880
techniques, okay, so just now I talked

1473
00:58:00,880 --> 00:58:03,680
about um different kinds of sampling

1474
00:58:03,680 --> 00:58:06,400
techniques and so the idea of different

1475
00:58:06,400 --> 00:58:08,319
kinds of sampling techniques is to just

1476
00:58:08,319 --> 00:58:11,319
get a different feel for different

1477
00:58:11,319 --> 00:58:13,720
distributions of the data in different

1478
00:58:13,720 --> 00:58:16,359
areas of your data set so that you want

1479
00:58:16,359 --> 00:58:20,000
to just kind of make sure that your your

1480
00:58:20,000 --> 00:58:22,799
your evaluation of accuracy is actually

1481
00:58:22,799 --> 00:58:27,079
statistically correct right so we can um

1482
00:58:27,079 --> 00:58:29,599
do what is called oversampling and under

1483
00:58:29,599 --> 00:58:30,880
sampling which is very useful when

1484
00:58:30,880 --> 00:58:32,280
you're working with an imbalance data

1485
00:58:32,280 --> 00:58:35,039
set so this is example of doing that and

1486
00:58:35,039 --> 00:58:37,240
then here we again again check out the

1487
00:58:37,240 --> 00:58:38,799
results for all these different

1488
00:58:38,799 --> 00:58:41,680
techniques we use uh the F1 score the Au

1489
00:58:41,680 --> 00:58:43,599
score all right these are the two key

1490
00:58:43,599 --> 00:58:46,760
measures of accuracy right so and then

1491
00:58:46,760 --> 00:58:47,920
we can check out the scores for the

1492
00:58:47,920 --> 00:58:50,480
different approaches okay so we can see

1493
00:58:50,480 --> 00:58:53,119
oh well overall the models have lower Au

1494
00:58:53,119 --> 00:58:55,720
r r Au C score but they have a much

1495
00:58:55,720 --> 00:58:58,280
higher F1 score the begging classifier

1496
00:58:58,280 --> 00:59:00,839
had the highest R1 highest roc1 score

1497
00:59:00,839 --> 00:59:04,119
but F1 score was too low okay then in

1498
00:59:04,119 --> 00:59:06,520
the data scientist opinion the random

1499
00:59:06,520 --> 00:59:08,520
forest with this particular technique of

1500
00:59:08,520 --> 00:59:10,760
sampling has equilibrium between the F1

1501
00:59:10,760 --> 00:59:14,480
R F1 R and A score so the takeaway one

1502
00:59:14,480 --> 00:59:16,680
is the macro F1 score improves

1503
00:59:16,680 --> 00:59:18,480
dramatically using the sampl sampling

1504
00:59:18,480 --> 00:59:20,160
techniqu so these models might be better

1505
00:59:20,160 --> 00:59:22,440
compared to the balanced ones all right

1506
00:59:22,440 --> 00:59:26,280
so based on all this uh evaluation the

1507
00:59:26,280 --> 00:59:27,680
data scientist says they're going to

1508
00:59:27,680 --> 00:59:29,920
continue to work with these two models

1509
00:59:29,920 --> 00:59:31,440
all right and the balance begging one

1510
00:59:31,440 --> 00:59:33,079
and then continue to make further

1511
00:59:33,079 --> 00:59:35,039
comparisons all right so then we

1512
00:59:35,039 --> 00:59:37,079
continue to keep refining on our

1513
00:59:37,079 --> 00:59:38,599
evaluation work here we're going to

1514
00:59:38,599 --> 00:59:41,000
train the models one more time again so

1515
00:59:41,000 --> 00:59:43,039
we again do a training test plate and

1516
00:59:43,039 --> 00:59:44,799
then we do that for this particular uh

1517
00:59:44,799 --> 00:59:47,039
approach model and then we print out we

1518
00:59:47,039 --> 00:59:48,200
print out what is called a

1519
00:59:48,200 --> 00:59:50,960
classification report and this is

1520
00:59:50,960 --> 00:59:53,400
basically a summary of all those metrics

1521
00:59:53,400 --> 00:59:55,359
that I talk about just now so just now

1522
00:59:55,359 --> 00:59:57,520
remember I said the the there was

1523
00:59:57,520 --> 00:59:59,680
several evaluation metrics right so uh

1524
00:59:59,680 --> 01:00:01,480
we had the confusion matrics the

1525
01:00:01,480 --> 01:00:04,119
accuracy the Precision the recall the Au

1526
01:00:04,119 --> 01:00:08,119
ccore so here with the um classification

1527
01:00:08,119 --> 01:00:09,880
report I can get a summary of all of

1528
01:00:09,880 --> 01:00:11,760
that so I can see all the values here

1529
01:00:11,760 --> 01:00:14,640
okay for this particular model begging

1530
01:00:14,640 --> 01:00:17,160
Tomac links and then I can do that for

1531
01:00:17,160 --> 01:00:18,640
another model the random Forest

1532
01:00:18,640 --> 01:00:20,599
borderline SME and then I can do that

1533
01:00:20,599 --> 01:00:22,200
for another model which is the balance

1534
01:00:22,200 --> 01:00:25,160
ping so again we see this a lot of

1535
01:00:25,160 --> 01:00:27,079
comparison between different models

1536
01:00:27,079 --> 01:00:28,640
trying to figure out what all these

1537
01:00:28,640 --> 01:00:30,720
evaluation metrics are telling us all

1538
01:00:30,720 --> 01:00:32,960
right then again we have a confusion

1539
01:00:32,960 --> 01:00:35,880
Matrix so we generate a confusion Matrix

1540
01:00:35,880 --> 01:00:38,880
for the bagging with the toac links

1541
01:00:38,880 --> 01:00:40,720
under sampling for the random followers

1542
01:00:40,720 --> 01:00:42,680
with the borderline mod over sampling

1543
01:00:42,680 --> 01:00:44,960
and just balance begging by itself then

1544
01:00:44,960 --> 01:00:47,720
again we compare between these three uh

1545
01:00:47,720 --> 01:00:50,799
models uh using the confusion Matrix

1546
01:00:50,799 --> 01:00:52,599
evaluation Matrix and then we can kind

1547
01:00:52,599 --> 01:00:55,680
of come to some conclusions all right so

1548
01:00:55,680 --> 01:00:58,160
right so now we look at all the data

1549
01:00:58,160 --> 01:01:01,200
then we move on and look at another um

1550
01:01:01,200 --> 01:01:03,160
another kind of evaluation metrix which

1551
01:01:03,160 --> 01:01:06,720
is the r score right so this is one of

1552
01:01:06,720 --> 01:01:08,680
the other evaluation metrics I talk

1553
01:01:08,680 --> 01:01:11,200
about so this one is a kind of a curve

1554
01:01:11,200 --> 01:01:12,520
you look at it to see the area

1555
01:01:12,520 --> 01:01:14,359
underneath the curve this is called AOC

1556
01:01:14,359 --> 01:01:18,079
R area under the curve sorry Au Au R

1557
01:01:18,079 --> 01:01:19,880
area under the curve all right so the

1558
01:01:19,880 --> 01:01:21,839
area under the curve uh

1559
01:01:21,839 --> 01:01:24,319
score will give us some idea about the

1560
01:01:24,319 --> 01:01:25,599
threshold that we're going to use for

1561
01:01:25,599 --> 01:01:27,680
classif ification so we can examine this

1562
01:01:27,680 --> 01:01:29,200
for the bagging classifier for the

1563
01:01:29,200 --> 01:01:30,960
random forest classifier for the balance

1564
01:01:30,960 --> 01:01:33,599
bagging classifier okay then we can also

1565
01:01:33,599 --> 01:01:36,200
again do that uh finally we can check

1566
01:01:36,200 --> 01:01:37,880
the classification report of this

1567
01:01:37,880 --> 01:01:39,680
particular model so we keep doing this

1568
01:01:39,680 --> 01:01:43,200
over and over again evaluating this m

1569
01:01:43,200 --> 01:01:45,720
The Matrix the the accuracy Matrix the

1570
01:01:45,720 --> 01:01:46,880
evaluation Matrix for all these

1571
01:01:46,880 --> 01:01:48,880
different models so we keep doing this

1572
01:01:48,880 --> 01:01:50,520
over and over again for different

1573
01:01:50,520 --> 01:01:53,440
thresholds or for classification and so

1574
01:01:53,440 --> 01:01:56,880
as we keep drilling into these we kind

1575
01:01:56,880 --> 01:02:00,839
of get more and more understanding of

1576
01:02:00,839 --> 01:02:02,799
all these different models which one is

1577
01:02:02,799 --> 01:02:04,760
the best one that gives the best

1578
01:02:04,760 --> 01:02:08,520
performance for our data set okay so

1579
01:02:08,520 --> 01:02:11,440
finally we come to this conclusion this

1580
01:02:11,440 --> 01:02:13,520
particular model is not able to reduce

1581
01:02:13,520 --> 01:02:15,279
the record on failure test than

1582
01:02:15,279 --> 01:02:17,520
95.8% on the other hand balance begging

1583
01:02:17,520 --> 01:02:19,400
with a decision thresold of 0.6 is able

1584
01:02:19,400 --> 01:02:21,520
to have a better recall blah blah blah

1585
01:02:21,520 --> 01:02:25,319
Etc so finally after having done all of

1586
01:02:25,319 --> 01:02:27,480
this evalu ations

1587
01:02:27,480 --> 01:02:31,119
okay this is the conclusion

1588
01:02:31,119 --> 01:02:33,960
so after having gone so right now we

1589
01:02:33,960 --> 01:02:35,279
have gone through all the steps of the

1590
01:02:35,279 --> 01:02:37,760
Machining learning life cycle and which

1591
01:02:37,760 --> 01:02:40,240
means we have right now or the data

1592
01:02:40,240 --> 01:02:41,960
scientist right now has gone through all

1593
01:02:41,960 --> 01:02:43,000
these

1594
01:02:43,000 --> 01:02:47,079
steps uh which is now we have done this

1595
01:02:47,079 --> 01:02:48,640
validation so we have done the cleaning

1596
01:02:48,640 --> 01:02:50,559
exploration preparation transformation

1597
01:02:50,559 --> 01:02:52,599
the future engineering we have developed

1598
01:02:52,599 --> 01:02:54,359
and trained multiple models we have

1599
01:02:54,359 --> 01:02:56,480
evaluated all these different models so

1600
01:02:56,480 --> 01:02:58,599
right now we have reached this stage so

1601
01:02:58,599 --> 01:03:02,720
at this stage we as the data scientist

1602
01:03:02,720 --> 01:03:05,480
kind of have completed our job so we've

1603
01:03:05,480 --> 01:03:08,119
come to some very useful conclusions

1604
01:03:08,119 --> 01:03:09,640
which we now can share with our

1605
01:03:09,640 --> 01:03:13,240
colleagues all right and based on this

1606
01:03:13,240 --> 01:03:15,400
uh conclusions or recommendations

1607
01:03:15,400 --> 01:03:17,160
somebody is going to choose a

1608
01:03:17,160 --> 01:03:19,160
appropriate model and that model is

1609
01:03:19,160 --> 01:03:22,640
going to get deployed for realtime use

1610
01:03:22,640 --> 01:03:25,319
in a real life production environment

1611
01:03:25,319 --> 01:03:27,240
okay and that decision is going to be

1612
01:03:27,240 --> 01:03:29,359
made based on the recommendations coming

1613
01:03:29,359 --> 01:03:30,880
from the data scientist at the end of

1614
01:03:30,880 --> 01:03:33,480
this phase okay so at the end of this

1615
01:03:33,480 --> 01:03:35,079
phase the data scientist is going to

1616
01:03:35,079 --> 01:03:36,880
come up with these conclusions so

1617
01:03:36,880 --> 01:03:41,760
conclusions is okay if the engineering

1618
01:03:41,760 --> 01:03:44,520
team they are looking okay the

1619
01:03:44,520 --> 01:03:46,119
engineering team right the engineering

1620
01:03:46,119 --> 01:03:48,720
team if they are looking for the highest

1621
01:03:48,720 --> 01:03:51,839
failure detection rate possible then

1622
01:03:51,839 --> 01:03:54,480
they should go with this particular

1623
01:03:54,480 --> 01:03:56,520
model okay

1624
01:03:56,520 --> 01:03:58,680
and if they want a balance between

1625
01:03:58,680 --> 01:04:01,039
precision and recall then they should

1626
01:04:01,039 --> 01:04:03,240
choose between the begging model with a

1627
01:04:03,240 --> 01:04:05,960
0.4 decision threshold or the random

1628
01:04:05,960 --> 01:04:09,599
forest model with a 0.5 threshold but if

1629
01:04:09,599 --> 01:04:11,880
they don't care so much about predicting

1630
01:04:11,880 --> 01:04:14,480
every failure and they want the highest

1631
01:04:14,480 --> 01:04:16,760
Precision possible then they should opt

1632
01:04:16,760 --> 01:04:19,799
for the begging toax link classifier

1633
01:04:19,799 --> 01:04:23,160
with a bit higher decision threshold and

1634
01:04:23,160 --> 01:04:26,160
so this is the key thing that the data

1635
01:04:26,160 --> 01:04:28,319
scientist is going to give right this is

1636
01:04:28,319 --> 01:04:30,760
the key takeaway this is the kind of the

1637
01:04:30,760 --> 01:04:32,680
end result of the entire machine

1638
01:04:32,680 --> 01:04:34,680
learning life cycle right now the data

1639
01:04:34,680 --> 01:04:36,400
scientist is going to tell the

1640
01:04:36,400 --> 01:04:38,599
engineering team all right you guys

1641
01:04:38,599 --> 01:04:41,160
which is more important for you point a

1642
01:04:41,160 --> 01:04:45,039
point B or Point C make your decision so

1643
01:04:45,039 --> 01:04:47,400
the engineering team will then discuss

1644
01:04:47,400 --> 01:04:48,960
among themselves and say hey you know

1645
01:04:48,960 --> 01:04:52,279
what what we want is we want to get the

1646
01:04:52,279 --> 01:04:54,720
highest failure detection possible

1647
01:04:54,720 --> 01:04:58,359
because any kind kind of failure of that

1648
01:04:58,359 --> 01:05:00,400
machine or the product on the samply

1649
01:05:00,400 --> 01:05:03,119
line is really going to screw us up big

1650
01:05:03,119 --> 01:05:05,640
time so what we're looking for is the

1651
01:05:05,640 --> 01:05:08,079
model that will give us the highest

1652
01:05:08,079 --> 01:05:10,880
failure detection rate we don't care

1653
01:05:10,880 --> 01:05:13,480
about Precision but we want to be make

1654
01:05:13,480 --> 01:05:15,440
sure that if there's a failure we are

1655
01:05:15,440 --> 01:05:17,720
going to catch it right so that's what

1656
01:05:17,720 --> 01:05:19,599
they want and so the data scientist will

1657
01:05:19,599 --> 01:05:22,200
say Hey you go for the balance begging

1658
01:05:22,200 --> 01:05:24,880
model okay then the data scientist saves

1659
01:05:24,880 --> 01:05:27,720
this all right uh and then once you have

1660
01:05:27,720 --> 01:05:30,000
saved this uh you can then go right

1661
01:05:30,000 --> 01:05:32,319
ahead and deploy that so you can go

1662
01:05:32,319 --> 01:05:33,520
right ahead and deploy that to

1663
01:05:33,520 --> 01:05:37,160
production okay and so if you want to

1664
01:05:37,160 --> 01:05:38,839
continue we can actually further

1665
01:05:38,839 --> 01:05:41,119
continue this modeling problem so just

1666
01:05:41,119 --> 01:05:43,480
now I model this problem as a binary

1667
01:05:43,480 --> 01:05:46,720
classification problem uh sorry just I

1668
01:05:46,720 --> 01:05:48,240
modeled this problem as a binary

1669
01:05:48,240 --> 01:05:49,520
classification which means it's either

1670
01:05:49,520 --> 01:05:51,680
zero or one either fail or not fail but

1671
01:05:51,680 --> 01:05:53,599
we can also model it as a multiclass

1672
01:05:53,599 --> 01:05:55,640
classification problem right because as

1673
01:05:55,640 --> 01:05:57,640
as I said earlier just now for the

1674
01:05:57,640 --> 01:06:00,200
Target variable colum which is sorry for

1675
01:06:00,200 --> 01:06:02,520
the failure type colume you actually

1676
01:06:02,520 --> 01:06:04,839
have multiple kinds of failures right

1677
01:06:04,839 --> 01:06:07,559
for example you may have a power failure

1678
01:06:07,559 --> 01:06:10,000
uh you may have a towar failure uh you

1679
01:06:10,000 --> 01:06:12,920
may have a overstrain failure so now we

1680
01:06:12,920 --> 01:06:14,839
can model the problem slightly

1681
01:06:14,839 --> 01:06:17,240
differently so we can model it as a

1682
01:06:17,240 --> 01:06:19,680
multiclass classification problem and

1683
01:06:19,680 --> 01:06:21,160
then we go through the entire same

1684
01:06:21,160 --> 01:06:22,680
process that we went through just now so

1685
01:06:22,680 --> 01:06:24,880
we create different models we test this

1686
01:06:24,880 --> 01:06:26,720
out but now the confusion Matrix is for

1687
01:06:26,720 --> 01:06:30,119
a multiclass classification isue right

1688
01:06:30,119 --> 01:06:30,960
so we're going

1689
01:06:30,960 --> 01:06:34,039
to check them out we're going to again

1690
01:06:34,039 --> 01:06:36,079
uh try different algorithms or models

1691
01:06:36,079 --> 01:06:38,039
again train and test our data set do the

1692
01:06:38,039 --> 01:06:39,760
training test split uh on these

1693
01:06:39,760 --> 01:06:42,000
different models all right so we have

1694
01:06:42,000 --> 01:06:43,400
like for example we have bon random

1695
01:06:43,400 --> 01:06:46,160
Forest B random Forest a great search

1696
01:06:46,160 --> 01:06:47,720
then you train the models using what is

1697
01:06:47,720 --> 01:06:49,680
called hyperparameter tuning then you

1698
01:06:49,680 --> 01:06:51,079
get the scores all right so you get the

1699
01:06:51,079 --> 01:06:53,160
same evaluation scores again you check

1700
01:06:53,160 --> 01:06:54,599
out the evaluation scores compare

1701
01:06:54,599 --> 01:06:57,079
between them generate a confusion Matrix

1702
01:06:57,079 --> 01:06:59,960
so this is a multiclass confusion Matrix

1703
01:06:59,960 --> 01:07:02,400
and then you come to the final

1704
01:07:02,400 --> 01:07:05,760
conclusion so now if you are interested

1705
01:07:05,760 --> 01:07:09,000
to frame your problem domain as a

1706
01:07:09,000 --> 01:07:11,359
multiclass classification problem all

1707
01:07:11,359 --> 01:07:13,839
right then these are the recommendations

1708
01:07:13,839 --> 01:07:15,480
from the data scientist so the data

1709
01:07:15,480 --> 01:07:17,240
scientist will say you know what I'm

1710
01:07:17,240 --> 01:07:19,559
going to pick this particular model the

1711
01:07:19,559 --> 01:07:22,039
balance backing classifier and these are

1712
01:07:22,039 --> 01:07:24,520
all the reasons that the data scientist

1713
01:07:24,520 --> 01:07:27,279
is going to give as a rational for

1714
01:07:27,279 --> 01:07:29,400
selecting this particular

1715
01:07:29,400 --> 01:07:32,039
model and then once that's done you save

1716
01:07:32,039 --> 01:07:35,000
the model and that's that's it that's it

1717
01:07:35,000 --> 01:07:38,920
so that's all done now and so then the

1718
01:07:38,920 --> 01:07:41,039
uh the model the machine learning model

1719
01:07:41,039 --> 01:07:43,720
now you can put it live run it on the

1720
01:07:43,720 --> 01:07:45,279
server and now the machine learning

1721
01:07:45,279 --> 01:07:47,200
model is ready to work which means it's

1722
01:07:47,200 --> 01:07:48,920
ready to generate predictions right

1723
01:07:48,920 --> 01:07:50,279
that's the main job of the machine

1724
01:07:50,279 --> 01:07:52,039
learning model you have picked the best

1725
01:07:52,039 --> 01:07:53,680
machine learning model with the best

1726
01:07:53,680 --> 01:07:55,799
evaluation metrics for whatever accur

1727
01:07:55,799 --> 01:07:57,760
see goal you're trying to achieve and

1728
01:07:57,760 --> 01:07:59,640
now you're going to run it on a server

1729
01:07:59,640 --> 01:08:00,799
and now you're going to get all this

1730
01:08:00,799 --> 01:08:02,960
real time data that's coming from your

1731
01:08:02,960 --> 01:08:04,520
sensus you're going to pump that into

1732
01:08:04,520 --> 01:08:06,359
your machine learning model your machine

1733
01:08:06,359 --> 01:08:07,880
learning model will pump out a whole

1734
01:08:07,880 --> 01:08:09,520
bunch of predictions and we're going to

1735
01:08:09,520 --> 01:08:12,799
use that predictions in real time to

1736
01:08:12,799 --> 01:08:15,400
make real time real world decision

1737
01:08:15,400 --> 01:08:17,560
making right you're going to say okay

1738
01:08:17,560 --> 01:08:19,600
I'm predicting that that machine is

1739
01:08:19,600 --> 01:08:23,198
going to fail on Thursday at 5:00 p.m.

1740
01:08:23,198 --> 01:08:25,520
so you better get your service folks in

1741
01:08:25,520 --> 01:08:28,640
to service it on Thursday 2: p.m. or you

1742
01:08:28,640 --> 01:08:31,640
know whatever so you can you know uh

1743
01:08:31,640 --> 01:08:33,479
make decisions on when you want to do

1744
01:08:33,479 --> 01:08:35,319
your maintenance you know and and make

1745
01:08:35,319 --> 01:08:37,640
the best decisions to optimize the cost

1746
01:08:37,640 --> 01:08:41,158
of Maintenance etc etc and then based on

1747
01:08:41,158 --> 01:08:42,120
the

1748
01:08:42,120 --> 01:08:45,000
results that are coming up from the

1749
01:08:45,000 --> 01:08:46,759
predictions so the predictions may be

1750
01:08:46,759 --> 01:08:49,120
good the predictions may be lousy the

1751
01:08:49,120 --> 01:08:51,359
predictions may be average right so we

1752
01:08:51,359 --> 01:08:53,719
are we're constantly monitoring how good

1753
01:08:53,719 --> 01:08:55,439
or how useful are the predictions

1754
01:08:55,439 --> 01:08:57,759
generated by this realtime model that's

1755
01:08:57,759 --> 01:08:59,880
running on the server and based on our

1756
01:08:59,880 --> 01:09:02,679
monitoring we will then take some new

1757
01:09:02,679 --> 01:09:05,319
data and then repeat this entire life

1758
01:09:05,319 --> 01:09:07,040
cycle again so this is basically a

1759
01:09:07,040 --> 01:09:09,238
workflow that's iterative and we are

1760
01:09:09,238 --> 01:09:11,120
constantly or the data scientist is

1761
01:09:11,120 --> 01:09:13,319
constantly getting in all these new data

1762
01:09:13,319 --> 01:09:15,279
points and then refining the model

1763
01:09:15,279 --> 01:09:17,960
picking maybe a new model deploying the

1764
01:09:17,960 --> 01:09:21,679
new model onto the server and so on all

1765
01:09:21,679 --> 01:09:23,920
right and so that's it so that is

1766
01:09:23,920 --> 01:09:26,399
basically your machine learning workflow

1767
01:09:26,399 --> 01:09:29,479
in a nutshell okay so for this

1768
01:09:29,479 --> 01:09:32,080
particular approach we have used a bunch

1769
01:09:32,080 --> 01:09:34,560
of uh data science libraries from python

1770
01:09:34,560 --> 01:09:36,520
so we have used pandas which is the most

1771
01:09:36,520 --> 01:09:38,560
B basic data science libraries that

1772
01:09:38,560 --> 01:09:40,279
provides all the tools to work with raw

1773
01:09:40,279 --> 01:09:42,520
data we have used numai which is a high

1774
01:09:42,520 --> 01:09:44,080
performance library for implementing

1775
01:09:44,080 --> 01:09:46,439
complex array metrix operations we have

1776
01:09:46,439 --> 01:09:49,560
used met plot lip and cbon which is used

1777
01:09:49,560 --> 01:09:52,439
for doing the Eda the explorat

1778
01:09:52,439 --> 01:09:55,560
exploratory data analysis phase machine

1779
01:09:55,560 --> 01:09:57,040
learning where you visualize all your

1780
01:09:57,040 --> 01:09:59,040
data we have used psyit learn which is

1781
01:09:59,040 --> 01:10:01,280
the machine L learning library to do all

1782
01:10:01,280 --> 01:10:02,920
your implementation for all your call

1783
01:10:02,920 --> 01:10:06,000
machine learning algorithms uh we we we

1784
01:10:06,000 --> 01:10:08,000
have not used this because this is not a

1785
01:10:08,000 --> 01:10:11,040
deep learning uh problem but if you are

1786
01:10:11,040 --> 01:10:12,800
working with a deep learning problem

1787
01:10:12,800 --> 01:10:15,360
like image classification image

1788
01:10:15,360 --> 01:10:17,840
recognition object detection okay

1789
01:10:17,840 --> 01:10:20,199
natural language processing text

1790
01:10:20,199 --> 01:10:21,920
classification well then you're going to

1791
01:10:21,920 --> 01:10:24,360
use these libraries from python which is

1792
01:10:24,360 --> 01:10:28,960
tensor flow okay and also py

1793
01:10:28,960 --> 01:10:32,679
to and then lastly that whole thing that

1794
01:10:32,679 --> 01:10:34,719
whole data science project that you saw

1795
01:10:34,719 --> 01:10:36,800
just now this entire data science

1796
01:10:36,800 --> 01:10:38,880
project is actually developed in

1797
01:10:38,880 --> 01:10:41,080
something called a Jupiter notebook so

1798
01:10:41,080 --> 01:10:44,040
all this python code along with all the

1799
01:10:44,040 --> 01:10:46,360
observations from the data

1800
01:10:46,360 --> 01:10:48,679
scientists okay for this entire data

1801
01:10:48,679 --> 01:10:50,440
science project was actually run in

1802
01:10:50,440 --> 01:10:53,360
something called a Jupiter notebook so

1803
01:10:53,360 --> 01:10:55,760
that is uh the

1804
01:10:55,760 --> 01:10:59,080
most widely used tool for interactively

1805
01:10:59,080 --> 01:11:02,360
developing and presenting data science

1806
01:11:02,360 --> 01:11:04,640
projects okay so that brings me to the

1807
01:11:04,640 --> 01:11:07,400
end of this entire presentation I hope

1808
01:11:07,400 --> 01:11:10,360
that you find it useful for you and that

1809
01:11:10,360 --> 01:11:13,199
you can appreciate the importance of

1810
01:11:13,199 --> 01:11:15,280
machine learning and how it can be

1811
01:11:15,280 --> 01:11:19,800
applied in a real life use case in a

1812
01:11:19,800 --> 01:11:23,360
typical production environment all right

1813
01:11:23,360 --> 01:11:27,239
thank you all so much for watching