-
Okay. In this video, we'll be discussing
-
how we can implement linear
-
regression in Splunk MLTK, okay? So
-
in my previous video, we have seen how we
-
can install Splunk MLTK and its
-
related packages, right? And also if you
-
remember when I was discussing the
-
machine learning core algorithm, I
-
also introduced the core dataset we'll
-
be using for our linear regression
-
modeling, okay?
-
That's the graduate admission dataset
-
where, for various students, we
-
have their GRE score, TOEFL score,
-
university rating, statement of purpose
-
rating,
-
reference (LOR) rating, CGPA, whether
-
they have done research or not. Based on
-
all these fields, we will try to
-
predict the chance of admit, okay? So now
-
we will be implementing linear
-
regression for this one and see how well
-
the model fits this particular data,
-
okay? So to implement linear
-
regression, what you have to do is
-
go to the Splunk Machine Learning
-
Toolkit, okay? As I stated before, the
-
landing page of the Machine Learning
-
Toolkit app is this Showcase dashboard,
-
right? It has basically a lot of
-
examples based on the
-
different machine learning
-
algorithms Splunk supports, okay? Now to
-
implement machine learning on your
-
own dataset, you need to come
-
to the Experiments tab, okay? So now if you do
-
not have any other models or if it is
-
the first time you are coming to this
-
particular dashboard, this will be the
-
default view, okay? But if you have
-
already experimented on different models,
-
the view will be slightly different
-
which we'll see later, okay? So now, as in
-
linear regression we are trying to do a
-
prediction on a numeric field, right?
-
So we will go over here, okay?
-
Predict Numeric Fields.
-
We're clicking over here. Now it is
-
asking me for an experiment title and a
-
description. So I will say graduate
-
admission prediction. Let's give the
-
experiment title like this,
-
okay? Now you should give a meaningful
-
description as well.
-
So I'll click on Create, okay?
-
So now this particular view comes up
-
over here. Now, if you see here, here we
-
have two tabs, experiment settings and
-
experiment history. Initially the
-
experiment history will be blank,
-
there is nothing over here, okay? Now
-
based on the experiment settings,
-
experiment history will be updated
-
accordingly, which we will see later, okay?
-
Now, the first thing is it is asking me
-
for a search, right? So now let me
-
show you the data. This
-
particular data I have already indexed in my
-
main index. Okay, so I'll just write the
-
query index=main and just
-
table all my different
-
features and chance of admit, okay? So this
-
is my dataset. I will
-
be using this dataset for training, but not
-
the full dataset, not all the 500
-
records. Some of the data I will
-
use for training, and the rest
-
of the data I will use for
-
prediction, just to see how my
-
model is working, okay? So I'll give this
-
query over here, and then I'll click on
-
search, okay? So by default, it is showing me
-
my data, the initial data preview,
-
right? Now, let's go to the next one. So
-
here if you see, there are a lot of
-
pre-processing steps over here, right? So
-
now in machine learning, when you train a
-
particular model, there
-
may be some need to pre-process
-
that data so that you reduce a lot
-
of noise in the data. Now, there are a
-
lot of pre-processing algorithms
-
present over there. So when we
-
discuss those algorithms, we'll come back
-
to this page again and work on it, okay?
-
So for now, I will not be doing any kind
-
of pre-processing because this data is
-
clean enough, okay? So now the
-
algorithm I will be choosing is linear
-
regression. Now, there are a lot of
-
regression algorithms, but so far we have
-
studied only linear regression, and we'll
-
be implementing only linear regression
-
in this video, so I will be choosing
-
linear regression over here, okay? So now
-
'Field to predict' means which field
-
you want to predict. As I will be
-
predicting my chance of admit, I will
-
be choosing that. And then 'Fields to use
-
for predicting'. That means here basically
-
you are choosing your features, right? So
-
I will be choosing all my columns. So
-
here if you see, the concept of simple
-
linear regression and multiple linear
-
regression comes up, right? If I choose a
-
single feature, it will become a simple
-
linear regression. If I choose multiple
-
features, it will become a multiple linear
-
regression. So for now, I will be choosing
-
all of them, okay? Now here if you see, the
-
split for training, right? So here
-
basically what is happening is you
-
are splitting the whole dataset between
-
a training and a test dataset.
-
Currently it is 50 percent, 50 percent.
-
That means the first 50 percent of the data
-
will be used for training and the remaining
-
50 percent will be used for testing
-
purposes. I'll slide this one, it goes like
-
this, and I'll keep 70 and 30, okay? Now,
-
fit intercept, okay? That means, if you
-
remember from my machine learning video,
-
not only do we have the slope value for
-
each and every feature, we also have an
-
intercept, the y-axis intercept.
-
So by this option, you are
-
basically choosing whether
-
your model should include an implicit
-
intercept term or not, okay? Now, notes:
-
you can give some meaningful
-
notes. Maybe the notes could be like what
-
are the fields you are using for
-
prediction, or some other
-
meaningful note which will be useful
-
later when we see the history of the
-
model, okay? So I will say 'using all the
-
features', okay? Now
-
after all is done, you need to click on
-
Fit Model. Behind the
-
scenes, what it does is run a Splunk
-
custom command which basically
-
implements [inaudible]. So
-
using that particular command, it is
-
trying to come up with the equation of
-
that line, right? Which we discussed
-
before. And if you remember from my
-
multiple linear regression video, we came
-
up with a linear algebra solution over
-
there, right? With matrix inversion
-
and matrix transpose, right? So behind the
-
scenes it is doing the same thing over
-
there, okay?
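
The linear algebra solution mentioned here can be sketched in plain Python. This is only a minimal illustration of the normal equation, beta = (XᵀX)⁻¹ Xᵀy, not the code MLTK actually runs; the helper names and the toy data are assumptions made for the example, and the system is solved by elimination rather than by forming the inverse explicitly.

```python
def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_linear_regression(features, target):
    """Return [beta_0, beta_1, ..., beta_p] from the normal equation."""
    X = [[1.0] + list(row) for row in features]  # leading 1.0 is the intercept column
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(x * y for x, y in zip(col, target)) for col in Xt]
    return solve(XtX, Xty)


# Toy data (hypothetical), generated from y = 1 + 2*x1 + 3*x2.
features = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
target = [1 + 2 * a + 3 * b for a, b in features]
beta = fit_linear_regression(features, target)
print([round(b, 6) for b in beta])
```

Leaving out the leading column of ones corresponds to switching the fit intercept option off: the model is then forced through the origin.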
-
So now if you see, the result came up,
-
right, after clicking on the fit model.
-
Now if you see, apart from our own data,
-
it has actually added two new columns over
-
here. One is the predicted chance of
-
admit and the other is the residual column, right? Now,
-
predicted chance of admit is
-
the prediction the model made on the data,
-
right? So if you see for the first row,
-
the actual chance of admit is 0.73,
-
that means 73%. Now the predicted
-
was 0.70, that means 70 percent. Now,
-
the residual column is the difference
-
between the actual chance of admit and
-
the predicted chance of admit, okay?
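
As a small sketch of that definition, the residual is simply actual minus predicted. Only the first pair of values (0.73 and 0.70) comes from the row described above; the other rows and the variable names are made up for illustration.

```python
# First pair is the row discussed above; the remaining pairs are hypothetical.
actual = [0.73, 0.92, 0.65]
predicted = [0.70, 0.95, 0.65]

# residual = actual - predicted, rounded to two decimals for display.
residuals = [round(a - p, 2) for a, p in zip(actual, predicted)]

for a, p, r in zip(actual, predicted, residuals):
    print(f"actual={a:.2f} predicted={p:.2f} residual={r:+.2f}")
```

A positive residual means the model under-predicted that row, and a negative one means it over-predicted.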
-
So this is how, after fitting the model,
-
it came up with this kind of
-
visualization.
-
It also shows another five to
-
six charts over here, okay? Now let us
-
discuss these one by one. The first
-
chart shows me the actual versus
-
predicted line chart. That means
-
if you see the chance of admit, the blue
-
colored graph, is the actual one, and the
-
predicted chance of admit, the yellow
-
colored one, is the predicted one, right?
-
And by looking at this one, we can
-
at least see this particular model is an
-
okay fit to this particular data.
-
In some places it is lagging, if you
-
see it, right? But overall it's
-
actually fitting well over there. Now the
-
residual chart is what you are seeing
-
over here, the line chart
-
showing up over here, okay? So now the
-
closer this particular chart is
-
to zero, the better the model is
-
fitting. But over here,
-
if you see the latter part of this one,
-
the residuals are larger, right? Because it
-
is more spread out, at a greater distance from the
-
zero line. And the same thing is
-
reflecting over here as well. The model
-
has some kind of lagging over here, right?
-
So this kind of analysis you can do
-
from there, how well the model is fitting
-
your data. And this particular graph is
-
showing me the scatter plot of the
-
actual and the predicted one. And here
-
basically you can see how the line
-
is fitting your data over here through
-
this chart, okay? Now, it also provides a
-
residual histogram. Let us
-
understand this one as well. We
-
have the zero line over here, if you
-
see. It basically shows, for each and
-
every residual value, how many samples
-
there are. So if you just
-
think about it, if for all my data points
-
this residual is zero, that's the
-
ideal scenario, right? That means I am
-
predicting the [inaudible], right?
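
A rough sketch of how such a histogram can be built: group the residuals into bins and count the samples in each bin. The bin width, the function name, and the sample residuals below are assumptions for illustration, not values from the dashboard.

```python
from collections import Counter


def residual_histogram(residuals, bin_width=0.05):
    """Count residuals per bin; each key is the centre of its bin."""
    counts = Counter(round(r / bin_width) for r in residuals)
    return {round(k * bin_width, 10): counts[k] for k in sorted(counts)}


# Hypothetical residuals: most are near zero, a couple are further away.
sample_residuals = [0.01, -0.02, 0.0, 0.06, -0.11, 0.02]
hist = residual_histogram(sample_residuals)
for centre, count in hist.items():
    print(f"{centre:+.2f}: {'#' * count}")
```

A tall bar at the zero bin with short bars elsewhere is the shape you would hope to see for a well fitting model.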
-
So from this histogram, if you see,
-
where the residual error
-
equals zero, the sample count is 24
-
[inaudible], right? If more and more
-
samples are very close to zero, that
-
means my model is doing well, it's
-
actually a good fit. And if it is
-
more spread out,
-
that means if we have more big
-
bars over here, that means the
-
model is not a good fit for
-
that particular data. So this kind of
-
interpretation you can do from this
-
particular diagram, okay? So now there are
-
another two things over here: the R squared
-
statistic and the root mean squared
-
error, okay? These two are measures
-
of how accurate the model is, okay? So
-
I'll be discussing these measurements in
-
detail in a separate video.
-
There we will be discussing the R squared
-
statistic, root mean squared error, and also some
-
other ways to determine how
-
accurate the model is, such as bias and
-
variance. There are a lot of other
-
measurements as well, which
-
we'll discuss in detail over there, okay?
-
But for now, just try to remember
-
that this is a measurement of fit.
-
For the R squared statistic, we
-
can think of it like this: the closer it is to 1, the
-
better the fit. Something like this, okay?
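
For reference, here is a minimal sketch of how these two measures are usually computed from the actual and predicted values. The function names and the toy numbers are assumptions for the example; R squared is taken as 1 minus the ratio of residual sum of squares to total sum of squares, and RMSE as the square root of the mean squared residual.

```python
import math


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot; equals 1.0 for a perfect fit."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values (hypothetical).
actual = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]   # predictions that match exactly
baseline = [2.5, 2.5, 2.5, 2.5]  # always predicting the mean

print(r_squared(actual, perfect), rmse(actual, perfect))
print(r_squared(actual, baseline), rmse(actual, baseline))
```

Always predicting the mean gives an R squared of 0, which is why values closer to 1 are read as a better fit, while for RMSE smaller is better.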
-
So we will see how best to
-
judge a model based on that, okay? But
-
even for the R squared statistic,
-
it all depends on the context, the field
-
in which you are implementing
-
linear regression. We'll discuss
-
those stuff as well in future, okay? And now,
-
if you see the last graph, it is showing
-
me the model parameters. If you remember
-
the big equation we had written
-
over there, right? So let me open Bamboo
-
Paper here. If you remember, when we
-
talked about multiple linear regression,
-
we started our discussion
-
with a big equation, right? So let me go
-
back over there.
-
Yes, so this one, right? So here beta 1,
-
beta 2, up to beta p are our slope values, the
-
coefficients of each and every feature.
-
And beta 0 is my intercept, right? And
-
what we did, basically, at the
-
end of the day, was come up with a big
-
equation to determine this whole beta
-
vector, right? So this is the same thing
-
it is representing over here. It is
-
basically giving me, for each and
-
every feature, the coefficient
-
value, okay? And the intercept value as
-
well. If you see, this is my beta 0, and
-
beta 1 to beta p are these
-
other ones. Now, if you see it closely,
-
some of the coefficients
-
have larger values and some of the
-
coefficients have very small values
-
over here. The way to interpret a
-
coefficient is as how much that feature is
-
influencing the end result.
-
So to understand that, let us see this
-
one. Let's say I have a variable called 'x'
-
and I am writing something like x = 0.9y.
-
Now, what do I mean by this particular
-
equation? 0.9 times 'y', right? So that means
-
if I set 'y' equal to 1,
-
my 'x' will become 0.9, right? So what do
-
we mean by that? That means a one unit
-
change in 'y'
-
basically gives a 0.9 unit
-
change in 'x', right? So this kind of
-
interpretation you can do. So that
-
means how 'y' is influencing 'x', right? So
-
this is how we are interpreting this
-
kind of coefficients as well in linear
-
regression. So that means we will know
-
from the coefficients themselves which
-
particular feature is influencing the result most,
-
at least when the features are on comparable scales. And now if you see it over here,
-
I think
-
CGPA is the most influential factor in
-
determining whether my chance of admit
-
is higher
-
or not, right? Considering we are
-
implementing a linear regression, there
-
could be a better fit of this particular
-
data which we need to experiment and see.
-
But for the current linear
-
regression implementation, we can
-
conclude this kind of stuff over here,
-
right? So this is how the model
-
parameters summary table
-
visualization is telling me those
-
details, right? So now if
-
you see, we have actually fit our model, right?
-
So we still [inaudible] our model, and
-
until we save it, it stays a draft. That's
-
why it is showing me a draft status for
-
your model, right? And you can now go to
-
experiment history to see what you have
-
done till now. So it will be maintaining
-
a history over there. So now I can see
-
using all these features my R
-
squared statistic is somewhere around 78%,
-
and these are my coefficients, and I am
-
coming up with a conclusion that maybe
-
CGPA is the most influential factor over
-
here, okay? So let us do another
-
experiment, okay? So here, I'll keep my
-
CGPA just to see whether it is
-
actually true or not, okay? So now what I
-
will do here is I will keep CGPA,
-
I'll keep the [inaudible], I will keep
-
the LOR, okay? I'll
-
keep the research one, and I will keep the
-
GRE score, okay? So I'll click over here
-
again. I will keep the GRE score. I will
-
remove the TOEFL score. I will remove the
-
university rating. I will remove the SOP.
-
CGPA, Research, and LOR I will keep. So
-
now I am trying to do this experiment
-
with four features which I am thinking
-
may be the most influential ones. So maybe the
-
other features do not have much impact
-
on this particular prediction, okay?
-
So now I'll keep a note, 'using
-
only four features'. So this is how this
-
particular note comes in handy
-
over here, right? When I see
-
the history, I will come to know what I
-
have done over there, okay? So I will
-
click on 'Fit Model' again. Let's see how
-
it's working now. So similar
-
stuff is happening over there. It's
-
running the custom commands.
-
In later videos, we will discuss
-
those custom commands in detail
-
as well, okay? Okay, so now if
-
you see, it again predicted that one. Now
-
if you see from the actual versus predicted line
-
chart, it's more or less the same
-
even though I removed three features,
-
right? Even this one as well, more or less,
-
okay? Now if you see, my R squared
-
statistic has improved a lot, to 82%,
-
right? So by this one, at least I am
-
confident that really those three
-
features are not impacting it much.
-
And if you see from this residual
-
histogram, more
-
and more samples are very close to zero,
-
right? More
-
residual errors are very, very
-
close to zero, right?
-
So by this kind of analysis, we can say
-
this particular model is better
-
compared to my previous model, right? So
-
now what I will do is I will save this
-
particular model, okay? So I will save, I
-
will give the experiment title as
-
'graduate_date_predictor', okay? I will
-
click on save. So now a model will
-
be created, okay? So now we have
-
two options over here after you save the
-
model. Either you can go to
-
the listing page
-
or you continue editing, okay? Let us
-
continue editing to see how experiment
-
history is looking now. Now experiment
-
history has two rows over there, okay?
-
The first row is the current
-
experiment with my four features, right?
-
With R squared value of 82%. The second
-
row is telling me my older one, right?
-
So at any point of time, you can load
-
the corresponding settings and
-
experiment with it, okay?
-
It will also show you the data
-
corresponding to each experiment,
-
okay? So now let's go back to our
-
experiment tab and see what is happening
-
over there, okay? Now if you see my
-
experiment tab, it's not showing me
-
those big blocks, right? It is
-
showing this kind of view where I
-
have Predict Numeric Fields, with the single
-
experiment I have done. I have given the
-
experiment name like this one, right?
-
The algorithm I have chosen is linear
-
regression. There are a lot of actions you
-
can do on this particular model, so
-
before publishing, let us talk about
-
those, okay? You can create an alert
-
from this model. Suppose
-
the model is predicting data, right? So
-
you can create an alert,
-
something like when my predicted chance
-
of admit is greater than 90 percent, that
-
means 0.9, okay? Or 99 percent, maybe. That means
-
the model is really, really working well
-
over there, right? So this kind of alert
-
you can do, okay? Next you can edit the
-
title and description. That's simple
-
enough. Next you can see 'Schedule
-
Training'. This is an interesting feature.
-
Whatever we have done till now,
-
we have done as manual training over
-
there, right? Now, with the scheduled
-
training feature, you can create a
-
schedule which will run training
-
based on the data. Now, if you see, there
-
is a time range over there. So you can
-
choose the time range of the data you
-
want to use for training purpose, okay?
-
That's a really interesting feature you
-
have, so that means as more and more
-
data comes into your system, you can use
-
that data for training
-
automatically, using
-
scheduled training, okay? And
-
similarly, the other scheduling settings, the
-
schedule priority and schedule window,
-
you can set up as well. You can even
-
trigger an action when the
-
scheduled run happens: you
-
can log an event, you can send the
-
output to a lookup, and so on. This is
-
normal scheduling, okay? That
-
you can also do over here. So this is a
-
very versatile feature of the
-
model. And you can delete
-
it as well, that's fine. So now we will
-
publish this model, okay?
-
Let's say 'chances_of_admit_model', okay?
-
This is the model name, and the
-
destination app you will be choosing
-
over here, so the model will be saved over
-
there, okay? I will be choosing my search
-
and reporting app, I will click on submit,
-
okay? So the model is created now. So how
-
is the model stored in the background?
-
It's basically a lookup file, so let us
-
see that, okay? So from the Splunk home, go to
-
etc, apps, search,
-
lookups. Okay, so currently, if you see
-
over here,
-
by default the model is saved
-
in the user context, so that's why it
-
is not coming up under search. So
-
what I need to do is go to etc,
-
then go to users. Currently I'm
-
the admin user, so go to admin, and
-
then go to the search app. And here in
-
the lookups folder, this is how the model
-
is getting stored over there, okay? So I
-
think this lookup is in read-only format,
-
so if I just open it in Notepad, this is
-
how it looks. So this is
-
basically saving a lot of information,
-
the metadata about
-
the model over here, okay? What the
-
feature variables are, whatever columns I
-
have in my data, okay? All of these things
-
[inaudible] other
-
features which we do not have any
-
control over, it is saving over there,
-
okay? So now we created our own model,
-
right? Now we need to apply it, right? How
-
are we going to apply it? There is a
-
command called apply in Splunk MLTK,
-
okay? By using that command, you can
-
apply that particular model to any dataset,
-
okay? Here specifically we'll be
-
applying it to the same dataset, because if you apply
-
that model to any [inaudible] dataset,
-
it is not
-
going to give you proper results. So
-
this is how you will be applying the
-
model. So this is my dataset, my
-
base dataset, right? I'll just
-
choose [inaudible] last hundred records,
-
okay? Let's say the last 200 records, okay? Now I
-
will be using the apply command. Don't
-
worry about it, I will be discussing these
-
Splunk MLTK commands in detail in
-
my next video. So here we will just see
-
how we are applying the model. So
-
now I will type my apply command, then my
-
model name, right? So we have given our
-
model name as 'chances_of_admit_model'.
-
I'll just copy it, okay? And I will just run it.
-
So what it should do, basically, is
-
apply this particular model.
-
Okay, so it is saying permission denied
-
now. So for that, what I need
-
to do is go to Settings, Lookups,
-
Lookup table files, I'll choose this one, search
-
and reporting, okay? This is my chances of
-
admit model. Currently it is in private
-
mode, that's why I am not able to apply
-
it from the search app. So I'll choose
-
'This app only', give read and write for now,
-
and click on save,
-
okay? Internal error, data could not be
-
written on to- okay. So let me see what's
-
going on over there. Okay so I think
-
there was some technical glitch, so I
-
just set the permission again. And
-
this time I chose all apps, and I think it works
-
now. So now let us see whether our search
-
is working or not.
-
Okay, so I have taken the last 200
-
records and I'm just running apply with
-
the machine
-
learning model. So if you see,
-
it is applying that model to these
-
two hundred records, two
-
hundred events over there, and it has
-
created a new column called predicted
-
chances of admit, okay? So this is how we
-
are applying that model. You can even
-
create your own alert using this
-
particular command as well, so that
-
whenever you want something
-
like chance of admit is more than 90
-
percent, 80 percent, or any other
-
[inaudible] you want, you can use this
-
particular command to achieve that
-
same thing over there, okay? So this is
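
Conceptually, applying a saved linear regression model means computing intercept plus the sum of coefficient times feature value for each row, and an alert is just a threshold on that predicted column. The sketch below shows only that arithmetic; it is not how the MLTK apply command is implemented, and the coefficient values, column names, and rows are hypothetical.

```python
def apply_linear_model(rows, intercept, coefficients):
    """Return copies of the rows with an added 'predicted' column."""
    scored = []
    for row in rows:
        prediction = intercept + sum(
            coefficients[name] * row[name] for name in coefficients
        )
        scored.append({**row, "predicted": prediction})
    return scored


# Hypothetical model parameters and input rows, for illustration only.
model = {"intercept": -1.0, "coefficients": {"CGPA": 0.2, "Research": 0.05}}
rows = [
    {"CGPA": 9.5, "Research": 1},
    {"CGPA": 8.0, "Research": 0},
]

scored = apply_linear_model(rows, model["intercept"], model["coefficients"])

# Alert-style condition: keep rows whose predicted chance exceeds 0.9.
alerts = [row for row in scored if row["predicted"] > 0.9]
print(alerts)
```

This also shows why the model only gives sensible output on data with the same feature columns it was trained on: every coefficient needs a matching field in each row.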
-
how you can experiment with machine
-
learning, specifically the linear
-
regression, in Splunk MLTK. And we
-
saw a lot of experiments we have
-
done regarding this one, right? So this
-
is how you experiment with your own data as
-
well and see how well the model fits
-
your data, and you can achieve a lot of
-
other things like automatic training and
-
creating alerts from these things as
-
well, okay? In the next video, we will go into
-
more detail, we will basically deep dive
-
into what is happening internally
-
over here. We will talk about the different
-
Splunk commands running internally, the
-
custom commands running internally. And
-
whatever we have done, this experiment we
-
have done from the UI, the same thing can
-
be achieved from the search
-
command
-
as well, from Splunk SPL, okay? See
-
you in the next video.