So, are DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper, and they're talking about the fact that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. There is some truth to the claim, but I'm going to dive into how this all works and what's actually true here.
So you can see right here that the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This actually got the Twittersphere buzzing, because so many people were wondering exactly what on earth this is and how on earth DeepSeek are actually changing the game when it comes to having an AI improve itself.
Now, we all know that DeepSeek are an open research company, an open-source company, and of course that means they publish their research. So right here you can see exactly the research that was published, and it's about inference-time scaling for generalist reward modeling. Trust me, guys, I'm not going to bore you with the specifics of everything; I'll explain this to you in a way that you can understand. But the result we are seeing is that yes, with inference-time scaling, the model does improve.
So what we can see here, and trust me, guys, it's actually not that hard to understand, is that every time you sample the model, the rating becomes more accurate. Let's say, for example, we were asking the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph here shows how accurate those ratings get: one axis shows the judge's performance, and the other shows how many times it was sampled.
Now, this isn't the only graph that shows just how impressive this is, because I think it's really important that they're also showing us the other models here. They show GPT-4o, which is really interesting, although I will say I'm not sure which version of GPT-4o this is, as there are multiple versions of GPT-4o that currently exist. So this model outperforming it with inference-time scaling comes with a bit of a caveat, because GPT-4o isn't actually an inference-time model; I'm guessing they're just showing how good their model gets in terms of overall performance. So I'll explain to you guys how this works, because I think this is probably going to be used in the base of the next model, DeepSeek R2, which is probably going to be DeepSeek's next frontier model, one that could outperform the next level of AI.
So, how does this entire thing work? I'm just going to break it down for you guys. The goal here is to get the AI, let's say in this example ChatGPT, to improve. The way to do that is to train another AI to judge how good the answers are. This judge is called a reward model, and what this paper sets out to create is a very good, versatile judge.
Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers, but when you ask them to judge creative answers, or vice versa, they just don't perform that well. Another thing is that these AI judges don't improve much on the spot: giving them more compute at inference time doesn't necessarily make their judgments much better, because they mostly improve through training. And because these issues are ingrained in the judges we use to improve a model's responses, we of course need a better solution. That's where DeepSeek's solution comes into the picture: the GRM judge.
So, the paper's solution is called DeepSeek-GRM. Now, this is a different kind of judge. Instead of just outputting a score like seven out of ten, their judge writes out its reasoning: it explains why an answer is good or bad based on specific principles, which are rules it comes up with itself. Then, from that reasoning, the score is extracted. The reason they've chosen this design is that it's more flexible and more detailed, and, more importantly, if you ask it to judge the same thing multiple times, it might write slightly different principles or critiques, leading to slightly different scores.
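To make that generate-then-extract idea concrete, here's a minimal sketch in Python. The mock critique text and the "Score: X/10" format are my assumptions for illustration; the paper doesn't mandate any particular output format.

```python
import re

def extract_score(critique: str):
    """Pull the numeric score out of a free-text critique.

    Assumes the judge ends its reasoning with a line like
    'Score: 7/10' -- that exact format is an assumption made
    for this sketch, not something taken from the paper.
    """
    match = re.search(r"Score:\s*(\d+)\s*/\s*10", critique)
    return int(match.group(1)) if match else None

# A mock critique standing in for the GRM's generated reasoning.
critique = (
    "Principle: answers should be factually correct and complete.\n"
    "Critique: the response is accurate but omits edge cases.\n"
    "Score: 7/10"
)
print(extract_score(critique))  # 7
```

The point is that the score is a by-product of the written reasoning, rather than a raw number the model emits directly.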
So what they actually use to train this judge is SPCT (Self-Principled Critique Tuning). This is where they use reinforcement learning, the same family of techniques you'd use to, say, train an AI to play a game. The judge AI practices generating principles and critiques, and if its final judgment matches the correct judgment, it gets a reward and learns to do that more. So over time they're reinforcing the good behavior in this model, and the reason they're doing this is that it teaches the judge AI to generate good principles and critiques that lead to accurate judgments, which over time make it smarter.
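That reward signal can be sketched very simply: the judge scores a set of candidate answers and is rewarded when its top pick matches the known-preferred answer. This is a deliberately simplified, hypothetical version of the training signal, not the paper's exact objective.

```python
def preference_reward(judge_scores, preferred_index):
    """Return 1.0 if the judge's highest-scored answer is the one
    that was actually preferred, else 0.0. A simplified stand-in
    for the reinforcement-learning reward used to train the judge."""
    best = max(range(len(judge_scores)), key=lambda i: judge_scores[i])
    return 1.0 if best == preferred_index else 0.0

# The judge rated three candidate answers; answer 1 was the preferred one.
print(preference_reward([6, 9, 4], preferred_index=1))  # 1.0
print(preference_reward([6, 9, 4], preferred_index=0))  # 0.0
```

In the real setup the "scores" would be extracted from the judge's written critiques, so good critiques and good principles are what get reinforced.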
Now, of course, making it better on the spot is where inference-time scaling comes in. That famous paradigm everyone is talking about is once again making headlines. What they do here is ask the judge multiple times, which is called sampling. So when they want a judgment on an answer, they ask their trained judge multiple times, eight or 32 times. Then they combine those judgments with voting: they collect all the scores from the multiple tries and combine them into, I guess you could say, an average. And all of this is happening at inference time.
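The sample-and-vote step might look like the following sketch, where `judge_once` is a noisy stand-in for one call to the trained judge (the real scores would come from the GRM's critiques):

```python
import random
import statistics

def judge_once(rng: random.Random) -> int:
    """One sampled judgment. We mimic sampling variance by adding
    noise around a 'true' quality of 7; a real call would run the
    GRM judge and extract a score from its written critique."""
    return max(0, min(10, 7 + rng.choice([-2, -1, 0, 1, 2])))

def vote(k: int = 32, seed: int = 0) -> float:
    """Ask the judge k times and average the scores; the video
    mentions k = 8 and k = 32 settings."""
    rng = random.Random(seed)
    return statistics.mean(judge_once(rng) for _ in range(k))

print(vote(k=32))  # averaging damps the sampling noise toward the true quality
```

More samples mean more compute at inference time, but the averaged score is more reliable than any single draw.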
They then have smart combining, which is where they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. Then they combine the scores only from the critiques that the meta RM thought were good. And this is because asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than just asking once. It's much more compute-intensive, but it does give a better result.
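The smart-combining step can be sketched like this. The `meta_rm_score` heuristic here (favoring critiques that state an explicit principle) is purely illustrative; the real meta RM is a small trained model, not a keyword check.

```python
def meta_rm_score(critique: str) -> float:
    """Illustrative stand-in for the meta RM: rate a critique higher
    if it states an explicit principle. The real meta RM is a small
    trained model, not a keyword check."""
    return 1.0 if "Principle:" in critique else 0.0

def combine_with_meta_rm(samples, keep: int = 2) -> float:
    """samples: (critique_text, score) pairs from repeated sampling.
    Keep only the critiques the meta RM rates highest, then average
    their scores -- filtering before voting."""
    ranked = sorted(samples, key=lambda s: meta_rm_score(s[0]), reverse=True)
    kept = ranked[:keep]
    return sum(score for _, score in kept) / len(kept)

samples = [
    ("Principle: be factual. Critique: mostly correct.", 8),
    ("looks fine", 3),
    ("Principle: be complete. Critique: misses a case.", 7),
]
print(combine_with_meta_rm(samples))  # 7.5 -- the weak critique is filtered out
```

Compared with plain averaging (which would include the low-effort "looks fine" judgment), filtering first keeps the final score anchored to the well-reasoned critiques.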
Now, overall, you might be thinking, all right, that's pretty interesting, but what are the results? The results are that this of course performs really well. The AI judge works very well across many different tasks, the ask-multiple-times strategy dramatically improves its accuracy, and the more times they ask it, or the more compute they use, the better it gets. And their judge AI, even though it's medium-sized, could outperform much larger AIs, like GPT-4o used as a judge, when those larger AIs were only asked once. So overall it's looking really good.
And using the meta RM, that tiny helper AI, to pick out the best critiques works even better than simple voting. So they basically built a really smart AI judge that explains its reasoning, they trained it in a really clever way, and, crucially, this judge gets better if you let it think multiple times, which is the same as using more compute. That allows a moderately sized AI judge to achieve top performance when needed.
So, overall, this is once again something really interesting, because we know that DeepSeek are continuing to push the frontier in terms of innovation. And I will say it's really interesting that the tides are shifting: it's not like China is copying the West anymore; they're actually innovating on their own front.
Now, one thing that I do want to talk about is, of course, the fact that R2 is coming. We do know that DeepSeek's next frontier models are going to be coming fairly soon, and with this recent research paper I am wondering if it is going to be part of their next AI release. We know that they are working super hard, and some would argue that they're even ahead of recent frontier labs. I mean, a lot of people know about Meta's Llama 4 and the many, many mishaps around how that model has performed. So DeepSeek are rushing to launch a new AI model, and they are going all-in; this article was from February 2025. So I am wondering when this model is going to be released, because we still do have many other open-source AIs, like Qwen, coming soon, and potentially even GPT-5 and o3 Mini. So it is super important that DeepSeek maintain their momentum, as it is quite hard to stay relevant in the AI space.
Now, they do state that it is possible they will release the model as early as May, because that was the plan: they planned to release it in May, but now they want it out as early as possible. So I wouldn't personally be surprised if we do get DeepSeek R2 potentially at the end of this month. I mean, it's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I do think this could potentially happen, but of course we will have to see.
And individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment in the AI industry. I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. Of course, like I said before, if you watched yesterday's video, you'll know that there was a ton of controversy around the Llama 4 release, with many critics arguing that they gamed the benchmarks and the model just simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.