So, are DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper, which talks about the fact that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. There is some truth to these claims, but I'm going to dive into how this all works and what's really behind them. So
you can see right here that the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." So this actually got the Twittersphere in a frenzy, because so many people were wondering exactly what on earth this is, and how on earth DeepSeek are actually changing the game when it comes to having an AI improve itself. Now, we all know that DeepSeek are an open-research, or open-source, company, and of course that means they publish their research. So right here you can see exactly the research that was published; it's titled "Inference-Time Scaling for Generalist Reward Modeling." Now,
trust me, guys, I'm not going to bore you with the specifics of everything; I'll explain this in a way that you can understand. But the result we are seeing is that yes, with inference-time scaling, the model does improve. So what we can see here, and
trust me, guys, it's actually not that hard to understand, is that the more times you sample the model, the more accurate the rating becomes. Let's say, for example, we were asking the AI assistants, which are the reward models in this case, to rate how good this AI's response was. The graph here shows how accurate those results get: one axis shows the AI's performance, and the other shows how many times it tried. Now, this isn't
the only graph that shows just how significant this is, because I think it's really important that they're also showing the other models here. They show GPT-4o, which is really interesting, although I will say I'm not sure which version of GPT-4o this is, as there are multiple different versions of GPT-4o currently in existence. So this model outperforming it with inference-time scaling comes with a bit of a caveat, because GPT-4o isn't actually an inference-time reasoning model; I'm guessing they're just showing how good the model gets in terms of overall performance. So actually, we'll
explain to you guys how this works, because I think this is probably going to be used in the base of the next model, DeepSeek R2, which is likely to be DeepSeek's next frontier model and could outperform the next generation of AI. So
how this entire thing works, and I'm just going to break this down for you guys, is as follows: the goal is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train another AI to judge how good the answers are. This judge is called a reward model, and this paper basically wants to create a very good, versatile judge.
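Just to make the judge idea concrete, here's a minimal sketch of the interface a reward model exposes. The function name and the toy word-overlap scoring rule are my own illustration, not DeepSeek's code; a real reward model is a trained neural network:

```python
# A reward model is just a function from (prompt, response) to a score.
# Toy stand-in: reward responses that share words with the prompt.
def reward_model(prompt: str, response: str) -> float:
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return len(prompt_words & response_words) + 0.01 * len(response_words)

# During training, the main AI's answers are ranked by the judge's score.
candidates = [
    "The capital of France is Paris.",
    "I don't know, sorry.",
]
best = max(candidates, key=lambda r: reward_model("What is the capital of France?", r))
print(best)
```

The point is only the shape of the interface: answers go in, a scalar judgment comes out, and the main model is pushed toward the answers the judge prefers.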
Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers, but when you have them judge creative answers, or vice versa, they just don't perform that well. Another thing is that these AI judges don't improve that much on the spot, so giving them more compute at inference time doesn't necessarily make their judgment much better; they mostly improve through training. And because these issues are ingrained in the judges, when we're trying to improve a model's responses with these reward models, we of course have to come up with a better solution. And that's where DeepSeek's solution comes into the picture. This is
where we get the GRM judge. The paper's solution is called DeepSeek-GRM, and this is a different kind of judge. Instead of just outputting a score like 7 out of 10, this judge writes out its reasoning and explains why an answer is good or bad based on specific principles, which are the rules it comes up with itself.
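As a rough sketch of that write-the-reasoning-then-score pattern: the `generate_critique` function below is a hypothetical stand-in for the actual LLM call, and the output format is my own assumption:

```python
import re

def generate_critique(prompt: str, response: str) -> str:
    # Hypothetical stand-in for the GRM's generation step: a real GRM
    # samples principles and a critique token by token, ending in a score.
    return ("Principles: factual accuracy, clarity, completeness.\n"
            "Critique: the answer states the correct fact concisely.\n"
            "Score: 8/10")

def extract_score(critique: str) -> int:
    # The numeric score is pulled out of the free-form reasoning text.
    match = re.search(r"Score:\s*(\d+)", critique)
    return int(match.group(1)) if match else 0

critique = generate_critique("What is 2+2?", "4")
print(extract_score(critique))  # prints 8
```

So the score is not produced directly; it rides along at the end of the generated reasoning and gets parsed out afterwards.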
Then, from that reasoning, the score is extracted. The reason they've chosen to do this is that it's more flexible, it's more detailed, and more importantly, if you ask it to judge the same thing multiple times, it might write slightly different reasons or principles, leading to slightly different scores. And so, what
they actually do is train this judge with SPCT (Self-Principled Critique Tuning). This is where they use reinforcement learning, the same kind of approach you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and if its final judgment matches the correct judgment, it gets a reward, and it learns to do that more often. So over time they're reinforcing the good behavior from this model, and that teaches the judge AI to generate good principles and critiques that lead to accurate judgments, which over time make it smarter.
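The reward signal in that loop can be sketched roughly like this. This is my own simplification, not the paper's exact setup: the rule here is just that the judge gets a positive reward when the response it scores highest matches the known-best response.

```python
def rl_reward(judge_scores: list[float], correct_best: int) -> float:
    # judge_scores[i] is the score the judge gave candidate response i,
    # extracted from its generated critique. The judge is rewarded only
    # when its top-ranked candidate is the ground-truth best one.
    predicted_best = judge_scores.index(max(judge_scores))
    return 1.0 if predicted_best == correct_best else -1.0

print(rl_reward([7.0, 9.0, 4.0], correct_best=1))  # prints 1.0
print(rl_reward([7.0, 9.0, 4.0], correct_best=0))  # prints -1.0
```

That binary signal is what reinforcement learning then uses to make reasoning styles that lead to correct judgments more likely.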
Of course, making it better on the spot is where inference-time scaling comes in. Yes, that famous paradigm everyone is talking about is once again making waves. What they do here is ask the judge multiple times, which is called sampling: when they want a judgment on an answer, they ask their trained judge multiple times, say eight or 32 times. Then they combine those judgments by voting: they collect all the scores from the multiple tries and combine them into, I guess you could say, an average. And all of this is happening at inference time.
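That sample-and-vote step is simple to sketch. The score distribution below is just a hypothetical stand-in for the judge's slightly varying outputs:

```python
import random

def sample_judgment(rng: random.Random) -> int:
    # Stand-in for one sampled run of the trained judge: each sample can
    # produce slightly different principles, and so a different score.
    return rng.choice([7, 8, 8, 9])

def vote(num_samples: int, seed: int = 0) -> float:
    # Ask the judge num_samples times (e.g. 8 or 32) and average the scores.
    rng = random.Random(seed)
    scores = [sample_judgment(rng) for _ in range(num_samples)]
    return sum(scores) / len(scores)

print(vote(32))  # an average somewhere between 7 and 9
```

More samples means the noise in any single judgment averages out, which is exactly why spending more compute at inference time helps here.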
They then have smarter combining: they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. They then combine the scores only from the critiques that the meta RM thought were good. Asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than just asking once. It is much more compute-intensive, but it does give a better result.
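Here's a rough sketch of that meta RM filtering step. The 50% keep fraction and the exact selection rule are my own assumptions for illustration, not necessarily what the paper does:

```python
def meta_rm_combine(scores: list[float], meta_scores: list[float],
                    keep_fraction: float = 0.5) -> float:
    # meta_scores[i] is the meta RM's quality rating for critique i.
    # Keep only the judgments whose critiques the meta RM rated highest,
    # then average the corresponding scores.
    k = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(zip(meta_scores, scores), reverse=True)
    kept = [score for _, score in ranked[:k]]
    return sum(kept) / len(kept)

# Four sampled judgments; the meta RM likes the 1st and 3rd critiques.
print(meta_rm_combine([8.0, 3.0, 9.0, 2.0], [0.9, 0.2, 0.8, 0.1]))  # prints 8.5
```

Compared with plain averaging, this stops a few badly reasoned critiques from dragging the final score around.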
Now, you might be thinking: alright, that's pretty interesting, but what are the results? The results are that this, of course, performs really well. The AI judge works very well across many different tasks; the ask-multiple-times strategy dramatically improves the AI's accuracy, and the more times they ask it, that is, the more compute they use, the better it gets. And their judge AI, even though it's medium-sized, could outperform much larger AIs like GPT-4o used as a judge, if those larger AIs were only asked once. So overall, it's looking really good. And
using the meta RM, that tiny helper, to pick out the best critiques works even better than simple voting. They basically built a really smart AI judge that explains its reasoning, they trained it in a really clever way, and crucially, this judge gets better if you let it think multiple times, which is just the same as using more compute. This allows a moderately sized AI judge to achieve top performance when needed. So overall, this
is once again something really interesting, because we know that DeepSeek are continuing to push the frontier in terms of innovation. And I will say, it's really interesting to see that the tides are shifting: it's not like China is copying the West anymore; they're actually innovating on their own front. Now, one thing that I do
want to talk about is, of course, the fact that R2 is coming. We do know that DeepSeek's new frontier models are going to be coming fairly soon, and with this recent research paper, I'm wondering if it's going to be part of their next AI release. We know that they're working super hard, and some would argue they're even ahead of recent frontier labs. I mean, a lot of people know about Meta's Llama 4 and how there have been many different mishaps with regards to how that model has performed. So DeepSeek are rushing to launch a new AI model, and they're going to be going all-in; that article was from February 2025. So I'm wondering when this model is going to be released, because we still have many other open-source AIs like Qwen coming soon, and potentially even GPT-5 and o3-mini. So it is super
important that DeepSeek maintain their momentum, as it is quite hard to stay relevant in the AI space. Now, they do state that it's possible they'll be releasing the model as early as May; that was the original plan, to release it in May, but now they want it out as early as possible. So I personally wouldn't be surprised if we get DeepSeek R2 at the end of this month. I mean, it's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I do think this could potentially happen, but of course we'll have to see. And
individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment in the AI industry. And I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. Of course, as I said before, if you watched yesterday's video you'll know that there was a ton of controversy around the Llama 4 release, with many critics arguing that they gamed the benchmarks and the model simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.