So, are DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper: last Friday, DeepSeek released a paper on self-improving AI models. Now, there are a few caveats here. There is some truth to the claim, but I'm going to dive into how this all works and what's really behind it. You can see right here that the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This got the Twittersphere buzzing, because so many people were wondering what exactly this is and how DeepSeek are changing the game when it comes to having an AI improve itself. Now, we all know that DeepSeek are an open-research, open-source company, which means they publish their research. Right here you can see the research in question: it's titled "Inference-Time Scaling for Generalist Reward Modeling." Trust me, I'm not going to bore you with every specific; I'll explain this in a way you can understand. The headline result is that yes, with inference-time scaling, the model does improve. What we can see here, and it's really not that hard to follow, is that every time you sample the model, the rating becomes more accurate. Say, for example, we ask the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph shows how accurate those ratings get: one axis shows the AI's performance, and the other shows how many times it tried. And this isn't the only graph showing how impressive this is, because importantly they also compare against other models here, including GPT-4o, which is really interesting. Although I will say I'm not sure which
version of GPT-4o this is, as there are multiple versions of GPT-4o currently in existence. So outperforming it with inference-time scaling comes with a caveat, because GPT-4o isn't actually an inference-time reasoning model; I'm guessing they're simply showing how good their model gets in terms of overall performance. So let me explain how this works, because I think it will probably form the basis of the next model, DeepSeek R2, which is likely to be DeepSeek's next frontier model and could outperform the next level of AI. Here's how the whole thing works, broken down simply. The goal is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train another AI to judge how good the answers are. That judge is called a reward model, and this paper sets out to build a very good, versatile judge. Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging maths answers but fall down on creative answers, or vice versa. Number two, these AI judges don't improve much on the spot, so giving them more compute at inference time doesn't necessarily make their judgments better; mostly they only improve through more training. Because these issues are ingrained in the judges we use to improve a model's responses, we need a better solution, and that's where DeepSeek's comes into the picture: the GRM judge. The paper's solution is called DeepSeek-GRM, and it's a different kind of judge. Instead of just outputting a score like seven out of ten, this judge writes out its reasoning, explaining why an answer is good or bad based on
specific principles, which are rules the judge comes up with itself. The score is then extracted from that reasoning. The reason they've chosen this design is that it's more flexible and more detailed, and importantly, if you ask it to judge the same thing multiple times, it may write slightly different principles and reasoning, leading to slightly different scores. They train this judge with SPCT, Self-Principled Critique Tuning, which uses reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and when its final judgment matches the correct judgment, it gets a reward, so it learns to do that more often. Over time they're reinforcing the good behavior, teaching the judge AI to generate principles and critiques that lead to accurate judgments, which makes it smarter. Making it better on the spot is where inference-time scaling comes in, that famous paradigm everyone is talking about, once again making headlines. What they do is ask the judge multiple times, which is called sampling: when they want a judgment, they query their trained judge eight or 32 times. Then they combine those judgments by voting, collecting all the scores from the multiple tries and combining them into, I guess you could say, an average. All of this happens at inference time. On top of that they have smarter combining: they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge is. Then they combine the scores only from the critiques that the meta RM
thought were good. Asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than asking once. It is much more compute-intensive, but it gives a better result. Now, you might be thinking, all right, that's pretty interesting, but what are the results? The results are that this performs really well: the AI judge works well across many different tasks, the ask-multiple-times strategy dramatically improves accuracy, and the more times they ask, i.e. the more compute they use, the better it gets. Their judge AI, despite being only medium-sized, could outperform much larger AIs like GPT-4o used as a judge, if those larger AIs were only asked once. And using the meta RM, that tiny helper, to pick out the best critiques works even better than simple voting. So they've built a really smart AI judge that explains its reasoning, trained it in a clever way, and crucially, this judge gets better if you let it think multiple times, which is the same as spending more compute. That allows a moderately sized AI judge to achieve top performance when needed. Overall this is really interesting, because DeepSeek are continuing to push the frontier of innovation, and the tides are shifting: China isn't just copying the West anymore, they're innovating on their own front. Now, one thing I do want to talk about is that R2 is coming. We know DeepSeek's next frontier models are due fairly soon, and with this recent research paper I'm wondering whether it will be part of their next AI release. We know they're working extremely hard, and some would argue they're even ahead of other recent frontier labs. A lot of people know about Meta's Llama 4 and
how there have been many, many mishaps with regards to how that model has performed. So DeepSeek are rushing to launch a new AI model and going all in; the article reporting this was from February 2025, so I'm wondering when the model will actually arrive, because we still have many other open-source AIs like Qwen coming soon, and potentially even GPT-5 and o3-mini. It's super important that DeepSeek maintain their momentum, because it's hard to stay relevant in the AI space. They do state that they may release the model as early as May; that was the original plan, but now they want it out as soon as possible. So I personally wouldn't be surprised if we get DeepSeek R2 at the end of this month. It's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I think this could potentially happen, but of course we'll have to see. Individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment for the AI industry, and I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. Like I said before, if you watched yesterday's video you'll know there was a ton of controversy around the Llama 4 release, with many critics arguing that they gamed the benchmarks and the model simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.
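The inference-time recipe described earlier, sample the trained judge several times, have a tiny meta RM rate each written critique, and average only the scores extracted from the best critiques, can be sketched in a few lines of Python. To be clear, this is a toy sketch: `judge`, `meta_rm`, and the "Score: N/10" text format are stand-ins invented here for illustration, not DeepSeek's actual models, prompts, or API.

```python
import random
import re

def judge(answer: str) -> str:
    """Toy stand-in for the GRM judge: writes principles and a
    critique, then ends with a score embedded in the text."""
    score = random.randint(6, 9)  # a real judge reasons; this one rolls dice
    return (f"Principle: answers should be factual and complete. "
            f"Critique: '{answer}' is reasonable. Score: {score}/10")

def meta_rm(critique: str) -> float:
    """Toy stand-in for the meta RM: rates critique quality in [0, 1]."""
    return random.random()

def extract_score(critique: str) -> int:
    """Pull the numeric score back out of the judge's written reasoning."""
    return int(re.search(r"Score:\s*(\d+)/10", critique).group(1))

def sampled_judgment(answer: str, k: int = 8, keep: int = 4) -> float:
    """Ask the judge k times, keep the critiques the meta RM rates
    highest, and average the scores extracted from the survivors."""
    critiques = [judge(answer) for _ in range(k)]
    best = sorted(critiques, key=meta_rm, reverse=True)[:keep]
    return sum(extract_score(c) for c in best) / len(best)

final = sampled_judgment("The capital of France is Paris.")
print(final)
```

Note that setting `keep` equal to `k` degenerates to plain voting (a simple average over all samples); filtering with the meta RM before averaging is the extra step the paper reports as beating simple voting.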