So, is DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper, which reports that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. There is some truth to it, but I'm going to dive into how this all works and what's really behind the claim.

You can see right here that the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This actually got the Twittersphere buzzing, because so many people were wondering exactly what this is and how DeepSeek is changing the game when it comes to having an AI improve itself.

Now, we all know that DeepSeek is an open-research, open-source company, which means they publish their research. Right here you can see the paper that was published, and it's about inference-time scaling for generalist reward modeling. Trust me, I'm not going to bore you with every specific; I'll explain this in a way you can understand. But the result we're seeing is that yes, with inference-time scaling, the model does improve.

What we can see here, and it's actually not that hard to understand, is how accurate the rating becomes each time you sample the model. Say we were asking the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph shows how accurate those ratings get: one axis shows the judge's performance, and the other shows how many times it was sampled.

And this isn't the only graph worth looking at, because they also compare against other models. They show GPT-4o here, which is really interesting, although I will say I'm not sure which version of GPT-4o this is, as there are multiple versions currently in existence. So this model outperforming GPT-4o with inference-time scaling comes with a caveat, because GPT-4o isn't actually an inference-time model.
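To make that sampling idea concrete, here's a minimal Python sketch, my own illustration rather than anything from the paper, of why repeatedly asking a noisy judge and voting raises accuracy. The function `judge_once` is a hypothetical stand-in for a judge that picks the better answer 65% of the time on any single try; the accuracy figures are whatever the simulation prints, not the paper's numbers.

```python
import random

# Hypothetical stand-in for one call to a reward-model judge.
# It picks the correct answer with probability p, so a single
# sample is only right p of the time.
def judge_once(p=0.65):
    return 1 if random.random() < p else 0  # 1 = correct pick

# Majority vote over k independent samples of the same judge.
# Ties on even k count as incorrect here, for simplicity.
def judge_with_sampling(k, p=0.65):
    votes = sum(judge_once(p) for _ in range(k))
    return 1 if votes > k / 2 else 0

# Estimate accuracy at different sample counts, mirroring the
# x-axis of an accuracy-vs-samples plot.
random.seed(0)
trials = 10_000
for k in (1, 8, 32):
    acc = sum(judge_with_sampling(k) for _ in range(trials)) / trials
    print(f"k={k:>2} samples -> accuracy ~ {acc:.3f}")
```

As long as a single sample is right more often than not, voting over more samples pushes accuracy up, which is the basic intuition behind the curve in the graph.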
I'm guessing they're just showing how good the model gets in terms of overall performance. Now, I'll explain how this works, because I think it will probably be used in the base of DeepSeek R2, which is likely to be DeepSeek's next frontier model and could compete with the next generation of AI.

So here's how the whole thing works, broken down simply. The goal is to get an AI, let's say ChatGPT in this example, to improve. The way you do that is to train another AI to judge how good the first one's answers are. That judge is called a reward model, and this paper sets out to build a very good, versatile judge.

Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers but perform poorly on creative answers, or vice versa. Number two, these judges don't improve much on the spot, so giving them more compute at inference time doesn't necessarily make their judgments better; they mostly improve through training. Because of these ingrained issues, if we want to improve a model's responses using reward models, we have to come up with a better solution. And that's where DeepSeek's solution comes into the picture: the GRM judge.

The paper's solution is called DeepSeek-GRM, and it's a different kind of judge. Instead of just outputting a score like seven out of ten, this judge writes out its reasoning: it explains why an answer is good or bad based on specific principles, which are rules it comes up with itself. Then the score is extracted from that reasoning. The reason they've chosen this design is that it's more flexible, more detailed, and, more importantly, if you ask it to judge the same thing multiple times, it might write slightly different reasons or principles, leading to slightly different scores.
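To illustrate that "write the reasoning, then extract the score" design, here's a minimal sketch under my own assumptions. `generate_critique` is a hypothetical placeholder for a real call to a generative judge model, and the "Final score: N/10" format is an invented convention for this example, not the paper's actual prompt.

```python
import re

# Hypothetical stand-in for a call to a generative judge model.
# In practice this would be a chat-completion call to a model
# prompted to write principles, then a critique, then a score.
def generate_critique(question, answer):
    return (
        "Principles: the answer should be factually correct, complete, "
        "and directly address the question.\n"
        "Critique: the answer is correct but omits an edge case.\n"
        "Final score: 7/10"
    )

# Extract the numeric score from the written critique, so the
# free-form reasoning can still drive a scalar reward signal.
def extract_score(critique):
    match = re.search(r"Final score:\s*(\d+)\s*/\s*10", critique)
    return int(match.group(1)) if match else None

critique = generate_critique("What is 2+2?", "4")
print(extract_score(critique))  # -> 7
```

The point of the design is that the scalar reward is a by-product of the reasoning, so the same judge can be sampled repeatedly to get genuinely different critiques and scores.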
So what they actually do is train this judge with SPCT, Self-Principled Critique Tuning. This uses reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and when its final judgment matches the correct judgment, it gets a reward and learns to do that more often. Over time they're reinforcing the good behavior: this teaches the judge AI to generate good principles and critiques that lead to accurate judgments, which gradually makes it smarter.

Making it better on the spot is where inference-time scaling comes in; that famous paradigm everyone is talking about is once again making headlines. What they do here is ask the judge multiple times, which is called sampling. When they want a judgment, they query their trained judge repeatedly, say eight or 32 times. Then they combine those judgments through voting: they collect the scores from all the tries and combine them into, essentially, an average. All of this happens at inference time.

They also have a smarter way of combining. They trained another, tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. Then they combine the scores only from the critiques the meta RM thought were good. Asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than asking once. It's much more compute-intensive, but it does give a better result.

Now, you might be thinking, all right, that's pretty interesting, but what are the results? The results are that this performs really well. The AI judge works well across many different tasks. The ask-multiple-times strategy dramatically improves the judge's accuracy, and the more times they ask it, meaning the more compute they use, the better it gets.
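Here's a toy simulation, again my own sketch rather than the paper's method, of that sample-vote-filter pipeline: sample the judge k times, average all scores for simple voting, and for meta-RM-guided voting keep only the critiques a simulated meta RM rates highest before averaging. All function names and numbers here are illustrative assumptions.

```python
import random
import statistics

# Hypothetical stand-in for one sampled judgment from the trained judge.
# Each call returns a (critique, score) pair with some noise, since a
# generative judge writes slightly different reasoning every time.
def sample_judgment():
    score = random.gauss(7.0, 1.5)  # noisy score around a "true" rating of 7
    critique = f"Critique draft scoring {score:.1f}/10"
    return critique, max(0.0, min(10.0, score))

# Hypothetical stand-in for the meta RM. In this toy simulation its
# quality signal is closeness to the true rating, standing in for a
# small trained model that recognizes well-reasoned critiques.
def meta_rm_quality(critique, score):
    return -abs(score - 7.0) + random.gauss(0, 0.3)

def judge(k=32, keep_fraction=0.5):
    samples = [sample_judgment() for _ in range(k)]
    # Simple voting: average every sampled score.
    plain_vote = statistics.mean(s for _, s in samples)
    # Meta-RM-guided voting: keep only the highest-rated critiques,
    # then average just those scores.
    ranked = sorted(samples, key=lambda cs: meta_rm_quality(*cs), reverse=True)
    kept = ranked[: max(1, int(k * keep_fraction))]
    guided_vote = statistics.mean(s for _, s in kept)
    return plain_vote, guided_vote

random.seed(0)
plain, guided = judge()
print(f"simple voting:  {plain:.2f}/10")
print(f"meta RM voting: {guided:.2f}/10")
```

The trade-off is exactly what the video describes: every extra sample costs more compute at inference time, but the filtered average tends to sit closer to the true rating than any single noisy judgment.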
And their judge AI, even though it's medium-sized, could outperform much larger AIs like GPT-4o used as a judge, at least when those larger AIs were only asked once. So overall it's looking really good. And using the meta RM, that tiny helper AI, to pick out the best critiques works even better than simple voting.

They basically built a really smart AI judge that explains its reasoning, and they trained it in a really clever way. Crucially, this judge gets better if you let it think multiple times, which is just another way of using more compute, and that allows a moderately sized AI judge to achieve top performance when needed.

So overall, this is once again something really interesting, because we know DeepSeek is continuing to push the frontier in terms of innovation. And I will say it's really interesting that the tides are shifting: it's not like China is just copying the West anymore; they're innovating on their own front.

Now, one thing I do want to talk about is the fact that R2 is coming. We know that DeepSeek's next frontier models are coming fairly soon, and with this recent research paper, I'm wondering if this technique is going to be part of their next AI release. We know they're working super hard, and some would argue they're even ahead of recent frontier labs. A lot of people know about Meta's Llama 4 and the many mishaps around how that model has performed. So DeepSeek is rushing to launch a new AI model and going all in, and the article reporting this was from February 2025. I'm wondering when this model will be released, because we still have many other open-source AIs like Qwen coming soon, and potentially even GPT-5 and o3 Mini on the horizon. So it's super important that DeepSeek maintain their momentum, as it's quite hard to stay relevant in the AI space. They do state that it's possible they'll release the model as early as May; that was the plan, to release it in May, but now they want it out as early as possible.
So I wouldn't personally be surprised if we do get DeepSeek R2 at the end of this month. It's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I do think this could happen, but of course we'll have to see. Individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment for the AI industry, and I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. As I said before, if you watched yesterday's video, you'll know there was a ton of controversy around the Llama 4 release, with many critics arguing that the benchmarks were gamed and the model simply isn't that good. So it will be interesting to see exactly how the release of R2, when it arrives, affects the AI industry.