So, are DeepSeek creating a self-improving AI? That's the latest claim coming out of this newspaper, which reports that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. There is some truth to the claim, but I'm going to dive into how it all works and what's really behind it.

You can see the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This got the Twittersphere talking, because so many people were wondering what on earth this is and how DeepSeek are actually changing the game when it comes to having an AI improve itself. Now, we all know that DeepSeek are an open-research, open-source company, which of course means they publish their research. Right here you can see the paper that was published; it's about inference-time scaling for generalist reward modeling. Trust me, I'm not going to bore you with the specifics of everything; I'll
actually explain this to you in a way that you can understand. The result is that, yes, with inference-time scaling the model does improve. What we can see here, and it's actually not that hard to understand, is how accurate the rating becomes each time you sample the model. Say, for example, we ask the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph shows how accurate those ratings get: one axis shows the judge's performance, the other shows how many times it was sampled.

And this isn't the only graph showing how strong this is, because, importantly, they also compare against other models here, including GPT-4o, which is really interesting. Although I will say I'm not sure which version of GPT-4o this is, as there are multiple versions of GPT-4o that currently exist. So this model outperforming it with inference-time scaling comes with a caveat, because GPT-4o isn't actually an inference-time model.
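To make the "sample it many times" idea concrete, here's a toy Python sketch. This is not DeepSeek's code: `noisy_judge` is a made-up stand-in for a reward model that is only right most of the time, and the error rate and 1-10 scale are invented for illustration. The point is simply that majority voting over repeated samples recovers the correct rating far more often than a single query.

```python
import random

def noisy_judge(correct_score: int, error_rate: float = 0.4) -> int:
    """Toy reward model: returns the correct 1-10 score most of the
    time, and a random wrong score otherwise."""
    if random.random() < error_rate:
        return random.randint(1, 10)
    return correct_score

def sample_and_vote(correct_score: int, k: int) -> int:
    """Sample the judge k times and take the most common score
    (simple majority voting at inference time)."""
    votes = [noisy_judge(correct_score) for _ in range(k)]
    return max(set(votes), key=votes.count)

def accuracy(correct_score: int, k: int, trials: int = 2000) -> float:
    """Fraction of trials where voting over k samples lands on the
    correct score."""
    hits = sum(sample_and_vote(correct_score, k) == correct_score
               for _ in range(trials))
    return hits / trials

random.seed(0)
for k in (1, 8, 32):
    print(f"k={k:>2}  accuracy={accuracy(7, k):.2f}")
```

Run it and the accuracy climbs as k grows, which is exactly the shape of the curves in the paper's graph: more samples, better judgments.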
I'm guessing they're just showing how good the model gets in terms of overall performance. So let me explain how this works, because I think it's probably going to be used in the base of the next model, DeepSeek R2, which is likely to be DeepSeek's next frontier model, one that could compete with the next generation of AI.

Here's how the whole thing works, broken down. The goal is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train it by having another AI judge how good its answers are. That judge is called a reward model, and this paper essentially sets out to build a very good, versatile judge.

Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers, but when they judge creative answers, or vice versa, they just don't perform that well. Number two, these AI judges don't improve much on the spot, so giving them more compute with
inference time doesn't necessarily make their judgment much better; they mostly improve through training. And because these issues are baked into the judges, anyone trying to improve a model's responses with reward models has to come up with a better solution. That's where DeepSeek's solution comes into the picture: the GRM judge.

The paper's solution is called DeepSeek-GRM, and it's a different kind of judge. Instead of just outputting a score like seven out of ten, this judge writes out its reasoning: it explains why an answer is good or bad based on specific principles, which are the rules it comes up with itself. Then the score is extracted from that reasoning. They chose this design because it's more flexible and more detailed, and, more importantly, if you ask it to judge the same thing multiple times, it might write slightly different principles or critiques, leading to slightly different scores. So what they actually do is train this judge with SPCT, Self-Principled Critique Tuning.
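As a rough illustration of the "write reasoning, then extract the score" idea, here's a small Python sketch. The `judge_response` function just returns canned text standing in for what a real generative reward model would write; the `Score: N/10` format and the regex used to pull the number out are my own assumptions for the example, not the paper's actual format.

```python
import re

def judge_response(question: str, answer: str) -> str:
    """Stand-in for the generative judge: a real GRM is an LLM that
    writes principles and a critique ending in a score. Here the
    generated text is faked so the extraction step can be shown."""
    return (
        "Principles: answers should be correct and concise.\n"
        f"Critique: the answer '{answer}' addresses '{question}' "
        "directly and is factually correct, though brief.\n"
        "Score: 8/10"
    )

def extract_score(critique: str) -> int:
    """Pull the numeric score out of the judge's written reasoning."""
    match = re.search(r"Score:\s*(\d+)\s*/\s*10", critique)
    if match is None:
        raise ValueError("no score found in critique")
    return int(match.group(1))

critique = judge_response("What is 2 + 2?", "4")
print(extract_score(critique))  # → 8
```

The key design point is that the score is a by-product of the written critique, so the same reasoning text can be inspected, re-sampled, or rated by another model later.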
This is where they use reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and when its final judgment matches the correct judgment, it gets a reward and learns to do that more. Over time they're reinforcing the good behavior, and that teaches the judge AI to generate principles and critiques that lead to accurate judgments, making it smarter over time.

Making it better on the spot, of course, is where inference-time scaling comes in. That famous paradigm everyone is talking about is once again making headlines. What they do is ask the judge multiple times, which is called sampling: when they want a judgment on an answer, they ask their trained judge eight or 32 times. Then they combine those judgments with voting, where they collect all the scores from the
multiple tries and combine them into, I guess you could say, an average. And all of this happens at inference time. They also have a smarter way of combining, where they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. Then they combine the scores only from the critiques the meta RM thought were good. Asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than asking once. It's much more compute-intensive, but it does give a better result.

Now, overall, you might be thinking: all right, that's pretty interesting, but what are the results? The results are that this performs really well. The AI judge works well across many different tasks, the ask-multiple-times strategy dramatically improves its accuracy, and the more times they ask it, or the more compute they use, the better it gets. Their judge AI, even though it's medium-sized, could outperform much larger AIs, like GPT-4o used as a judge.
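The meta-RM-guided voting step can be sketched in a few lines of Python. Everything here is a stand-in: the real meta RM is a trained model, while `meta_rm` below is just a toy heuristic, and the sample data is invented. The sketch only shows the combining logic: rate every critique, keep the best-rated ones, and average only their scores, so one confident-sounding but wrong sample can't drag the result around.

```python
def meta_rm(critique: str) -> float:
    """Stand-in for the tiny meta reward model: rates how trustworthy
    a critique looks. Toy heuristic only: longer critiques that
    mention evidence score higher."""
    score = min(len(critique) / 100, 1.0)
    if "evidence" in critique:
        score += 0.5
    return score

def guided_vote(samples: list[tuple[int, str]], keep: int = 2) -> float:
    """Combine sampled (score, critique) judgments: rank critiques by
    the meta RM, keep only the `keep` best, average their scores."""
    ranked = sorted(samples, key=lambda s: meta_rm(s[1]), reverse=True)
    kept = ranked[:keep]
    return sum(score for score, _ in kept) / len(kept)

samples = [
    (9, "short"),
    (7, "a detailed critique citing evidence for each principle it applied"),
    (7, "another long critique that walks through the evidence carefully"),
    (2, "meh"),
]
print(guided_vote(samples))  # → 7.0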
At least, that's when those larger AIs were only asked once. So overall it's looking really good, and using the meta RM, that tiny helper AI, to pick out the best critiques works even better than simple voting. They've basically built a really smart AI judge that explains its reasoning, trained it in a clever way, and, crucially, this judge gets better if you let it think multiple times, which just means using more compute. That allows a moderately sized AI judge to achieve top performance when needed.

So overall, this is once again something really interesting, because we know DeepSeek are continuing to push the frontier in terms of innovation. And I will say it's really interesting that the tides are shifting: it's not that China is copying the West anymore; they're innovating on their own front.

Now, one thing I do want to talk about is, of course, the fact that R2 is coming. We know DeepSeek's next frontier models are coming fairly soon, and with this recent research paper I'm wondering if it's going to be part of their next
AI release. We know they're working super hard, and some would argue they're even ahead of certain frontier labs. A lot of people know about Meta's Llama 4 and the many mishaps around how that model has performed. So DeepSeek are rushing to launch a new AI model and going all in; that article was from February 2025, so I'm wondering when the model will actually be released, because we still have many other open-source AIs, like Qwen, coming soon, and potentially even GPT-5 and o3-mini. It's super important that DeepSeek maintain their momentum, because it's quite hard to stay relevant in the AI space.

They do state it's possible they'll release the model as early as May; that was the plan, to release it in May, but now they want it out as early as possible. So I personally wouldn't be surprised if we get DeepSeek R2 at the end of this month. It's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I do
think that could potentially happen, but of course we'll have to see. Individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment in the AI industry, and I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. Like I said before, if you watched yesterday's video, you'll know there was a ton of controversy around the Llama 4 release, with many critics arguing that the benchmarks were gamed and the model simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.