So, are DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper, and they're talking about the fact that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. I mean, there is some truth to it, but I'm going to dive into how this all works and the truth about this claim.

So you can see right here that the article starts by saying "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This actually got the Twittersphere talking, because so many people were wondering exactly what on earth this is, and how on earth DeepSeek are actually changing the game when it comes to having an AI improve itself.

Now, we all know that DeepSeek are an open research company, an open-source company, and of course that means they publish their research. So right here you can see exactly the research that was published, and it's about inference-time scaling for generalist reward modeling. Now, trust me guys, I'm not going to bore you with the specifics of everything; I'll explain this to you in a way that you can understand. But the result we are seeing is that yes, with inference-time scaling, the model does improve.

So what we can see here, and trust me guys, it's actually not that hard to understand, is that the more times you sample the model, the more accurate its rating becomes. So let's say, for example, we were asking the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph here shows how accurate those ratings get as you sample more: one axis shows the AI's performance, and the other shows how many times it tried.
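To get an intuition for why that curve slopes upward, here's a toy simulation of my own (not from the paper): if each independent sample of a judge is correct with some probability better than chance, voting across more samples pushes the aggregate accuracy up.

```python
import random

def vote_accuracy(p: float, k: int, trials: int = 100_000) -> float:
    """Estimate how often a majority vote over k independent samples is
    correct, when each individual sample is correct with probability p.
    Ties (possible when k is even) are broken by a coin flip."""
    wins = 0
    for _ in range(trials):
        correct_votes = sum(random.random() < p for _ in range(k))
        if 2 * correct_votes > k or (2 * correct_votes == k and random.random() < 0.5):
            wins += 1
    return wins / trials

for k in (1, 8, 32):
    print(f"{k:>2} samples -> accuracy ~ {vote_accuracy(0.65, k):.3f}")
```

With a per-sample accuracy of 0.65, the voted accuracy climbs well above that as k grows, which is the basic statistical effect inference-time scaling is exploiting.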
Now, this isn't the only graph that shows you just how incredible this is, because I think it's really important that they're also showing us the other models here. They show GPT-4o, which is really interesting, although I will say I'm not sure which version of GPT-4o this is, as there are multiple different versions of GPT-4o that currently exist. So this model outperforming it with inference-time scaling comes with a small caveat, because GPT-4o isn't actually an inference-time model; I'm guessing they're just showing how good the model gets in terms of overall performance.

So I'll actually explain to you guys how this works, because I think this is probably going to be used in the base of the next model, DeepSeek R2, which is probably going to be DeepSeek's next frontier model, one that could outperform the next level of AI.

So here's how this entire thing works, and I'm just going to break it down for you guys. The goal here is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train another AI to actually judge how good the answers are, and this judge is called a reward model. This paper basically wants to create a very good, versatile judge.

Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers, but when they judge creative answers, or vice versa, they just don't perform that well. Number two, these AI judges don't improve much on the spot, so giving them more compute at inference time doesn't necessarily make their judgment much better; they mostly improve through training. And because these issues are ingrained in the judges we use to improve a model's responses, we of course have to come up with a better solution.
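To make the reward model's role concrete, here's a minimal sketch of how a plain reward model gets used. The names `reward_model` and `pick_best` are my own illustration, not DeepSeek's code, and the scorer here is a random toy stand-in.

```python
import random

def reward_model(question: str, answer: str) -> float:
    """Toy stand-in for a trained judge. A real reward model is a
    neural network that reads the (question, answer) pair and returns
    a scalar score; a random score here just shows the wiring."""
    return random.random()

def pick_best(question: str, candidates: list[str]) -> str:
    """Best-of-n selection: score every candidate answer the main
    model produced and keep the highest-scoring one. This is one
    standard way a reward model is used to improve another model."""
    return max(candidates, key=lambda ans: reward_model(question, ans))

answers = ["4", "2 + 2 is 4", "I prefer not to answer"]
print(pick_best("What is 2 + 2?", answers))
```

The same scores can also drive reinforcement learning on the main model, so a judge that rates answers badly drags the whole pipeline down with it.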
And that's where DeepSeek's solution comes into the picture: this is where we get the GRM judge. The paper's solution is called DeepSeek-GRM, and this is a different kind of judge. Instead of just outputting a score like seven out of ten, their judge writes out its reasoning and explains why an answer is good or bad based on specific principles, which are the rules it manages to come up with for the task at hand. Then, from that reasoning, the score is extracted. The reason they've chosen to do this is that it's more flexible, it's more detailed, and, more importantly, if you ask it to judge the same thing multiple times, it might write multiple slightly different principles or critiques, leading to slightly different scores.

So what they actually do is train this judge with SPCT, self-principled critique tuning. This is where they use reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and its final judgment is checked against the correct judgment; when it matches, it gets a reward, and it learns to do that more. So over time they're reinforcing the good behavior from this model, and the reason they're doing this is that it teaches the judge AI to generate good principles and critiques that lead to accurate judgments, which over time makes it smarter.

Now, of course, making it better on the spot is where inference-time scaling comes in. That famous paradigm everyone is talking about is once again making waves. What they do here is ask the judge multiple times, which is called sampling. So when they want a judgment on an answer, they ask their trained judge multiple times, say eight or 32 times. Then they combine those judgments with voting: they collect all the scores from the multiple tries and combine them into, I guess you could say, an average.
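Here's a rough sketch of that sample-and-vote step, as I understand it from the paper's description. The function names and the "Score: N" output format are my assumptions, not DeepSeek's actual code.

```python
import random
import re
import statistics

def extract_score(critique: str) -> int | None:
    """Pull the numeric score out of a written critique. This assumes
    the judge ends with something like 'Score: 7'; the real format is
    whatever the trained judge was taught to emit."""
    match = re.search(r"Score:\s*(\d+)", critique)
    return int(match.group(1)) if match else None

def judge_by_voting(sample_judge, question: str, answer: str, k: int = 8) -> float:
    """Sample the judge k times, extract a score from each critique,
    and combine the scores by simple averaging (the 'voting' step)."""
    scores = []
    for _ in range(k):
        critique = sample_judge(question, answer)  # one fresh, independent sample
        score = extract_score(critique)
        if score is not None:
            scores.append(score)
    return statistics.mean(scores)

# Toy judge so the sketch runs end to end; the real judge is an LLM
# that writes principles and a critique before its score line.
def toy_judge(question: str, answer: str) -> str:
    return f"The answer addresses the question. Score: {random.randint(6, 9)}"

print(judge_by_voting(toy_judge, "What is 2 + 2?", "4", k=32))
```

The design idea is that averaging many slightly different judgments cancels out the run-to-run variation that any single sample of the judge would have.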
And all of this is happening at inference time. They then have smart combining, which is where they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. Then they combine the scores only from the critiques that the meta RM thought were good; I'll sketch this filtering step in code in a moment. This matters because asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than just asking once. It is much more compute-intensive, but it does give a better result.

Now, overall you might be thinking, all right, that is pretty interesting, but what are the results here? The results are that this performs really well. The AI judge works well across many different tasks. The ask-multiple-times strategy dramatically improves the judge's accuracy, and the more times they ask it, or the more compute they use, the better it gets. And their judge AI, even though it's medium-sized, could outperform much larger AIs like GPT-4o used as a judge, at least when those larger AIs were only asked once. So overall it's looking really good. And using the meta RM, that tiny helper AI, to pick out the best critiques works even better than simple voting.

They basically built a really smart AI judge that explains its reasoning, they trained it in a really clever way, and crucially, this judge gets better if you let it think multiple times, which is just the same as using more compute. This allows a moderately sized AI judge to achieve top performance when needed.

So overall, this is once again something really interesting, because we know that DeepSeek are continuing to push the frontier in terms of innovation. And I will say it's really interesting that we're seeing the tides shifting: it's not like China is copying the West anymore; they're actually innovating on their own front.
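Before moving on, here's how that meta-RM filtering step might look in code, extending the voting sketch from earlier. Again, `meta_rm`, the sample count `k`, and the `keep` cutoff are illustrative assumptions; the paper's meta RM is a small trained model, not a random scorer.

```python
import random
import re
import statistics

def extract_score(critique: str) -> int | None:
    """Assumes the judge ends its critique with a line like 'Score: 7'."""
    match = re.search(r"Score:\s*(\d+)", critique)
    return int(match.group(1)) if match else None

def meta_rm(critique: str) -> float:
    """Toy stand-in for the meta RM: a small model that rates how
    trustworthy a written critique looks. A random quality score
    here just shows the wiring."""
    return random.random()

def judge_with_meta_rm(sample_judge, question: str, answer: str,
                       k: int = 32, keep: int = 16) -> float:
    """Sample the judge k times, let the meta RM rate every critique,
    and average the scores of only the top-rated critiques."""
    critiques = [sample_judge(question, answer) for _ in range(k)]
    best = sorted(critiques, key=meta_rm, reverse=True)[:keep]
    scores = [s for c in best if (s := extract_score(c)) is not None]
    return statistics.mean(scores)

def toy_judge(question: str, answer: str) -> str:
    return f"Reasonable and well supported. Score: {random.randint(5, 9)}"

print(judge_with_meta_rm(toy_judge, "What is 2 + 2?", "4"))
```

Filtering before averaging means one confused or off-topic critique can't drag the final score around, which is why this reportedly beats simple voting.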
Now, one thing that I do want to talk about is, of course, the fact that R2 is coming. We do know that DeepSeek's new frontier models are going to be coming fairly soon, and with this recent research paper, I am wondering if it is going to be part of their next AI release. We know that they are working super hard, and some would argue that they're even ahead of some frontier labs. I mean, a lot of people know about Meta's Llama 4 and how there have been many different mishaps with regard to how that model has performed.

So DeepSeek are rushing to launch a new AI model, and they are going all-in. That article was from February 2025, so I am wondering when this model is going to be released, because we still do have many other open-source AIs like Qwen coming soon, and potentially even GPT-5 and o3 Mini. So it is super important that DeepSeek maintain their momentum, as it is quite hard to stay relevant in the AI space.

Now, they do state that it is possible they will release the model as early as May, because that was the plan: they planned to release it in May, but now they want it out as early as possible. So I wouldn't personally be surprised if we do get DeepSeek R2 at the end of this month. I mean, it's a once-in-a-lifetime opportunity to put your company out even further ahead of OpenAI and steal the limelight once again. I do think this could potentially happen, but of course we will have to see.

And individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment in the AI industry. I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts.
Of course, like I said before, if you watched yesterday's video, you'll know that there was a ton of controversy around the Llama 4 release, with many critics arguing that the benchmarks were gamed and the model just simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.