So, are DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper, and they're talking about the fact that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. I mean, there is some truth to it, but I'm going to dive into how this all works and the truth about this claim.

So you can see right here that the article starts by saying "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This actually got the Twittersphere talking, because so many people were wondering exactly what on earth this is, and how on earth DeepSeek are actually changing the game when it comes to having an AI improve itself.

Now, we all know that DeepSeek are an open research company, an open-source company, and of course that means they publish their research. So right here you can see exactly the research that was published, and it's about inference-time scaling for generalist reward modeling. Now, trust me guys, I'm not going to bore you with the specifics of everything; I'll explain this to you in a way that you can understand. But the result we are seeing is that yes, with inference-time scaling, the model does improve.

So what we can see here, and trust me guys, it's actually not that hard to understand, is that the more times you sample the model, the more accurate its rating becomes. So let's say, for example, we were asking the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph here shows how accurate those ratings get as you sample more: one axis shows the AI's performance, and the other shows how many times it tried.
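To get an intuition for why that curve slopes upward, here's a toy simulation of my own (not from the paper): if each independent sample of a judge is correct with some probability better than chance, voting across more samples pushes the aggregate accuracy up.

```python
import random

def vote_accuracy(p: float, k: int, trials: int = 100_000) -> float:
    """Estimate how often a majority vote over k independent samples is
    correct, when each individual sample is correct with probability p.
    Ties (possible when k is even) are broken by a coin flip."""
    wins = 0
    for _ in range(trials):
        correct_votes = sum(random.random() < p for _ in range(k))
        if 2 * correct_votes > k or (2 * correct_votes == k and random.random() < 0.5):
            wins += 1
    return wins / trials

for k in (1, 8, 32):
    print(f"{k:>2} samples -> accuracy ~ {vote_accuracy(0.65, k):.3f}")
```

With a per-sample accuracy of 0.65, the voted accuracy climbs well above that as k grows, which is the basic statistical effect inference-time scaling is exploiting.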
Now, this isn't the only graph that shows you just how incredible this is, because I think it's really important that they're also showing us the other models here. They show GPT-4o, which is really interesting, although I will say I'm not sure which version of GPT-4o this is, as there are multiple different versions of GPT-4o that currently exist. So this model outperforming it with inference-time scaling comes with a small caveat, because GPT-4o isn't actually an inference-time model; I'm guessing they're just showing how good the model gets in terms of overall performance.

So I'll actually explain to you guys how this works, because I think this is probably going to be used in the base of the next model, DeepSeek R2, which is probably going to be DeepSeek's next frontier model, one that could outperform the next level of AI.

So here's how this entire thing works, and I'm just going to break it down for you guys. The goal here is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train another AI to actually judge how good the answers are, and this judge is called a reward model. This paper basically wants to create a very good, versatile judge.

Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers, but when they judge creative answers, or vice versa, they just don't perform that well. Number two, these AI judges don't improve much on the spot, so giving them more compute at inference time doesn't necessarily make their judgment much better; they mostly improve through training. And because these issues are ingrained in the judges we use to improve a model's responses, we of course have to come up with a better solution.
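To make the reward model's role concrete, here's a minimal sketch of how a plain reward model gets used. The names `reward_model` and `pick_best` are my own illustration, not DeepSeek's code, and the scorer here is a random toy stand-in.

```python
import random

def reward_model(question: str, answer: str) -> float:
    """Toy stand-in for a trained judge. A real reward model is a
    neural network that reads the (question, answer) pair and returns
    a scalar score; a random score here just shows the wiring."""
    return random.random()

def pick_best(question: str, candidates: list[str]) -> str:
    """Best-of-n selection: score every candidate answer the main
    model produced and keep the highest-scoring one. This is one
    standard way a reward model is used to improve another model."""
    return max(candidates, key=lambda ans: reward_model(question, ans))

answers = ["4", "2 + 2 is 4", "I prefer not to answer"]
print(pick_best("What is 2 + 2?", answers))
```

The same scores can also drive reinforcement learning on the main model, so a judge that rates answers badly drags the whole pipeline down with it.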
And that's where DeepSeek's solution comes into the picture: this is where we get the GRM judge. The paper's solution is called DeepSeek-GRM, and this is a different kind of judge. Instead of just outputting a score like seven out of ten, their judge writes out its reasoning and explains why an answer is good or bad based on specific principles, which are the rules it manages to come up with for the task at hand. Then, from that reasoning, the score is extracted. The reason they've chosen to do this is that it's more flexible, it's more detailed, and, more importantly, if you ask it to judge the same thing multiple times, it might write multiple slightly different principles or critiques, leading to slightly different scores.

So what they actually do is train this judge with SPCT, self-principled critique tuning. This is where they use reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and its final judgment is checked against the correct judgment; when it matches, it gets a reward, and it learns to do that more. So over time they're reinforcing the good behavior from this model, and the reason they're doing this is that it teaches the judge AI to generate good principles and critiques that lead to accurate judgments, which over time makes it smarter.

Now, of course, making it better on the spot is where inference-time scaling comes in. That famous paradigm everyone is talking about is once again making waves. What they do here is ask the judge multiple times, which is called sampling. So when they want a judgment on an answer, they ask their trained judge multiple times, say eight or 32 times. Then they combine those judgments with voting: they collect all the scores from the multiple tries and combine them into, I guess you could say, an average.
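Here's a rough sketch of that sample-and-vote step, as I understand it from the paper's description. The function names and the "Score: N" output format are my assumptions, not DeepSeek's actual code.

```python
import random
import re
import statistics

def extract_score(critique: str) -> int | None:
    """Pull the numeric score out of a written critique. This assumes
    the judge ends with something like 'Score: 7'; the real format is
    whatever the trained judge was taught to emit."""
    match = re.search(r"Score:\s*(\d+)", critique)
    return int(match.group(1)) if match else None

def judge_by_voting(sample_judge, question: str, answer: str, k: int = 8) -> float:
    """Sample the judge k times, extract a score from each critique,
    and combine the scores by simple averaging (the 'voting' step)."""
    scores = []
    for _ in range(k):
        critique = sample_judge(question, answer)  # one fresh, independent sample
        score = extract_score(critique)
        if score is not None:
            scores.append(score)
    return statistics.mean(scores)

# Toy judge so the sketch runs end to end; the real judge is an LLM
# that writes principles and a critique before its score line.
def toy_judge(question: str, answer: str) -> str:
    return f"The answer addresses the question. Score: {random.randint(6, 9)}"

print(judge_by_voting(toy_judge, "What is 2 + 2?", "4", k=32))
```

The design idea is that averaging many slightly different judgments cancels out the run-to-run variation that any single sample of the judge would have.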
And all of this is happening at inference time. They then have smart combining, which is where they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. Then they combine the scores only from the critiques that the meta RM thought were good; I'll sketch this filtering step in code in a moment. This matters because asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than just asking once. It is much more compute-intensive, but it does give a better result.

Now, overall you might be thinking, all right, that is pretty interesting, but what are the results here? The results are that this performs really well. The AI judge works well across many different tasks. The ask-multiple-times strategy dramatically improves the judge's accuracy, and the more times they ask it, or the more compute they use, the better it gets. And their judge AI, even though it's medium-sized, could outperform much larger AIs like GPT-4o used as a judge, at least when those larger AIs were only asked once. So overall it's looking really good. And using the meta RM, that tiny helper AI, to pick out the best critiques works even better than simple voting.

They basically built a really smart AI judge that explains its reasoning, they trained it in a really clever way, and crucially, this judge gets better if you let it think multiple times, which is just the same as using more compute. This allows a moderately sized AI judge to achieve top performance when needed.

So overall, this is once again something really interesting, because we know that DeepSeek are continuing to push the frontier in terms of innovation. And I will say it's really interesting that we're seeing the tides shifting: it's not like China is copying the West anymore; they're actually innovating on their own front.
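Before moving on, here's how that meta-RM filtering step might look in code, extending the voting sketch from earlier. Again, `meta_rm`, the sample count `k`, and the `keep` cutoff are illustrative assumptions; the paper's meta RM is a small trained model, not a random scorer.

```python
import random
import re
import statistics

def extract_score(critique: str) -> int | None:
    """Assumes the judge ends its critique with a line like 'Score: 7'."""
    match = re.search(r"Score:\s*(\d+)", critique)
    return int(match.group(1)) if match else None

def meta_rm(critique: str) -> float:
    """Toy stand-in for the meta RM: a small model that rates how
    trustworthy a written critique looks. A random quality score
    here just shows the wiring."""
    return random.random()

def judge_with_meta_rm(sample_judge, question: str, answer: str,
                       k: int = 32, keep: int = 16) -> float:
    """Sample the judge k times, let the meta RM rate every critique,
    and average the scores of only the top-rated critiques."""
    critiques = [sample_judge(question, answer) for _ in range(k)]
    best = sorted(critiques, key=meta_rm, reverse=True)[:keep]
    scores = [s for c in best if (s := extract_score(c)) is not None]
    return statistics.mean(scores)

def toy_judge(question: str, answer: str) -> str:
    return f"Reasonable and well supported. Score: {random.randint(5, 9)}"

print(judge_with_meta_rm(toy_judge, "What is 2 + 2?", "4"))
```

Filtering before averaging means one confused or off-topic critique can't drag the final score around, which is why this reportedly beats simple voting.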
Now, one thing that I do want to talk about is, of course, the fact that R2 is coming. We do know that DeepSeek's new frontier models are going to be coming fairly soon, and with this recent research paper, I am wondering if it is going to be part of their next AI release. We know that they are working super hard, and some would argue that they're even ahead of some frontier labs. I mean, a lot of people know about Meta's Llama 4 and how there have been many different mishaps with regard to how that model has performed.

So DeepSeek are rushing to launch a new AI model, and they are going all-in. That article was from February 2025, so I am wondering when this model is going to be released, because we still do have many other open-source AIs like Qwen coming soon, and potentially even GPT-5 and o3 Mini. So it is super important that DeepSeek maintain their momentum, as it is quite hard to stay relevant in the AI space.

Now, they do state that it is possible they will release the model as early as May, because that was the plan: they planned to release it in May, but now they want it out as early as possible. So I wouldn't personally be surprised if we do get DeepSeek R2 at the end of this month. I mean, it's a once-in-a-lifetime opportunity to put your company out even further ahead of OpenAI and steal the limelight once again. I do think this could potentially happen, but of course we will have to see.

And individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment in the AI industry. I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts.
Of course, like I said before, if you watched yesterday's video, you'll know that there was a ton of controversy around the Llama 4 release, with many critics arguing that the benchmarks were gamed and the model just simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.