So, is DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper, which reports that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. There is some truth to it, but I'm going to dive into how this all works and what's really behind the claim.

You can see right here that the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This actually got the Twittersphere buzzing, because so many people were wondering exactly what this is and how DeepSeek is changing the game when it comes to having an AI improve itself.

Now, we all know that DeepSeek is an open-research, open-source company, which means they publish their research. Right here you can see the paper that was published, and it's about inference-time scaling for generalist reward modeling. Trust me, I'm not going to bore you with every specific; I'll explain this in a way you can understand. But the result we're seeing is that yes, with inference-time scaling, the model does improve.

What we can see here, and it's actually not that hard to understand, is how accurate the rating becomes each time you sample the model. Say we were asking the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph shows how accurate those ratings get: one axis shows the judge's performance, and the other shows how many times it was sampled.

And this isn't the only graph worth looking at, because they also compare against other models. They show GPT-4o here, which is really interesting, although I will say I'm not sure which version of GPT-4o this is, as there are multiple versions currently in existence. So this model outperforming GPT-4o with inference-time scaling comes with a caveat, because GPT-4o isn't actually an inference-time model.
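To make that sampling idea concrete, here's a minimal Python sketch, my own illustration rather than anything from the paper, of why repeatedly asking a noisy judge and voting raises accuracy. The function `judge_once` is a hypothetical stand-in for a judge that picks the better answer 65% of the time on any single try; the accuracy figures are whatever the simulation prints, not the paper's numbers.

```python
import random

# Hypothetical stand-in for one call to a reward-model judge.
# It picks the correct answer with probability p, so a single
# sample is only right p of the time.
def judge_once(p=0.65):
    return 1 if random.random() < p else 0  # 1 = correct pick

# Majority vote over k independent samples of the same judge.
# Ties on even k count as incorrect here, for simplicity.
def judge_with_sampling(k, p=0.65):
    votes = sum(judge_once(p) for _ in range(k))
    return 1 if votes > k / 2 else 0

# Estimate accuracy at different sample counts, mirroring the
# x-axis of an accuracy-vs-samples plot.
random.seed(0)
trials = 10_000
for k in (1, 8, 32):
    acc = sum(judge_with_sampling(k) for _ in range(trials)) / trials
    print(f"k={k:>2} samples -> accuracy ~ {acc:.3f}")
```

As long as a single sample is right more often than not, voting over more samples pushes accuracy up, which is the basic intuition behind the curve in the graph.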
I'm guessing they're just showing how good the model gets in terms of overall performance. Now, I'll explain how this works, because I think it will probably be used in the base of DeepSeek R2, which is likely to be DeepSeek's next frontier model and could compete with the next generation of AI.

So here's how the whole thing works, broken down simply. The goal is to get an AI, let's say ChatGPT in this example, to improve. The way you do that is to train another AI to judge how good the first one's answers are. That judge is called a reward model, and this paper sets out to build a very good, versatile judge.

Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers but perform poorly on creative answers, or vice versa. Number two, these judges don't improve much on the spot, so giving them more compute at inference time doesn't necessarily make their judgments better; they mostly improve through training. Because of these ingrained issues, if we want to improve a model's responses using reward models, we have to come up with a better solution. And that's where DeepSeek's solution comes into the picture: the GRM judge.

The paper's solution is called DeepSeek-GRM, and it's a different kind of judge. Instead of just outputting a score like seven out of ten, this judge writes out its reasoning: it explains why an answer is good or bad based on specific principles, which are rules it comes up with itself. Then the score is extracted from that reasoning. The reason they've chosen this design is that it's more flexible, more detailed, and, more importantly, if you ask it to judge the same thing multiple times, it might write slightly different reasons or principles, leading to slightly different scores.
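To illustrate that "write the reasoning, then extract the score" design, here's a minimal sketch under my own assumptions. `generate_critique` is a hypothetical placeholder for a real call to a generative judge model, and the "Final score: N/10" format is an invented convention for this example, not the paper's actual prompt.

```python
import re

# Hypothetical stand-in for a call to a generative judge model.
# In practice this would be a chat-completion call to a model
# prompted to write principles, then a critique, then a score.
def generate_critique(question, answer):
    return (
        "Principles: the answer should be factually correct, complete, "
        "and directly address the question.\n"
        "Critique: the answer is correct but omits an edge case.\n"
        "Final score: 7/10"
    )

# Extract the numeric score from the written critique, so the
# free-form reasoning can still drive a scalar reward signal.
def extract_score(critique):
    match = re.search(r"Final score:\s*(\d+)\s*/\s*10", critique)
    return int(match.group(1)) if match else None

critique = generate_critique("What is 2+2?", "4")
print(extract_score(critique))  # -> 7
```

The point of the design is that the scalar reward is a by-product of the reasoning, so the same judge can be sampled repeatedly to get genuinely different critiques and scores.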
So what they actually do is train this judge with SPCT, Self-Principled Critique Tuning. This uses reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and when its final judgment matches the correct judgment, it gets a reward and learns to do that more often. Over time they're reinforcing the good behavior: this teaches the judge AI to generate good principles and critiques that lead to accurate judgments, which gradually makes it smarter.

Making it better on the spot is where inference-time scaling comes in; that famous paradigm everyone is talking about is once again making headlines. What they do here is ask the judge multiple times, which is called sampling. When they want a judgment, they query their trained judge repeatedly, say eight or 32 times. Then they combine those judgments through voting: they collect the scores from all the tries and combine them into, essentially, an average. All of this happens at inference time.

They also have a smarter way of combining. They trained another, tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. Then they combine the scores only from the critiques the meta RM thought were good. Asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than asking once. It's much more compute-intensive, but it does give a better result.

Now, you might be thinking, all right, that's pretty interesting, but what are the results? The results are that this performs really well. The AI judge works well across many different tasks. The ask-multiple-times strategy dramatically improves the judge's accuracy, and the more times they ask it, meaning the more compute they use, the better it gets.
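Here's a toy simulation, again my own sketch rather than the paper's method, of that sample-vote-filter pipeline: sample the judge k times, average all scores for simple voting, and for meta-RM-guided voting keep only the critiques a simulated meta RM rates highest before averaging. All function names and numbers here are illustrative assumptions.

```python
import random
import statistics

# Hypothetical stand-in for one sampled judgment from the trained judge.
# Each call returns a (critique, score) pair with some noise, since a
# generative judge writes slightly different reasoning every time.
def sample_judgment():
    score = random.gauss(7.0, 1.5)  # noisy score around a "true" rating of 7
    critique = f"Critique draft scoring {score:.1f}/10"
    return critique, max(0.0, min(10.0, score))

# Hypothetical stand-in for the meta RM. In this toy simulation its
# quality signal is closeness to the true rating, standing in for a
# small trained model that recognizes well-reasoned critiques.
def meta_rm_quality(critique, score):
    return -abs(score - 7.0) + random.gauss(0, 0.3)

def judge(k=32, keep_fraction=0.5):
    samples = [sample_judgment() for _ in range(k)]
    # Simple voting: average every sampled score.
    plain_vote = statistics.mean(s for _, s in samples)
    # Meta-RM-guided voting: keep only the highest-rated critiques,
    # then average just those scores.
    ranked = sorted(samples, key=lambda cs: meta_rm_quality(*cs), reverse=True)
    kept = ranked[: max(1, int(k * keep_fraction))]
    guided_vote = statistics.mean(s for _, s in kept)
    return plain_vote, guided_vote

random.seed(0)
plain, guided = judge()
print(f"simple voting:  {plain:.2f}/10")
print(f"meta RM voting: {guided:.2f}/10")
```

The trade-off is exactly what the video describes: every extra sample costs more compute at inference time, but the filtered average tends to sit closer to the true rating than any single noisy judgment.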
And their judge AI, even though it's medium-sized, could outperform much larger AIs like GPT-4o used as a judge, at least when those larger AIs were only asked once. So overall it's looking really good. And using the meta RM, that tiny helper AI, to pick out the best critiques works even better than simple voting.

They basically built a really smart AI judge that explains its reasoning, and they trained it in a really clever way. Crucially, this judge gets better if you let it think multiple times, which is just another way of using more compute, and that allows a moderately sized AI judge to achieve top performance when needed.

So overall, this is once again something really interesting, because we know DeepSeek is continuing to push the frontier in terms of innovation. And I will say it's really interesting that the tides are shifting: it's not like China is just copying the West anymore; they're innovating on their own front.

Now, one thing I do want to talk about is the fact that R2 is coming. We know that DeepSeek's next frontier models are coming fairly soon, and with this recent research paper, I'm wondering if this technique is going to be part of their next AI release. We know they're working super hard, and some would argue they're even ahead of recent frontier labs. A lot of people know about Meta's Llama 4 and the many mishaps around how that model has performed. So DeepSeek is rushing to launch a new AI model and going all in, and the article reporting this was from February 2025. I'm wondering when this model will be released, because we still have many other open-source AIs like Qwen coming soon, and potentially even GPT-5 and o3 Mini on the horizon. So it's super important that DeepSeek maintain their momentum, as it's quite hard to stay relevant in the AI space. They do state that it's possible they'll release the model as early as May; that was the plan, to release it in May, but now they want it out as early as possible.
So I wouldn't personally be surprised if we do get DeepSeek R2 at the end of this month. It's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I do think this could happen, but of course we'll have to see. Individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment for the AI industry, and I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. As I said before, if you watched yesterday's video, you'll know there was a ton of controversy around the Llama 4 release, with many critics arguing that the benchmarks were gamed and the model simply isn't that good. So it will be interesting to see exactly how the release of R2, when it arrives, affects the AI industry.