So, are DeepSeek creating a self-improving AI? That's the latest claim coming out of this newspaper, which reports that last Friday DeepSeek released a paper on self-improving AI models. Now, there are a few caveats to this. There is some truth to the claim, but I'm going to dive into how it all works and what's really behind it.

You can see the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This got the Twittersphere talking, because so many people were wondering what on earth this is and how DeepSeek are actually changing the game when it comes to having an AI improve itself. Now, we all know that DeepSeek are an open-research, open-source company, which of course means they publish their research. Right here you can see the paper that was published; it's about inference-time scaling for generalist reward modeling. Trust me, I'm not going to bore you with the specifics of everything; I'll
actually explain this to you in a way that you can understand. The result is that, yes, with inference-time scaling the model does improve. What we can see here, and it's actually not that hard to understand, is how accurate the rating becomes each time you sample the model. Say, for example, we ask the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph shows how accurate those ratings get: one axis shows the judge's performance, the other shows how many times it was sampled.

And this isn't the only graph showing how strong this is, because, importantly, they also compare against other models here, including GPT-4o, which is really interesting. Although I will say I'm not sure which version of GPT-4o this is, as there are multiple versions of GPT-4o that currently exist. So this model outperforming it with inference-time scaling comes with a caveat, because GPT-4o isn't actually an inference-time model.
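To make the "sample it many times" idea concrete, here's a toy Python sketch. This is not DeepSeek's code: `noisy_judge` is a made-up stand-in for a reward model that is only right most of the time, and the error rate and 1-10 scale are invented for illustration. The point is simply that majority voting over repeated samples recovers the correct rating far more often than a single query.

```python
import random

def noisy_judge(correct_score: int, error_rate: float = 0.4) -> int:
    """Toy reward model: returns the correct 1-10 score most of the
    time, and a random wrong score otherwise."""
    if random.random() < error_rate:
        return random.randint(1, 10)
    return correct_score

def sample_and_vote(correct_score: int, k: int) -> int:
    """Sample the judge k times and take the most common score
    (simple majority voting at inference time)."""
    votes = [noisy_judge(correct_score) for _ in range(k)]
    return max(set(votes), key=votes.count)

def accuracy(correct_score: int, k: int, trials: int = 2000) -> float:
    """Fraction of trials where voting over k samples lands on the
    correct score."""
    hits = sum(sample_and_vote(correct_score, k) == correct_score
               for _ in range(trials))
    return hits / trials

random.seed(0)
for k in (1, 8, 32):
    print(f"k={k:>2}  accuracy={accuracy(7, k):.2f}")
```

Run it and the accuracy climbs as k grows, which is exactly the shape of the curves in the paper's graph: more samples, better judgments.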
I'm guessing they're just showing how good the model gets in terms of overall performance. So let me explain how this works, because I think it's probably going to be used in the base of the next model, DeepSeek R2, which is likely to be DeepSeek's next frontier model, one that could compete with the next generation of AI.

Here's how the whole thing works, broken down. The goal is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train it by having another AI judge how good its answers are. That judge is called a reward model, and this paper essentially sets out to build a very good, versatile judge.

Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers, but when they judge creative answers, or vice versa, they just don't perform that well. Number two, these AI judges don't improve much on the spot, so giving them more compute with
inference time doesn't necessarily make their judgment much better; they mostly improve through training. And because these issues are baked into the judges, anyone trying to improve a model's responses with reward models has to come up with a better solution. That's where DeepSeek's solution comes into the picture: the GRM judge.

The paper's solution is called DeepSeek-GRM, and it's a different kind of judge. Instead of just outputting a score like seven out of ten, this judge writes out its reasoning: it explains why an answer is good or bad based on specific principles, which are the rules it comes up with itself. Then the score is extracted from that reasoning. They chose this design because it's more flexible and more detailed, and, more importantly, if you ask it to judge the same thing multiple times, it might write slightly different principles or critiques, leading to slightly different scores. So what they actually do is train this judge with SPCT, Self-Principled Critique Tuning.
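As a rough illustration of the "write reasoning, then extract the score" idea, here's a small Python sketch. The `judge_response` function just returns canned text standing in for what a real generative reward model would write; the `Score: N/10` format and the regex used to pull the number out are my own assumptions for the example, not the paper's actual format.

```python
import re

def judge_response(question: str, answer: str) -> str:
    """Stand-in for the generative judge: a real GRM is an LLM that
    writes principles and a critique ending in a score. Here the
    generated text is faked so the extraction step can be shown."""
    return (
        "Principles: answers should be correct and concise.\n"
        f"Critique: the answer '{answer}' addresses '{question}' "
        "directly and is factually correct, though brief.\n"
        "Score: 8/10"
    )

def extract_score(critique: str) -> int:
    """Pull the numeric score out of the judge's written reasoning."""
    match = re.search(r"Score:\s*(\d+)\s*/\s*10", critique)
    if match is None:
        raise ValueError("no score found in critique")
    return int(match.group(1))

critique = judge_response("What is 2 + 2?", "4")
print(extract_score(critique))  # → 8
```

The key design point is that the score is a by-product of the written critique, so the same reasoning text can be inspected, re-sampled, or rated by another model later.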
This is where they use reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and when its final judgment matches the correct judgment, it gets a reward and learns to do that more. Over time they're reinforcing the good behavior, and that teaches the judge AI to generate principles and critiques that lead to accurate judgments, making it smarter over time.

Making it better on the spot, of course, is where inference-time scaling comes in. That famous paradigm everyone is talking about is once again making headlines. What they do is ask the judge multiple times, which is called sampling: when they want a judgment on an answer, they ask their trained judge eight or 32 times. Then they combine those judgments with voting, where they collect all the scores from the
multiple tries and combine them into, I guess you could say, an average. And all of this happens at inference time. They also have a smarter way of combining, where they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge was. Then they combine the scores only from the critiques the meta RM thought were good. Asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than asking once. It's much more compute-intensive, but it does give a better result.

Now, overall, you might be thinking: all right, that's pretty interesting, but what are the results? The results are that this performs really well. The AI judge works well across many different tasks, the ask-multiple-times strategy dramatically improves its accuracy, and the more times they ask it, or the more compute they use, the better it gets. Their judge AI, even though it's medium-sized, could outperform much larger AIs, like GPT-4o used as a judge.
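The meta-RM-guided voting step can be sketched in a few lines of Python. Everything here is a stand-in: the real meta RM is a trained model, while `meta_rm` below is just a toy heuristic, and the sample data is invented. The sketch only shows the combining logic: rate every critique, keep the best-rated ones, and average only their scores, so one confident-sounding but wrong sample can't drag the result around.

```python
def meta_rm(critique: str) -> float:
    """Stand-in for the tiny meta reward model: rates how trustworthy
    a critique looks. Toy heuristic only: longer critiques that
    mention evidence score higher."""
    score = min(len(critique) / 100, 1.0)
    if "evidence" in critique:
        score += 0.5
    return score

def guided_vote(samples: list[tuple[int, str]], keep: int = 2) -> float:
    """Combine sampled (score, critique) judgments: rank critiques by
    the meta RM, keep only the `keep` best, average their scores."""
    ranked = sorted(samples, key=lambda s: meta_rm(s[1]), reverse=True)
    kept = ranked[:keep]
    return sum(score for score, _ in kept) / len(kept)

samples = [
    (9, "short"),
    (7, "a detailed critique citing evidence for each principle it applied"),
    (7, "another long critique that walks through the evidence carefully"),
    (2, "meh"),
]
print(guided_vote(samples))  # → 7.0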
At least, that's when those larger AIs were only asked once. So overall it's looking really good, and using the meta RM, that tiny helper AI, to pick out the best critiques works even better than simple voting. They've basically built a really smart AI judge that explains its reasoning, trained it in a clever way, and, crucially, this judge gets better if you let it think multiple times, which just means using more compute. That allows a moderately sized AI judge to achieve top performance when needed.

So overall, this is once again something really interesting, because we know DeepSeek are continuing to push the frontier in terms of innovation. And I will say it's really interesting that the tides are shifting: it's not that China is copying the West anymore; they're innovating on their own front.

Now, one thing I do want to talk about is, of course, the fact that R2 is coming. We know DeepSeek's next frontier models are coming fairly soon, and with this recent research paper I'm wondering if it's going to be part of their next
AI release. We know they're working super hard, and some would argue they're even ahead of certain frontier labs. A lot of people know about Meta's Llama 4 and the many mishaps around how that model has performed. So DeepSeek are rushing to launch a new AI model and going all in; that article was from February 2025, so I'm wondering when the model will actually be released, because we still have many other open-source AIs, like Qwen, coming soon, and potentially even GPT-5 and o3-mini. It's super important that DeepSeek maintain their momentum, because it's quite hard to stay relevant in the AI space.

They do state it's possible they'll release the model as early as May; that was the plan, to release it in May, but now they want it out as early as possible. So I personally wouldn't be surprised if we get DeepSeek R2 at the end of this month. It's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I do
think that could potentially happen, but of course we'll have to see. Individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment in the AI industry, and I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. Like I said before, if you watched yesterday's video, you'll know there was a ton of controversy around the Llama 4 release, with many critics arguing that the benchmarks were gamed and the model simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.