So, are DeepSeek creating a self-improving AI? Well, that's the latest claim coming out of this newspaper: last Friday, DeepSeek released a paper on self-improving AI models. Now, there are a few caveats here. There is some truth to the claim, but I'm going to dive into how this all works and what's really behind it. You can see right here that the article starts by saying, "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This got the Twittersphere buzzing, because so many people were wondering what exactly this is and how DeepSeek are changing the game when it comes to having an AI improve itself. Now, we all know that DeepSeek are an open-research, open-source company, which means they publish their research. Right here you can see the research in question: it's titled "Inference-Time Scaling for Generalist Reward Modeling." Trust me, I'm not going to bore you with every specific; I'll explain this in a way you can understand. The headline result is that yes, with inference-time scaling, the model does improve. What we can see here, and it's really not that hard to follow, is that every time you sample the model, the rating becomes more accurate. Say, for example, we ask the AI assistants, which are the reward models in this case, to rate how good an AI's response was. The graph shows how accurate those ratings get: one axis shows the AI's performance, and the other shows how many times it tried. And this isn't the only graph showing how impressive this is, because importantly they also compare against other models here, including GPT-4o, which is really interesting. Although I will say I'm not sure which
version of GPT-4o this is, as there are multiple versions of GPT-4o currently in existence. So outperforming it with inference-time scaling comes with a caveat, because GPT-4o isn't actually an inference-time reasoning model; I'm guessing they're simply showing how good their model gets in terms of overall performance. So let me explain how this works, because I think it will probably form the basis of the next model, DeepSeek R2, which is likely to be DeepSeek's next frontier model and could outperform the next level of AI. Here's how the whole thing works, broken down simply. The goal is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train another AI to judge how good the answers are. That judge is called a reward model, and this paper sets out to build a very good, versatile judge. Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging maths answers but fall down on creative answers, or vice versa. Number two, these AI judges don't improve much on the spot, so giving them more compute at inference time doesn't necessarily make their judgments better; mostly they only improve through more training. Because these issues are ingrained in the judges we use to improve a model's responses, we need a better solution, and that's where DeepSeek's comes into the picture: the GRM judge. The paper's solution is called DeepSeek-GRM, and it's a different kind of judge. Instead of just outputting a score like seven out of ten, this judge writes out its reasoning, explaining why an answer is good or bad based on
specific principles, which are rules the judge comes up with itself. The score is then extracted from that reasoning. The reason they've chosen this design is that it's more flexible and more detailed, and importantly, if you ask it to judge the same thing multiple times, it may write slightly different principles and reasoning, leading to slightly different scores. They train this judge with SPCT, Self-Principled Critique Tuning, which uses reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge AI practices generating principles and critiques, and when its final judgment matches the correct judgment, it gets a reward, so it learns to do that more often. Over time they're reinforcing the good behavior, teaching the judge AI to generate principles and critiques that lead to accurate judgments, which makes it smarter. Making it better on the spot is where inference-time scaling comes in, that famous paradigm everyone is talking about, once again making headlines. What they do is ask the judge multiple times, which is called sampling: when they want a judgment, they query their trained judge eight or 32 times. Then they combine those judgments by voting, collecting all the scores from the multiple tries and combining them into, I guess you could say, an average. All of this happens at inference time. On top of that they have smarter combining: they trained another tiny AI called a meta RM, whose only job is to quickly rate how good each written critique from the main judge is. Then they combine the scores only from the critiques that the meta RM
thought were good. Asking multiple times and combining the results, especially with the meta RM helper, makes the final judgment much more reliable than asking once. It is much more compute-intensive, but it gives a better result. Now, you might be thinking, all right, that's pretty interesting, but what are the results? The results are that this performs really well: the AI judge works well across many different tasks, the ask-multiple-times strategy dramatically improves accuracy, and the more times they ask, i.e. the more compute they use, the better it gets. Their judge AI, despite being only medium-sized, could outperform much larger AIs like GPT-4o used as a judge, if those larger AIs were only asked once. And using the meta RM, that tiny helper, to pick out the best critiques works even better than simple voting. So they've built a really smart AI judge that explains its reasoning, trained it in a clever way, and crucially, this judge gets better if you let it think multiple times, which is the same as spending more compute. That allows a moderately sized AI judge to achieve top performance when needed. Overall this is really interesting, because DeepSeek are continuing to push the frontier of innovation, and the tides are shifting: China isn't just copying the West anymore, they're innovating on their own front. Now, one thing I do want to talk about is that R2 is coming. We know DeepSeek's next frontier models are due fairly soon, and with this recent research paper I'm wondering whether it will be part of their next AI release. We know they're working extremely hard, and some would argue they're even ahead of other recent frontier labs. A lot of people know about Meta's Llama 4 and
how there have been many, many mishaps with regards to how that model has performed. So DeepSeek are rushing to launch a new AI model and going all in; the article reporting this was from February 2025, so I'm wondering when the model will actually arrive, because we still have many other open-source AIs like Qwen coming soon, and potentially even GPT-5 and o3-mini. It's super important that DeepSeek maintain their momentum, because it's hard to stay relevant in the AI space. They do state that they may release the model as early as May; that was the original plan, but now they want it out as soon as possible. So I personally wouldn't be surprised if we get DeepSeek R2 at the end of this month. It's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I think this could potentially happen, but of course we'll have to see. Individuals within the industry are saying that the launch of DeepSeek's R2 model could be a pivotal moment for the AI industry, and I do wonder how this news of DeepSeek's further advancements is going to affect companies like Meta and their open-source efforts. Like I said before, if you watched yesterday's video you'll know there was a ton of controversy around the Llama 4 release, with many critics arguing that they gamed the benchmarks and the model simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.
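The inference-time recipe described earlier, sample the trained judge several times, have a tiny meta RM rate each written critique, and average only the scores extracted from the best critiques, can be sketched in a few lines of Python. To be clear, this is a toy sketch: `judge`, `meta_rm`, and the "Score: N/10" text format are stand-ins invented here for illustration, not DeepSeek's actual models, prompts, or API.

```python
import random
import re

def judge(answer: str) -> str:
    """Toy stand-in for the GRM judge: writes principles and a
    critique, then ends with a score embedded in the text."""
    score = random.randint(6, 9)  # a real judge reasons; this one rolls dice
    return (f"Principle: answers should be factual and complete. "
            f"Critique: '{answer}' is reasonable. Score: {score}/10")

def meta_rm(critique: str) -> float:
    """Toy stand-in for the meta RM: rates critique quality in [0, 1]."""
    return random.random()

def extract_score(critique: str) -> int:
    """Pull the numeric score back out of the judge's written reasoning."""
    return int(re.search(r"Score:\s*(\d+)/10", critique).group(1))

def sampled_judgment(answer: str, k: int = 8, keep: int = 4) -> float:
    """Ask the judge k times, keep the critiques the meta RM rates
    highest, and average the scores extracted from the survivors."""
    critiques = [judge(answer) for _ in range(k)]
    best = sorted(critiques, key=meta_rm, reverse=True)[:keep]
    return sum(extract_score(c) for c in best) / len(best)

final = sampled_judgment("The capital of France is Paris.")
print(final)
```

Note that setting `keep` equal to `k` degenerates to plain voting (a simple average over all samples); filtering with the meta RM before averaging is the extra step the paper reports as beating simple voting.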