DeepSeek's Self-Learning "Breakthrough" Is Incredible (DeepSeek R2 News)

So, is DeepSeek building a self-improving AI? That's the latest claim coming out of this newspaper, which reports that last Friday DeepSeek released a paper on self-improving AI models. There are a few caveats here, and while there is some truth to the claim, I'm going to dive into how it all works and what's really going on. The article starts by saying "China's DeepSeek finds a way to help AI get better at answering questions. Here's how it works." This got the Twittersphere talking, because plenty of people were wondering what exactly this is and how DeepSeek is changing the game when it comes to having an AI improve itself. We all know DeepSeek is an open-research, open-source company, which of course means they publish their research.
Right here you can see the research that was published: a paper on inference-time scaling for generalist reward modeling. Trust me, I'm not going to bore you with every specific; I'll explain it in a way you can understand. The headline result is that, with inference-time scaling, the model does improve: each additional time you sample the model, the rating it produces becomes more accurate. Say, for example, we ask the AI assistants, which are the reward models in this case, to rate how good another AI's response was. The graph shows how accurate those ratings become: one axis shows the judge's performance, and the other shows how many times it was sampled.
That isn't the only graph worth looking at, because they also show other models for comparison, including GPT-4o, which is really interesting. Although I will say I'm not sure which version of GPT-4o this is, as there are multiple versions currently in existence. Their model outperforming it with inference-time scaling comes with a bit of a caveat, because GPT-4o isn't actually an inference-time model; I'm guessing they're just showing how good their model gets in terms of overall performance. So let me explain how this works, because I think it's probably going to be used as part of the base of the next model, DeepSeek R2, which is likely to be DeepSeek's next frontier model and could outperform the next generation of AI.
So here's how this whole thing works, broken down simply. The goal is to get an AI, let's say ChatGPT in this example, to improve. The way to do that is to train another AI to judge how good its answers are. That judge is called a reward model, and this paper essentially sets out to build a very good, versatile judge.
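To make the judge's role concrete, here is a minimal sketch of the conventional setup: a reward model takes a question and an answer and returns a single score, and that score is what would be used to steer the main model. The `toy_reward_model` heuristic below is purely illustrative (it ignores the question entirely) and is not a real trained model.

```python
# Minimal sketch of a reward model acting as a judge: it scores (question,
# answer) pairs, and the scores are used to pick or reinforce better answers.
# `toy_reward_model` is a hand-written stand-in, not a trained model.
def toy_reward_model(question: str, answer: str) -> float:
    """Return a scalar score for (question, answer); higher means better."""
    score = 0.0
    if answer.strip():
        score += 1.0                        # said something at all
    if any(ch.isdigit() for ch in answer):
        score += 1.0                        # contains a concrete number
    if len(answer.split()) < 80:
        score += 0.5                        # concise
    return score

question = "What is 12 * 9?"
candidates = ["It's 108.", "I am not sure, maybe a big number?"]

# Pick the candidate the judge prefers -- the role a reward model plays
# when another AI's answers are being improved against it.
best = max(candidates, key=lambda a: toy_reward_model(question, a))
print("Judge prefers:", best)
```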
Now, there are a few problems with current AI judges. Number one, they're not general enough: they might be good at judging math answers but perform poorly on creative answers, or vice versa. Another issue is that these judges don't improve much on the spot; giving them more compute at inference time doesn't necessarily make their judgment better, since most of their improvement comes from training. Because these issues are baked into the judges we use to improve a model's responses, a better solution is needed, and that's where DeepSeek's solution comes into the picture.
This is where we get the GRM judge: the paper's solution is called DeepSeek-GRM. It's a different kind of judge. Instead of just outputting a score like seven out of ten, it writes out its reasoning, explaining why an answer is good or bad based on specific principles, which are the rules it comes up with itself. The score is then extracted from that reasoning. They chose this design because it's more flexible and more detailed, and, importantly, if you ask it to judge the same thing multiple times it may write slightly different principles and critiques, leading to slightly different scores.
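As a rough sketch of that generate-then-extract idea: the snippet below shows a judge output that lists principles, writes a critique, and ends with a rating, plus a tiny parser that pulls the numeric score out of the text. The output format and `fake_judge_output` are illustrative assumptions, not the paper's exact prompt or parser.

```python
# Sketch of a generative judge: the model writes principles and a critique,
# and a numeric score is then extracted from that written reasoning.
# The layout below (ending in "Score: N/10") is an assumed format.
import re
from typing import Optional

fake_judge_output = """\
Principles:
1. The answer should be factually correct.
2. The answer should directly address the question.
Critique:
The response states the correct result and shows the key step, but it does
not explain the reasoning in much depth.
Score: 8/10
"""

def extract_score(judge_text: str) -> Optional[float]:
    """Pull the final numeric rating out of the judge's written reasoning."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)\s*/\s*10", judge_text)
    return float(match.group(1)) if match else None

print("Extracted score:", extract_score(fake_judge_output))   # -> 8.0
```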
To train this judge they use SPCT (Self-Principled Critique Tuning), which relies on reinforcement learning, the same family of techniques you'd use to train an AI to play a game. The judge practices generating principles and critiques, and when the verdict it reaches matches the correct judgment it gets a reward, so it learns to do more of that. Over time the good behaviour is reinforced, which teaches the judge to generate principles and critiques that lead to accurate judgments and gradually make it smarter.
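Here is a heavily simplified sketch of that training signal, assuming a pairwise setup where the judge decides which of two answers is better and is rewarded when its verdict matches a known label. The `sample_judge_verdict` and `reward_for` helpers are placeholders I've introduced for illustration; real SPCT optimizes a large generative judge with reinforcement learning.

```python
# Simplified sketch of the SPCT-style reward: the judge samples principles
# and a critique for a pair of answers (stubbed out here as a coin flip),
# and it is rewarded when its verdict matches the known correct preference.
import random

def sample_judge_verdict(question: str, answer_a: str, answer_b: str) -> str:
    """Placeholder for the judge generating principles + critique and
    concluding which answer is better ('A' or 'B')."""
    return random.choice(["A", "B"])

def reward_for(verdict: str, correct: str) -> float:
    """Reinforcement signal: 1.0 when the judge's verdict matches the label."""
    return 1.0 if verdict == correct else 0.0

random.seed(0)
labelled_pair = ("What is 12 * 9?", "It's 108.", "Probably around 120.", "A")

total = 0.0
for step in range(5):
    q, a, b, correct = labelled_pair
    verdict = sample_judge_verdict(q, a, b)
    r = reward_for(verdict, correct)
    total += r
    # In real training, this reward would update the judge's weights so that
    # principle/critique patterns leading to correct verdicts become more likely.
    print(f"step {step}: verdict={verdict} reward={r}")
print("average reward:", total / 5)
```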
Of course, making it better on the spot is where inference-time scaling comes in; that famous paradigm everyone keeps talking about is once again making headlines. What they do is ask the judge multiple times, which is called sampling. When they want a judgment, they query their trained judge several times, say 8 or 32 times, and then combine those judgments through voting: they collect all the scores from the multiple tries and combine them into what is, roughly speaking, an average. All of this happens at inference time.
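A minimal sketch of that sampling-and-voting step, with a noisy stand-in for the real judge: query it k times, collect the slightly different scores, and average them into one rating. The `sample_judge_score` function is an assumption, not the actual model.

```python
# Sketch of sampling + voting at inference time: ask the judge k times
# (e.g. 8 or 32), collect the varying scores, and combine them by averaging.
import random
import statistics

def sample_judge_score(question: str, answer: str) -> float:
    """Placeholder for one full judge pass (principles -> critique -> score)."""
    return random.gauss(7.0, 1.2)      # pretend the true quality is ~7, with noise

def vote(question: str, answer: str, k: int = 8) -> float:
    """Ask the judge k times and combine the scores (simple averaging here)."""
    scores = [sample_judge_score(question, answer) for _ in range(k)]
    return statistics.mean(scores)

random.seed(0)
print("k=1 :", round(sample_judge_score("q", "a"), 2))
print("k=8 :", round(vote("q", "a", k=8), 2))
print("k=32:", round(vote("q", "a", k=32), 2))
```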
  • 5:44 - 5:46
    combining which is where they trained
  • 5:46 - 5:50
    another tiny AI called a metarm whose
  • 5:50 - 5:52
    only job is to quickly rate how good
  • 5:52 - 5:54
    each written critique from the main
  • 5:54 - 5:56
    judge was Then they combine the scores
  • 5:56 - 5:58
    only from the critiques that the Metar
  • 5:58 - 6:00
    RM thought were good And this is because
  • 6:00 - 6:02
    asking multiple times and combining
  • 6:02 - 6:04
    results especially the meta RM helper
  • 6:04 - 6:06
    makes the final judgment much more
  • 6:06 - 6:08
    reliable than just asking once And this
  • 6:08 - 6:11
    is much more compute inensive but it
  • 6:11 - 6:13
    does give a better result Now overall
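Building on the previous sketch, here is one hedged way that meta-RM filtering could look: a tiny scorer rates each sampled critique, and only the scores attached to the best-rated critiques are averaged. The `toy_meta_rm` heuristic and the "keep the top half" rule are assumptions for illustration, not the paper's actual procedure.

```python
# Sketch of meta-RM-guided voting: each judge sample yields (critique, score);
# a small "meta RM" rates critique quality, and only scores whose critiques it
# rates highly are combined. `toy_meta_rm` and the top-half cutoff are
# illustrative assumptions, not DeepSeek's actual components.
import random
import statistics

def sample_judge(question: str, answer: str) -> tuple[str, float]:
    """Placeholder judge pass: returns a written critique and a score."""
    score = random.gauss(7.0, 1.5)
    critique = f"The answer is mostly correct; I rate it {score:.1f}/10."
    return critique, score

def toy_meta_rm(critique: str) -> float:
    """Stand-in meta RM: rates critique quality (here: longer = better,
    purely for illustration)."""
    return float(len(critique))

def meta_guided_vote(question: str, answer: str, k: int = 8) -> float:
    samples = [sample_judge(question, answer) for _ in range(k)]
    # Keep only the top half of critiques as ranked by the meta RM,
    # then average the scores attached to those critiques.
    ranked = sorted(samples, key=lambda cs: toy_meta_rm(cs[0]), reverse=True)
    kept = ranked[: max(1, k // 2)]
    return statistics.mean(score for _, score in kept)

random.seed(0)
print("meta-guided score:", round(meta_guided_vote("q", "a", k=8), 2))
```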
Overall, you might be thinking, all right, that's pretty interesting, but what are the results? The results are that this performs really well. The judge works well across many different tasks, the ask-multiple-times strategy dramatically improves its accuracy, and the more times they ask it, that is, the more compute they use, the better it gets. Their judge, even though it's only medium-sized, could outperform much larger models like GPT-4o used as a judge when those larger models were only asked once, so overall it's looking really good. And using the meta RM, that little helper that picks out the best critiques, works even better than simple voting. They've basically built a really smart AI judge that explains its reasoning, they trained it in a clever way, and crucially this judge gets better if you let it think multiple times, which amounts to using more compute. That allows a moderately sized judge to achieve top performance when needed.
So overall this is once again something really interesting, because we know DeepSeek is continuing to push the frontier in terms of innovation. And I will say it's really interesting to see the tides shifting: it's not that China is copying the West anymore, they're genuinely innovating on their own front.

Now, one thing I do want to talk about is the fact that R2 is coming. We know DeepSeek's next frontier models are coming fairly soon, and with this recent research paper I'm wondering whether it will be part of their next AI release. We know they are working incredibly hard, and some would argue they're even ahead of recent frontier labs. A lot of people know about Meta's Llama 4 and the many mishaps around how that model has performed. Reports say DeepSeek is rushing to launch a new AI model and going all in, and that article was from February 2025, so I am wondering when this model will actually be released, because there are still many other open-source and competing models like Qwen coming soon, and potentially even GPT-5 and o3-mini, so it's very important that DeepSeek maintains momentum; it's quite hard to stay relevant in the AI space.

They do state that the model could be released as early as May: the original plan was a May release, but now they reportedly want it out as early as possible. So I personally wouldn't be surprised if we get DeepSeek R2 by the end of this month. It's a once-in-a-lifetime opportunity to put your company even further ahead of OpenAI and steal the limelight once again. I do think this could happen, but of course we'll have to see. People within the industry are saying the launch of DeepSeek's R2 model could be a pivotal moment for the AI industry, and I do wonder how this news of DeepSeek's further advancements will affect companies like Meta and their open-source efforts. As I said before, if you watched yesterday's video you'll know there was a ton of controversy around the Llama 4 release, with many critics arguing that the benchmarks were gamed and the model simply isn't that good. So it will be interesting to see exactly how the announcement of R2, when it is released, affects the AI industry.