Who owns the content? The issue with generative AI and copyright

0:00 - 0:03

Imagine you're chatting with your boss.
0:03 - 0:04

She asked you a question about an
0:04 - 0:07

approach to social media marketing.
0:07 - 0:10

You were just reading a book about this last
0:10 - 0:13

night, so the topic is fresh in your mind.
0:13 - 0:16

You're ready to answer, but what you say
0:16 - 0:19

is, "Sorry, my knowledge on this topic is
0:19 - 0:22

derived from a copyrighted work.
0:22 - 0:24

Here is a link to Amazon where you
0:24 - 0:26

can buy a copy of it to read yourself."
0:26 - 0:29

That would be weird, right? Your boss
0:29 - 0:31

probably wouldn't be too pleased, and
0:31 - 0:33

that's not the way the accumulation of
0:33 - 0:36

knowledge works for humans. But in the
0:36 - 0:39

debate about AI, a tool that as our
0:39 - 0:41

co-pilot is being thought about as an
0:41 - 0:44

assistant or a colleague in our work,
0:44 - 0:47

that is one potential road we may go
0:47 - 0:50

down because of our copyright laws and
0:50 - 0:53

because of the assertions
of copyright holders.
0:53 - 0:55

In this video, I want to dig into the
0:55 - 0:58

issue of training AI, and how
0:58 - 1:00

the ownership of the material that's
1:00 - 1:03

used for training should be thought about,
1:03 - 1:06

managed, and compensated. Since the
1:06 - 1:09

introduction of ChatGPT last year, the
1:09 - 1:11

issue of ownership of training data for
1:11 - 1:14

AI models has come up again and again.
1:14 - 1:16

We have seen various efforts through which
1:16 - 1:18

content owners, whether they be
1:18 - 1:21

well-known authors or celebrities,
1:21 - 1:24

artists, even Reddit and the New York
1:24 - 1:27

Times have tried to put up roadblocks—
1:27 - 1:29

legal and otherwise—to having their
1:29 - 1:32

content consumed by AI companies in the
1:32 - 1:34

training of their models.
1:34 - 1:36

There are good reasons for us to
1:36 - 1:38

be concerned about what data our AI
1:38 - 1:41

companions are trained on. A training
1:41 - 1:44

dataset consisting of
only the New York Times
1:44 - 1:48

versus only Reddit would clearly
1:48 - 1:49

lead to some very different
1:49 - 1:51

understandings around topics of
1:51 - 1:53

importance to our world. Unless the data
1:53 - 1:55

demonstrates a broad range of the
1:55 - 1:57

experiences we have and the concepts
1:57 - 2:00

we've come up with as humans, AI will be
2:00 - 2:02

limited in its ability to help us in a
2:02 - 2:04

relevant way. But the issue I'm talking
2:04 - 2:06

about today isn't really connected to
2:06 - 2:09

that quality issue, apart from my concern
2:09 - 2:12

that if rights holders to high-quality
2:12 - 2:14

work aim to have it extracted from our
2:14 - 2:17

AI models, what would be left? While there
2:17 - 2:19

have been various arguments against
2:19 - 2:22

using certain data for AI training, most
2:22 - 2:24

of them come back to one thing:
2:24 - 2:28

compensation. Those who assert ownership
2:28 - 2:30

rights over certain content that could
2:30 - 2:33

be useful to AI training datasets want
2:33 - 2:35

their slice of the pie that these
2:35 - 2:37

companies will generate with the AI
2:37 - 2:40

models they have trained. And given that
2:40 - 2:42

the AI economy could be worth up to
2:42 - 2:47

13 trillion dollars by 2030,
according to McKinsey
2:47 - 2:49

(which, if it were a country in its own
2:49 - 2:51

right, would be the world's third-largest
2:51 - 2:55

economy based on 2022 data),
who can blame
2:55 - 2:58

them, right? Well, it makes perfect sense,
2:58 - 3:00

apart from the fact that in most
3:00 - 3:03

cases, we don't license knowledge in this
3:03 - 3:05

way. A rights holder is completely
3:05 - 3:08

entitled to compensation when their work
3:08 - 3:11

is consumed, but they don't normally get
3:11 - 3:14

residuals based on derivative output,
3:14 - 3:16

except in situations where those
3:16 - 3:19

derivatives are substantially similar to
3:19 - 3:22

the original. Under copyright law, we have
3:22 - 3:24

concepts such as fair use, but we also
3:24 - 3:26

have thousands of years of human history
3:26 - 3:28

where we can see a line where our
3:28 - 3:31

knowledge today simply derives from
3:31 - 3:33

those issues that have been the subject
3:33 - 3:35

of work in years gone by. Let's do a
3:35 - 3:38

little quiz. I'm going to describe
3:38 - 3:40

something, and you go ahead and think
3:40 - 3:43

about what I'm describing. Maybe write it
3:43 - 3:46

down, maybe call it out.
See if you can guess.
3:46 - 3:49

So, number one: I'm green, I'm kind
3:49 - 3:53

of slimy, and I can jump really well.
3:53 - 3:56

What do you think that is?
Well, if you thought
3:56 - 4:00

frog, then you'd be right. Number two: I'm
4:00 - 4:02

covered in black and white stripes,
4:02 - 4:05

I look kind of like a horse, and I live on
4:05 - 4:08

the African Plains.
What do you think it is?
4:08 - 4:11

Did you think zebra? Number three: I'm a
4:11 - 4:13

building on Pennsylvania Avenue in
4:13 - 4:15

Washington, D.C., where the
4:15 - 4:18

President of the United States lives.
4:18 - 4:20

Did you think White House?
4:20 - 4:23

But assuming you can get some or all of
4:23 - 4:26

those correct, can you tell me how you
4:26 - 4:28

know these things? Did you learn them
4:28 - 4:31

from a copyrighted work? Did you see them
4:31 - 4:33

in a copyrighted picture? Has the rights
4:33 - 4:36

holder given you permission or licensed
4:36 - 4:38

that information in such a way to allow
4:38 - 4:41

you to use it for your
benefit in daily life?
4:41 - 4:43

Think about everything else you know.
4:43 - 4:45

Are you sure that in your last
4:45 - 4:47

meeting at work, you didn't inadvertently
4:47 - 4:49

regurgitate something you read in a blog
4:49 - 4:52

for which you're not the rights holder?
4:52 - 4:53

If you excluded every piece of
4:53 - 4:55

information you've ever consumed that
4:55 - 4:58

was delivered by a book, or an article, or
4:58 - 5:01

a photo, or a video that may be owned by
5:01 - 5:05

someone else, how much
would you know now?
5:05 - 5:08

It's tricky, right? This is a knot that's
5:08 - 5:11

very hard to untie. While I can put
5:11 - 5:13

together a list of specific sources for
5:13 - 5:16

a lot of my content, there's a lot of the
5:16 - 5:18

flavor of what I talk about that's
5:18 - 5:20

simply built on years of knowledge.
5:20 - 5:23

I have no idea why I know many of the
5:23 - 5:25

things I know, and there certainly isn't
5:25 - 5:27

a demarcation in my brain for
5:27 - 5:30

copyrighted and non-copyrighted material.
5:30 - 5:32

But you're probably thinking, why is this
5:32 - 5:35

relevant to the issue of AI?
5:35 - 5:38

Well, while AI doesn't learn in the
5:38 - 5:40

same way humans do, its accumulation of
5:40 - 5:42

knowledge is in some ways remarkably
5:42 - 5:45

similar to ours. When we come across
5:45 - 5:48

someone who seems to know a lot and can
5:48 - 5:50

reference a lot of information, we will
5:50 - 5:53

sometimes describe them as being
5:53 - 5:58

well-read. In some ways,
GPT-4 is just the most
5:58 - 6:01

well-read intelligence on the planet—
6:01 - 6:03

it just happens to be an artificial one.
6:03 - 6:05

In general, when we describe a person as
6:05 - 6:08

well-read, it's as a result of them
6:08 - 6:10

showing off their knowledge, and it's a
6:10 - 6:12

compliment. I'm not sure whether there's
6:12 - 6:14

ever been a case where a human being
6:14 - 6:16

described as well-read has resulted in
6:16 - 6:18

them being questioned as to whether they
6:18 - 6:20

legally obtained the books they've been
6:20 - 6:23

reading, or had properly licensed the
6:23 - 6:25

knowledge they used.
Whether that well-read
6:25 - 6:28

person is a poor student or whether
6:28 - 6:31

they're a billionaire CEO, we just take
6:31 - 6:33

for granted that they have spent time
6:33 - 6:36

accumulating knowledge, and many of us
6:36 - 6:39

respect that, and we don't place a
6:39 - 6:41

knowledge tax on the benefits they've
6:41 - 6:43

accrued from that knowledge, financial or
6:43 - 6:46

otherwise. In my opinion, by focusing on
6:46 - 6:49

the value generated with knowledge and
6:49 - 6:51

attempting to find a way of sharing in
6:51 - 6:54

it, many content rights holders are
6:54 - 6:57

inadvertently seeking to upend centuries
6:57 - 7:00

of progress in how we educate and better
7:00 - 7:03

our society that was initiated with the
7:03 - 7:06

development of the printing press. And if
7:06 - 7:09

any modern author or artist claims that
7:09 - 7:11

anything they have ever produced is
7:11 - 7:14

truly original and not developed through
7:14 - 7:16

standing on the shoulders of centuries
7:16 - 7:18

of accumulated knowledge and art in our
7:18 - 7:21

society, then they're
just kidding themselves.
7:21 - 7:23

But more than trying to upend
7:23 - 7:27

the status quo, it gets worse. Have you
7:27 - 7:29

heard of subliminal advertising? This is
7:29 - 7:32

the concept that if you insert a message
7:32 - 7:35

or idea in a piece of media, maybe a
7:35 - 7:38

video or an audio ad, for example, that is
7:38 - 7:41

below the threshold of human conscious
7:41 - 7:44

perception, that your subconscious will
7:44 - 7:47

still pick up on it. The efficacy of this
7:47 - 7:50

approach to communicate anything is
7:50 - 7:53

debated, but there are limitations on
7:53 - 7:55

this sort of messaging in certain
7:55 - 7:57

jurisdictions, because ultimately, as
7:57 - 7:59

humans, we don't particularly like the
7:59 - 8:02

idea of our subconscious brains being
8:02 - 8:05

gamed by someone to their advantage.
8:05 - 8:07

Imagine if this approach were used in
8:07 - 8:10

the teaching materials used in schools.
8:10 - 8:13

Perhaps a textbook publisher takes
8:13 - 8:16

that page that taught you
that the White House
8:16 - 8:19

is the White House and secretly
8:19 - 8:21

inserts content that tries to convince
8:21 - 8:22

you that the structure is actually
8:22 - 8:25

Buckingham Palace in London. Maybe it
8:25 - 8:28

sends those particular textbooks to the
8:28 - 8:30

school districts that pay the least for
8:30 - 8:33

their materials, or it sends them to the
8:33 - 8:35

school districts that refresh their
8:35 - 8:37

materials less frequently than the
8:37 - 8:40

publisher would like. This ends up as a
8:40 - 8:43

penalty on knowledge based on the book's
8:43 - 8:45

rights holder's perception of some
8:45 - 8:47

financial slight or other. You have a
8:47 - 8:49

bunch of children out there in the world
8:49 - 8:50

whose knowledge has been intentionally
8:50 - 8:53

poisoned for commercial reasons, and as a
8:53 - 8:55

society, we have no idea what the
8:55 - 8:58

long-term implications of that might be.
8:58 - 9:01

The good news is that, as far as I know,
9:01 - 9:03

that isn't going on in schools. The bad
9:03 - 9:06

news is that some innovative individuals
9:06 - 9:08

with issues with AI are attempting to do
9:08 - 9:11

it for AI model training by similarly
9:11 - 9:13

poisoning the digital images they
9:13 - 9:15

publish on the internet. This technology
9:15 - 9:18

advance called Nightshade was released
9:18 - 9:21

as an open-source poison pill by
9:21 - 9:23

researchers at the University of Chicago
9:23 - 9:26

back in October. Their goal is to give
9:26 - 9:28

visual artists and image publishers a
9:28 - 9:31

tool to protect their work by corrupting
9:31 - 9:34

the AI data set with images that show
9:34 - 9:36

one thing, but the AI has learned to show
9:36 - 9:39

something else. But imagine if the
9:39 - 9:41

University of Chicago sanctioned
9:41 - 9:43

research that helped those textbook
9:43 - 9:45

publishers protect their intellectual
9:45 - 9:48

property by teaching a generation of
9:48 - 9:50

schoolchildren whose school boards have
9:50 - 9:52

not approved new book purchases that
9:52 - 9:55

zebras were actually frogs.
9:55 - 9:57

We would be outraged, and we should
9:57 - 10:00

be outraged by this research, as we have
10:00 - 10:03

absolutely no idea how far the long-term
10:03 - 10:05

consequences of such actions will go.
10:05 - 10:07

Now, I'm certainly not saying that rights
10:07 - 10:10

holders shouldn't be compensated. They
10:10 - 10:12

definitely should. And when we talk about
10:12 - 10:14

rights holders, we aren't just talking
10:14 - 10:17

about giant, wealthy corporations, we're
10:17 - 10:19

also talking about individuals, we're
10:19 - 10:21

talking about small businesses. I'm not
10:21 - 10:23

arguing that anyone should be made
10:23 - 10:26

worse-off by AI, and
it's really important that
10:26 - 10:30

AI tools still provide us
information like citations
10:30 - 10:32

to make it really easy to go off and
10:32 - 10:36

purchase or read the work if we need to.
10:36 - 10:39

Anyone who currently makes their living,
10:39 - 10:40

or even part of their living from
10:40 - 10:43

creating copyrighted work should be able
10:43 - 10:46

to have a road map for the future with
10:46 - 10:49

the inclusion of AI that allows them to
10:49 - 10:52

continue being successful. But AI also
10:52 - 10:56

shouldn't be some licensing gravy train
10:56 - 10:58

that takes authors or artists who are
10:58 - 11:01

making meager incomes from their work and
11:01 - 11:03

rewards them more handsomely on the
11:03 - 11:05

off-chance that it was their work that was
11:05 - 11:08

critical to some answer that ChatGPT
11:08 - 11:10

just gave. Just because companies like
11:10 - 11:13

Microsoft and Google have deep pockets
11:13 - 11:16

shouldn't radically change how we reward
11:16 - 11:18

and compensate those who create
11:18 - 11:21

knowledge in balance with the overall
11:21 - 11:23

societal good of sharing it. Now, it's
11:23 - 11:25

important to point out that there have
11:25 - 11:27

been some elements reported around the
11:27 - 11:30

training of AI models that suggest that
11:30 - 11:34

outright stolen works have been used.
11:34 - 11:36

This should be a line that isn't crossed,
11:36 - 11:39

and it is, and should continue to be
11:39 - 11:42

illegal. AI companies don't get to opt
11:42 - 11:43

out of the knowledge economy that
11:43 - 11:46

already exists, but their use should be
11:46 - 11:49

able to operate within existing guard
11:49 - 11:51

rails without too much change, and
11:51 - 11:53

without radically altering anyone's
11:53 - 11:55

protections. If you choose to publish
11:55 - 11:57

your content on the open internet, you
11:57 - 11:59

probably do so because it gives you some
11:59 - 12:02

advantage in life or business.
For example,
12:02 - 12:05

I publish here on YouTube, as it
12:05 - 12:07

helps me participate in the wider tech
12:07 - 12:10

community. I sometimes get clients from
12:10 - 12:12

people who watch my videos, and I gain
12:12 - 12:15

connections that are
relevant to my interests.
12:15 - 12:17

It's for me to assess whether
12:17 - 12:19

publishing on YouTube, which is open to
12:19 - 12:22

everyone, is what I want to do, but once
12:22 - 12:24

I've made the decision to put content
12:24 - 12:26

here rather than behind a paywall
12:26 - 12:29

somewhere else, I don't get to decide in
12:29 - 12:32

any substantive terms who gets to see
12:32 - 12:33

it and who gets to distill the
12:33 - 12:35

information I share. But there are
12:35 - 12:38

different paths. With my new book,
12:38 - 12:40

"Who's in the Co-Pilot Seat?", I've made a
12:40 - 12:42

different business decision. I've
12:42 - 12:45

published that book for commercial sale
12:45 - 12:46

and if someone wishes to have a copy,
12:46 - 12:50

they need to buy it. But again, once they've
12:50 - 12:52

bought or otherwise legally obtained a
12:52 - 12:54

copy, if they use the information that
12:54 - 12:57

book contains to their benefit, including
12:57 - 12:59

their financial benefit,
12:59 - 13:01

as long as they don't infringe my
13:01 - 13:03

copyright by regurgitating my content
13:03 - 13:06

verbatim, I don't get a say. And talking
13:06 - 13:09

of paywalls, these are odd things.
13:09 - 13:12

A while ago, ChatGPT had to remove its web
13:12 - 13:13

browsing function because it was
13:13 - 13:15

defeating paywalls and presenting
13:15 - 13:18

non-public information. However, to my
13:18 - 13:20

understanding, it's not like the AI was
13:20 - 13:23

seeing the paywall and hacking its way in.
13:23 - 13:25

The way that paywalls are
13:25 - 13:27

implemented is rather fragile in lots of
13:27 - 13:30

cases. Some websites publish the full
13:30 - 13:32

text for search engine indexing to get
13:32 - 13:34

up the Google rankings. Others allow a
13:34 - 13:36

certain number of articles to be read
13:36 - 13:39

before the paywall kicks in.
In some cases,
13:39 - 13:42

paywalls are really just presentational
13:42 - 13:44

elements on a page, rather than something
13:44 - 13:46

clever going on in the background of a
13:46 - 13:49

website. The point is that, in the cases
13:49 - 13:51

I'm familiar with, an AI's ability to
13:51 - 13:54

access paywalled content has come down
13:54 - 13:56

more to how the publisher has chosen to
13:56 - 13:58

publish their content in order to drive
13:58 - 14:00

the most traffic to it, rather than
14:00 - 14:03

some malicious act on the part of the AI
14:03 - 14:06

vendor. Perhaps if I left copies of my
14:06 - 14:08

new book on tables at coffee shops
14:08 - 14:11

across the city, I would sell more copies
14:11 - 14:15

or get more clients. And maybe not, but we
14:15 - 14:17

wouldn't consider anyone who picked up
14:17 - 14:19

one of those books and read it to be
14:19 - 14:22

stealing my content. The fact is that in
14:22 - 14:25

the online world, publishers often do one
14:25 - 14:27

thing with their content and then lean
14:27 - 14:30

on their terms of use to regulate it.
14:30 - 14:33

If you compared that to a largely similar
14:33 - 14:36

approach in the physical world, it simply
14:36 - 14:38

wouldn't stand up to scrutiny. I have no
14:38 - 14:40

business in what any consumer of my
14:40 - 14:42

content gets out of it, other than
14:42 - 14:44

wanting to make the best content. If you
14:44 - 14:47

take what I say or write, and turn it
14:47 - 14:49

into a business 10 times the size of
14:49 - 14:52

mine, good for you. I have no remedy
14:52 - 14:53

further down the road to come knocking
14:53 - 14:56

on your door to say, "Actually, when I
14:56 - 14:58

charged you $12 for my book, I really meant
14:58 - 15:01

$12 million, because you've done so well."
15:01 - 15:02

That just isn't the way the
15:02 - 15:04

transfer of knowledge works in our
15:04 - 15:06

society, and creating a road for it to
15:06 - 15:08

become the way the transfer of knowledge
15:08 - 15:11

works, in my opinion, is a very dangerous
15:11 - 15:13

path. For published works that aren't
15:13 - 15:15

accessible on the open internet, there is
15:15 - 15:17

already a perfectly suitable
15:17 - 15:20

non-copyright-infringing model that
15:20 - 15:22

could work today for AI training: that's
15:22 - 15:25

the public library. Libraries buy books
15:25 - 15:27

and other materials and are able to
15:27 - 15:29

share them widely in the pursuit of
15:29 - 15:32

maximizing society's knowledge. We see
15:32 - 15:35

libraries and their content as a public
15:35 - 15:37

good, and authors are pleased to see
15:37 - 15:39

their work in libraries for the most
15:39 - 15:41

part, rather than seeing it as theft.
15:41 - 15:44

Libraries are able to operate because of
15:44 - 15:46

both legal protections and licensing
15:46 - 15:49

agreements, and many are transitioning
15:49 - 15:51

into a more digital world where their
15:51 - 15:53

content isn't solely books on the shelves.
15:53 - 15:55

The concept of a library, both as
15:55 - 15:58

a source of knowledge to turn a
15:58 - 16:01

non-reader into someone who is well-read,
16:01 - 16:03

and a model for how protected content
16:03 - 16:05

can be shared without harming copyright
16:05 - 16:07

holders is, in my opinion, the most
16:07 - 16:10

readily relevant to the issue of both AI
16:10 - 16:14

training and AI model use. After all, if
16:14 - 16:16

someone who becomes a tech billionaire
16:16 - 16:17

shares a life story that involves them
16:17 - 16:19

starting with no books at home and
16:19 - 16:22

relying on the public library to allow
16:22 - 16:23

them to break barriers of social
16:23 - 16:26

mobility, they are lauded as an example
16:26 - 16:29

to us, not derided as a thief. Companies
16:29 - 16:33

like Microsoft, like Google, like OpenAI,
16:33 - 16:35

have the resources to build the fullest
16:35 - 16:37

libraries known to our civilization, and
16:37 - 16:39

that is what they should be encouraged
16:39 - 16:41

to do to train the tools that will be
16:41 - 16:43

the next chapter of our planet. And those
16:43 - 16:45

who are trying to poison the well of
16:45 - 16:47

knowledge should not be held up as
16:47 - 16:50

warriors for the rights of creators, but
16:50 - 16:52

as troublemakers against one of the
16:52 - 16:55

pillars of the betterment of society: our
16:55 - 16:57

ability to build on top of the knowledge
16:57 - 17:00

and creativity that has come before us.
17:00 - 17:02

Ultimately, this is something from which
17:02 - 17:05

we all benefit, and in my opinion, solving
17:05 - 17:07

this problem is as simple as opening some
17:07 - 17:09

new libraries that are as relevant to
17:09 - 17:12

the AI challenges of today as they were
17:12 - 17:15

to the literacy challenges
of years gone by.
17:15 - 17:17

It might require tweaking some laws
17:17 - 17:20

or tweaking how some things are licensed,
17:20 - 17:23

but in my opinion, the
model is already there.
17:23 - 17:25

What do you think? Have I got this
17:25 - 17:27

issue confused, or should we be looking
17:27 - 17:29

for a simpler and more universally
17:29 - 17:32

beneficial solution to what has become a
17:32 - 17:34

very complex question? Let me know down
17:34 - 17:36

in the comments. Thanks for watching
17:36 - 17:37

through to the end. I hope this was
17:37 - 17:40

useful to you. Until the next video.
17:40 - 17:41

Bye-bye.
17:41 - 17:45

[Music]
17:55 - 17:58

bye-bye

Title:: Who owns the content? The issue with generative AI and copyright
Video Language:: English
Duration:: 17:56

	OEVIDEOS edited English subtitles for Who owns the content? The issue with generative AI and copyright