-
Imagine you're chatting with your boss.
-
She asked you a question about an
-
approach to social media marketing.
-
You were just reading a book about this last
-
night, so the topic is fresh in your mind.
-
You're ready to answer, but what you say
-
is, "Sorry, my knowledge on this topic is
-
derived from a copyrighted work.
-
Here is a link to Amazon where you
-
can buy a copy of it to read yourself."
-
That would be weird, right? Your boss
-
probably wouldn't be too pleased, and
-
that's not the way the accumulation of
-
knowledge works for humans. But in the
-
debate about AI, a tool that, as our
-
co-pilot, is being thought of as an
-
assistant or a colleague in our work,
-
that is one potential road we may go
-
down because of our copyright laws and
-
because of the assertions
of copyright holders.
-
In this video, I want to dig into the
-
issue of training AI, and how should
-
the ownership of the material that's
-
used for training be thought about,
-
managed, and compensated. Since the
-
introduction of ChatGPT last year, the
-
issue of ownership of training data for
-
AI models has come up again and again.
-
We have seen various efforts through which
-
content owners, whether they be
-
well-known authors or celebrities,
-
artists, even Reddit and the New York
-
Times have tried to put up roadblocks—
-
legal and otherwise—to having their
-
content consumed by AI companies in the
-
training of their models.
-
There are good reasons for us to
-
be concerned about what data our AI
-
companions are trained on. A training
-
dataset consisting of
only the New York Times
-
versus only Reddit would clearly
-
lead to some very different
-
understandings around topics of
-
importance to our world. Unless the data
-
demonstrates a broad range of the
-
experiences we have and the concepts
-
we've come up with as humans, AI will be
-
limited in its ability to help us in a
-
relevant way. But the issue I'm talking
-
about today isn't really connected to
-
that quality issue, apart from my concern
-
that if rights holders to high-quality
-
work aim to have it extracted from our
-
AI models, what would be left? While there
-
have been various arguments against
-
using certain data for AI training, most
-
of them come back to one thing:
-
compensation. Those who assert ownership
-
rights over certain content that could
-
be useful to AI training datasets want
-
their slice of the pie that these
-
companies will generate with the AI
-
models they have trained. And given that
-
the AI economy could be worth up to
-
13 trillion dollars by 2030,
according to McKinsey
-
(which, if it were a country in its own
-
right, would be the world's third-largest
-
economy based on 2022 data),
who can blame
-
them, right? Well, it makes perfect sense,
-
apart from the fact that in most
-
cases, we don't license knowledge in this
-
way. A rights holder is completely
-
entitled to compensation when their work
-
is consumed, but they don't normally get
-
residuals based on derivative output,
-
except in situations where those
-
derivatives are substantially similar to
-
the original. Under copyright law, we have
-
concepts such as fair use, but we also
-
have thousands of years of human history
-
where we can see a line where our
-
knowledge today simply derives from
-
those issues that have been the subject
-
of work in years gone by. Let's do a
-
little quiz. I'm going to describe
-
something, and you go ahead and think
-
about what I'm describing. Maybe write it
-
down, maybe call it out.
See if you can guess.
-
So, number one: I'm green, I'm kind
-
of slimy, and I can jump really well.
-
What do you think that is?
Well, if you thought
-
frog, then you'd be right. Number two: I'm
-
covered in black and white stripes,
-
I look kind of like a horse, and I live on
-
the African plains.
What do you think it is?
-
Did you think zebra? Number three: I'm a
-
building on Pennsylvania Avenue in
-
Washington, D.C., where the
-
President of the United States lives.
-
Did you think White House?
-
But assuming you can get some or all of
-
those correct, can you tell me how you
-
know these things? Did you learn them
-
from a copyrighted work? Did you see them
-
in a copyrighted picture? Has the rights
-
holder given you permission or licensed
-
that information in such a way to allow
-
you to use it for your
benefit in daily life?
-
Think about everything else you know.
-
Are you sure that in your last
-
meeting at work, you didn't inadvertently
-
regurgitate something you read in a blog
-
for which you're not the rights holder?
-
If you excluded every piece of
-
information you've ever consumed that
-
was delivered by a book, or an article, or
-
a photo, or a video that may be owned by
-
someone else, how much
would you know now?
-
It's tricky, right? This is a knot that's
-
very hard to untie. While I can put
-
together a list of specific sources for
-
a lot of my content, there's a lot of the
-
flavor of what I talk about that's
-
simply built on years of knowledge.
-
I have no idea why I know many of the
-
things I know, and there certainly isn't
-
a demarcation in my brain for
-
copyrighted and non-copyrighted material.
-
But you're probably thinking, why is this
-
relevant to the issue of AI?
-
Well, while AI doesn't learn in the
-
same way humans do, its accumulation of
-
knowledge is in some ways remarkably
-
similar to ours. When we come across
-
someone who seems to know a lot and can
-
reference a lot of information, we will
-
sometimes describe them as being
-
well-read. In some ways,
GPT-4 is just the most
-
well-read intelligence on the planet—
-
it just happens to be an artificial one.
-
In general, when we describe a person as
-
well-read, it's as a result of them
-
showing off their knowledge, and it's a
-
compliment. I'm not sure whether there's
-
ever been a case where a human being
-
described as well-read has resulted in
-
them being questioned as to whether they
-
legally obtained the books they've been
-
reading, or had properly licensed the
-
knowledge they used.
Whether that well-read
-
person is a poor student or whether
-
they're a billionaire CEO, we just take
-
for granted that they have spent time
-
accumulating knowledge, and many of us
-
respect that, and we don't place a
-
knowledge tax on the benefits they've
-
accrued from that knowledge, financial or
-
otherwise. In my opinion, by focusing on
-
the value generated with knowledge and
-
attempting to find a way of sharing in
-
it, many content rights holders are
-
inadvertently seeking to upend centuries
-
of progress in how we educate and better
-
our society that was initiated with the
-
development of the printing press. And if
-
any modern author or artist claims that
-
anything they have ever produced is
-
truly original and not developed through
-
standing on the shoulders of centuries
-
of accumulated knowledge and art in our
-
society, then they're
just kidding themselves.
-
But more than trying to upend
-
the status quo, it gets worse. Have you
-
heard of subliminal advertising? This is
-
the concept that if you insert a message
-
or idea in a piece of media, maybe a
-
video or an audio ad, for example, that is
-
below the threshold of human conscious
-
perception, that your subconscious will
-
still pick up on it. The efficacy of this
-
approach to communicate anything is
-
debated, but there are limitations on
-
this sort of messaging in certain
-
jurisdictions, because ultimately, as
-
humans, we don't particularly like the
-
idea of our subconscious brains being
-
gamed by someone to their advantage.
-
Imagine if this approach were used in
-
the teaching materials used in schools.
-
Perhaps a textbook publisher takes
-
that page that taught you
that the White House
-
is the White House and secretly
-
inserts content that tries to convince
-
you that the structure is actually
-
Buckingham Palace in London. Maybe it
-
sends those particular textbooks to the
-
school districts that pay the least for
-
their materials, or it sends them to the
-
school districts that refresh their
-
materials less frequently than the
-
publisher would like. This ends up as a
-
penalty on knowledge based on the book
-
rights holder's perception of some
-
financial slight or other. You have a
-
bunch of children out there in the world
-
whose knowledge has been intentionally
-
poisoned for commercial reasons, and as a
-
society, we have no idea what the
-
long-term implications of that might be.
-
The good news is that, as far as I know,
-
that isn't going on in schools. The bad
-
news is that some innovative individuals
-
with issues with AI are attempting to do
-
it for AI model training by similarly
-
poisoning the digital images they
-
publish on the internet. This technology,
-
called Nightshade, was released
-
as an open-source poison pill by
-
researchers at the University of Chicago
-
back in October. Their goal is to give
-
visual artists and image publishers a
-
tool to protect their work by corrupting
-
the AI dataset with images that show
-
one thing but teach the AI to see
-
something else. But imagine if the
-
University of Chicago sanctioned
-
research that helped those textbook
-
publishers protect their intellectual
-
property by teaching a generation of
-
school children whose school boards have
-
not approved new book purchases that
-
zebras were actually frogs.
-
We would be outraged, and we should
-
be outraged by this research, as we have
-
absolutely no idea how far the long-term
-
consequences of such actions will go.
-
Now, I'm certainly not saying that rights
-
holders shouldn't be compensated. They
-
definitely should. And when we talk about
-
rights holders, we aren't just talking
-
about giant, wealthy corporations, we're
-
also talking about individuals, we're
-
talking about small businesses. I'm not
-
arguing that anyone should be made
-
worse off by AI, and
it's really important that
-
AI tools still provide us with
information like citations
-
to make it really easy to go off and
-
purchase or read the work if we need to.
-
Anyone who currently makes their living,
-
or even part of their living from
-
creating copyrighted work should be able
-
to have a road map for the future with
-
the inclusion of AI that allows them to
-
continue being successful. But AI also
-
shouldn't be some licensing gravy train
-
that takes authors or artists who are
-
making meager incomes from their work and
-
rewards them more handsomely on the
-
off-chance that it was their work that was
-
critical to some answer that ChatGPT
-
just gave. Just because companies like
-
Microsoft and Google have deep pockets
-
shouldn't radically change how we reward
-
and compensate those who create
-
knowledge in balance with the overall
-
societal good of sharing it. Now, it's
-
important to point out that there have
-
been some elements reported around the
-
training of AI models that suggest that
-
outright stolen works have been used.
-
This should be a line that isn't crossed,
-
and it is, and should continue to be,
-
illegal. AI companies don't get to opt
-
out of the knowledge economy that
-
already exists, but their use should be
-
able to operate within existing
-
guardrails without too much change, and
-
without radically altering anyone's
-
protections. If you choose to publish
-
your content on the open internet, you
-
probably do so because it gives you some
-
advantage in life or business.
For example,
-
I publish here on YouTube, as it
-
helps me participate in the wider tech
-
community. I sometimes get clients from
-
people who watch my videos, and I gain
-
connections that are
relevant to my interests.
-
It's for me to assess whether
-
publishing on YouTube, which is open to
-
everyone, is what I want to do, but once
-
I've made the decision to put content
-
here rather than behind a paywall
-
somewhere else, I don't get to decide in
-
any substantive terms who gets to see
-
it and who gets to distill the
-
information I share. But there are
-
different paths. With my new book,
-
"Who's in the Co-Pilot Seat?", I've made a
-
different business decision. I've
-
published that book for commercial sale
-
and if someone wishes to have a copy,
-
they need to buy it. But again, once they've
-
bought or otherwise legally obtained a
-
copy, if they use the information that
-
book contains to their benefit, including
-
their financial benefit,
-
as long as they don't infringe my
-
copyright by regurgitating my content
-
verbatim, I don't get a say. And talking
-
of paywalls, these are odd things.
-
A while ago, ChatGPT had to remove its web
-
browsing function because it was
-
defeating paywalls and presenting
-
non-public information. However, to my
-
understanding, it's not like the AI was
-
seeing the paywall and hacking its way in.
-
The way that paywalls are
-
implemented is rather fragile in lots of
-
cases. Some websites publish the full
-
text for search engine indexing to get
-
up the Google rankings. Others allow a
-
certain number of articles to be read
-
before the paywall kicks in.
In some cases,
-
paywalls are really just presentational
-
elements on a page, rather than something
-
clever going on in the background of a
-
website. The point is that, in the cases
-
I'm familiar with, an AI's ability to
-
access paywalled content has come down
-
more to how the publisher has chosen to
-
publish their content in order to drive
-
the most traffic to it, rather than
-
some malicious act on the part of the AI
-
vendor. Perhaps if I left copies of my
-
new book on tables at coffee shops
-
across the city, I would sell more copies
-
or get more clients. Or maybe not, but we
-
wouldn't consider anyone who picked up
-
one of those books and read it to be
-
stealing my content. The fact is that in
-
the online world, publishers often do one
-
thing with their content and then lean
-
on their terms of use to regulate it,
-
whereas a largely similar
-
approach in the physical world simply
-
wouldn't stand up to scrutiny. I have no
-
business in what any consumer of my
-
content gets out of it, other than
-
wanting to make the best content. If you
-
take what I say or write and turn it
-
into a business 10 times the size of
-
mine, good for you. I have no remedy
-
further down the road to come knocking
-
on your door to say, "Actually, when I
-
charged you $12 for my book, I really meant
-
$12 million, because you've done so well."
-
That just isn't the way the
-
transfer of knowledge works in our
-
society, and creating a road for it to
-
become the way the transfer of knowledge
-
works, in my opinion, is a very dangerous
-
path. For published works that aren't
-
accessible on the open internet, there is
-
already a perfectly suitable
-
non-copyright infringing model that
-
could work today for AI training: that's
-
the public library. Libraries buy books
-
and other materials and are able to
-
share them widely in the pursuit of
-
maximizing society's knowledge. We see
-
libraries and their content as a public
-
good, and authors are pleased to see
-
their work in libraries for the most
-
part, rather than seeing it as theft.
-
Libraries are able to operate because of
-
both legal protections and licensing
-
agreements, and many have transitioned
-
into a more digital world where their
-
content isn't solely books on the shelves.
-
The concept of a library, both as
-
a source of knowledge to turn a
-
non-reader into someone who is well-read,
-
and a model for how protected content
-
can be shared without harming copyright
-
holders is, in my opinion, the most
-
readily relevant to the issue of both AI
-
training and AI model use. After all, if
-
someone who becomes a tech billionaire
-
shares a life story that involves them
-
starting with no books at home and
-
relying on the public library to allow
-
them to break barriers of social
-
mobility, they are lauded as an example
-
to us, not derided as a thief. Companies
-
like Microsoft, like Google, like OpenAI,
-
have the resources to build the fullest
-
libraries known to our civilization, and
-
that is what they should be encouraged
-
to do to train the tools that will be
-
the next chapter of our planet. And those
-
who are trying to poison the well of
-
knowledge should not be held up as
-
warriors for the rights of creators, but
-
as troublemakers against one of the
-
pillars of the betterment of society: our
-
ability to build on top of the knowledge
-
and creativity that has come before us.
-
Ultimately, this is something from which
-
we all benefit, and in my opinion, solving
-
this problem is as simple as opening some
-
new libraries that are as relevant to
-
the AI challenges of today as they were
-
to the literacy challenges
of years gone by.
-
It might require tweaking some laws
-
or tweaking how some things are licensed,
-
but in my opinion, the
model is already there.
-
What do you think? Have I got this
-
issue confused, or should we be looking
-
for a simpler and more universally
-
beneficial solution to what has become a
-
very complex question? Let me know down
-
in the comments. Thanks for watching
-
through to the end. I hope this was
-
useful to you. Until the next video.
-
Bye-bye.