Berkeley Technology Law Journal Podcast: GenAI and Copyright

Published On
December 4th, 2023

SPEAKERS Yunfei Qiang; Paul Wood; David Fang; Heather Whitney

Podcast Transcript:

[Yunfei Qiang] 00:13

Hello and a warm welcome to all our listeners tuning in to the Berkeley Technology Law Journal Podcast! I’m your host, Yunfei Qiang.

In this episode, we’re venturing into the dynamic and contentious realm of generative AI and its collision course with copyright law.

Generative AI is a cutting-edge form of artificial intelligence with the extraordinary ability to craft new content – be it visual art, text, videos, and even lines of code. With the remarkable capacity to create comes numerous intellectual property concerns, ranging from allegations of copyright infringement to the complex question of ownership over AI-generated works.

At the heart of copyright law is the challenge of balancing the promotion of progress in art and science with the rights of copyright owners. Right now, the AI legal landscape is a hotbed for litigation. Artists, authors, music labels, and other copyright owners have all filed lawsuits against AI companies for using copyrighted works to train their models.

Our journey today extends beyond just dissecting these legal challenges. We’ll explore the broader implications – the roles of various stakeholders and the future trajectory of policies in this space. What responsibilities do model developers have? How might the future of AI policy unfold?

We are thrilled to have two distinguished guests to guide us through these questions: Heather Whitney and David Fang.

Heather, an attorney at Morrison Foerster, specializes in technology transactions and artificial intelligence. Her extensive background includes clerking for Judge Diane P. Wood of the U.S. Court of Appeals for the Seventh Circuit, being a Bigelow Fellow and Lecturer in Law at the University of Chicago Law School, a Visiting Researcher at Harvard Law School, and a Faculty Affiliate at the Berkman Klein Center for Internet & Society. She was also an instructor for Harvard Law School’s CopyrightX course.

Joining Heather is a Berkeley Law alum and former BTLJ editor, David Fang. His expertise in technology transactions and AI is complemented by a computer science degree from UC Berkeley. David advises clients at Morrison Foerster on a range of technology and intellectual property matters, including generative AI and open source licensing issues.

Prepare to dive deep into this enthralling discussion as we unravel the complex tapestry of AI and copyright law.

[Paul Wood] 03:13

Welcome, Heather and David, and I want to thank you so much for joining us on the BTLJ Podcast. I’m really excited to speak with you both about generative AI and sort of the challenges that this exciting new technology is bringing to copyright law.

[Heather Whitney] 03:25

Yes, we’re very excited to be here.

[David Fang] 03:27

Excited to be here.

[Paul Wood] 03:30

Awesome. To kick things off, I’m hoping we can start by giving our listeners a brief explanation on the technology we’re going to be talking about today. David, could you briefly introduce our listeners to generative AI and its significance in today’s tech landscape?

[David Fang] 03:46

Yeah, definitely. So I always think, you know, when any new technology comes out, there’s a lot of hype, there’s a lot of morphing it into what people want it to be. And it’s important to understand, like what actually was developed in this new wave of AI technology. So, you know, AI has been around for a while, we’ve seen it in a variety of ways in existing technologies, like something simple, like a spam detection filter in our email, or how major internet platforms serve ads to users. That area of AI historically has been using something called supervised learning, where you have a lot of data that’s labeled, it’s given as an input, and it’s ideally used to generate an output, for example, will this user click on this ad or not? Now in the last year, especially with ChatGPT, DALL·E, and all these new innovations, what was developed was two new things. One is a large language model, powered by a transformer. What that essentially is, a model that has the ability to predict language based off of training on large amounts of language. So that’s like a unique model architecture that was developed recently. The second one is the ability to generate images or content via prompts. And that was a new technology called the diffusion model.

So it’s really just these two new things, the ability to generate text based off of predictions, and the ability to generate images based off of prompts. And it’s as a result of two innovative new model architectures. And I also want to say like, these model architectures aren’t necessarily something that was developed this year, either. I think, like the transformer model was developed by the Google team, back in 2017,¹ and we’re just recently starting to see the commercialization of all that technology now, where users are really getting access to these tools, and are able to see the cool things that they’re able to do with them. So really, AI, as much as it may have seemed to a lot of us to be like a sudden explosion, really, it’s been this long historical development over many, many years. And we’re really just starting to see the fruits of it right now. And you’ll start to see that continue, as well, over the next 10 years as we have these new and exciting tools. And I think the other key difference with these tools is they’re starting to do things that typically we would expect someone in a fairly highly educated job or white collar job to do, for example, like generating freeform text that displays some logic and synthesis of information, or an artist, for example, where you’re working on content, or a software engineer where you’re wanting to generate specific code. So that’s the next. It’s a paradigm shift, in some sense, where historically we’ve thought of automation in terms of manual automation, like a factory, or having certain jobs done by a machine. But now we’re starting to see that things that you develop in, like, a desk job, for example, those things can also be potentially automated as well. So that’s at a high level, the landscape of what it is now and you can see why there’s both a lot of excitement and a lot of fear all over the place, in reaction to this technology.
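To make the supervised learning that David mentions at the start of his answer a little more concrete, here is a minimal sketch of a spam filter learned from labeled examples. It is only an illustration: the function names `train` and `predict`, the four example messages, and the word-counting rule are assumptions made for this sketch, not how any real email provider works.

```python
from collections import Counter

def train(examples):
    """Learn from labeled data: count the words seen in spam vs. non-spam ("ham")."""
    spam_words, ham_words = Counter(), Counter()
    for text, label in examples:
        target = spam_words if label == "spam" else ham_words
        target.update(text.lower().split())
    return spam_words, ham_words

def predict(model, text):
    """Score a new message by whether its words were seen more in spam or in ham."""
    spam_words, ham_words = model
    score = sum(spam_words[w] - ham_words[w] for w in text.lower().split())
    return "spam" if score > 0 else "ham"

# Hypothetical labeled training data: each input comes with the answer we want.
labeled = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting notes attached", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

model = train(labeled)
print(predict(model, "claim your free money"))
print(predict(model, "notes from the meeting"))
```

The point of the sketch is the shape of the task: labeled inputs go in, and the model produces a yes-or-no output for a new input, which is different from the open-ended generation discussed in the rest of the episode.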

[Paul Wood] 07:36

Definitely, with these two new processes becoming more commercialized, as you said, more commonplace, I imagine that’s where it starts to intersect with copyright law. And I’m wondering, Heather, if you could tell us from your perspective, just some of the basics of that field of law, so we can ground the rest of our conversation on those principles. And what’s the purpose of copyright law?

[Heather Whitney] 07:38

Yeah. So copyright law, it’s for listeners who don’t know . . . I’m just gonna assume you don’t know anything about copyright. There are two main intellectual property types that people think about when it comes to inventions and works, and that’s patents and copyrights. Copyrights have to do with original works of authorship. So not inventions, original works. And so the question is just what those works are, and what works are covered. And there’s a list of those kinds of works. But in general, I think the way to think about it is, copyright protects your interest in what is considered the protectable expression, the copyrightable expression that you have contributed to a work that you have created, right? So like an easy example, like a novel, right? You wrote the novel, there’s a lot of choices that you made, that were creative choices that show your creative expression in that work. And then that is protected. But not sort of high level ideas, which will end up being relevant when we talk later. So think about Harry Potter: JK Rowling, tons of creative expression, those parts are definitely protected. The idea of a boy wizard going to a magical school? Not protected. That’s not really creative expression. And so that’s the high level of what’s protected: original works of authorship. And then you get into the cases and the questions about what that really means in this context, which we can talk about if you want. But I think that’s good enough to get us going.

And for motivations. So different countries actually have different motivations for why they have copyright protection. The United States has taken a distinctly utilitarian approach, which is to say that the reason that we provide these rights to people, copyright, is because we believe that it actually results in more good for more people, if we give those rights for a limited period of time, and then those things enter the public domain. Other countries have other values that also motivate why they have those rights and can explain why there are some differences between the rights of different countries, like this thing called moral rights. In some countries, particularly in Europe, the idea that your work expresses who you are, it’s kind of this Hegelian concept of your personality and identity being in that work is much more prevalent. And so there’s much more of an idea that the law should be protecting you as an artist’s ability to stop other people from sort of messing with your works and modifying them in particular ways. But that’s not so much a thing in the United States, we do have some protection. But it’s pretty minor compared to other countries. So I would just think of it as we give these protections because we think that we’re all better off by giving some protection than by not giving any protection.

[Paul Wood] 10:55

Okay, I want to focus on something you said there about the concept of original authorship, this might be a contentious question. But currently, where’s that line being drawn? Certainly, I don’t think anyone would say if I took a photograph of a painting that someone else made, that’s my original work.

[Heather Whitney] 11:14

Yeah, that’s great. So there’s a couple pieces there. So the first is what does original mean? And then there’s a question of what does work of authorship mean? Original ends up meaning two things. One is that you didn’t copy it. So it’s sort of original to you. And the other is that there is this modicum of creativity, there’s a little bit of your expression in the work. Those are the two requirements for a work to be original. So an example of where that’s not met is going to be a case where you may have made a choice that you might argue had some expression in it. But it doesn’t have enough to basically make it over that very low threshold. So putting something in alphabetical order, probably not going to be enough in most contexts, because that’s a really expected, not-enough-creativity kind of a thing. But most things other than that make the cut. When it comes to a work of authorship, that’s where there has been a lot of discussion right now in the generative AI space because there were two kinds of questions here. One had to do with the question of whether or not an AI can be an author. Because it has to be a work of authorship. The question is who can be the author of that work of authorship? So that was one question. That ends up being not a particularly interesting question, but we could talk about it in a second. And then the other question is what it takes for a human who thinks of themselves as an author of a work that they’ve created using these kinds of tools. What does it take for them to kind of get over the line, and for that to be considered their work of authorship?

So on the AI as author piece, the Copyright Office has been very clear, the courts have been very clear, that an AI cannot be an author for purposes of copyright. You have to be a human. So we could say instead, now, that an original work of human authorship is what’s required. So that’s sort of done, I would be very surprised to see that change without some sort of legislative intervention. That piece is not that interesting. And in reality, it’s not that interesting, because most people are not trying to register works where they are going to say that it was like 100% made by artificial intelligence, right? No one’s saying that. They think that they are the copyright holder, they are the author, and they are trying to get protection for their own work for their own benefit, right? So it’s very rare that this happens, it’s more like you just want to make a statement about AIs being authors. On the human side, that’s where it really is very, very unclear what the answer is going to be. Right now, the Copyright Office has essentially taken the position that the delta between what you put in as an input and what the output is, for these models, is not copyrightable. Because it was essentially authored by the model, by the program that you’re using. So people are pushing back against this. And we can talk about a few cases where that’s happening. But that’s the general position that they’ve taken. They basically published a notice where they were looking for people to give responses and their feedback on the way that that works. And so maybe that will change, but I would be surprised to see them move very much on that.

[David Fang] 13:34

And also, this is where the understanding of copyright law and also the understanding of how the technology is used mesh together. Because, for example, on that issue that Heather was talking about, in terms of like “who is the author?” or like, “what is that delta?”, you have to actually understand how an engineer is using ChatGPT to develop code and, you know, making certain changes, or how an artist is using Stable Diffusion to basically generate images, and what level of control they have over those tools, and analogize to things that people are familiar with today. That’s where, you know, lawyers with the ability to both dive into the technology, and the ability to understand the law as well can really help navigate this space. Because it is an evolving and challenging field. And like we see with any technology, there’s a lot of different stakeholders, a lot of people with different interests. So being able to understand what’s going on, and how to represent those interests is super important at this time.

[Paul Wood] 15:43

I want to touch on your ending note there, David, about understanding more about the technology at play here. As I understand there’s two main areas of contention here: there’s the input, and whether they are allowed to make use of copyrighted material. And the output, that’s a separate issue. Why don’t we go through those one at a time? Could we talk a little bit more about the input process? What does it look like? How do we get these programs off the ground? How do they start functioning?

[David Fang] 16:15

Yeah, so I’ll take even one more step back. Like even before we even get to the input, I think there’s basically three categories of where potential risks can lie in terms of legal risks. The first is in the training or the actual development of the model before it even reaches a consumer. So, with any type of AI model, a large training process needs to happen beforehand, especially with, for example, large language models like ChatGPT. That training process can involve tons and tons of data. So what you’ll commonly see, for the initial versions of these large language models, you’ll see them using datasets from a source called Common Crawl, which is a nonprofit that’s basically crawled the web for 10 plus years, and gathered like text information from the internet. You’ll see things like, you know, Wikipedia or Reddit, a little bit more human-curated datasets. So there’s firstly the issue of, is the action of training itself potentially infringing someone else’s copyright as an unauthorized reproduction, or unauthorized creation of a derivative work in the model, something like that. So that’s number one. The second thing is once you have this model, now you have both the user providing an input to the model, and then the model generating an output. So, from the input standpoint, that’s something within the user’s control, right? For example, if I wanted to create an image that’s really similar to Mickey Mouse, I could give an image of Mickey Mouse, and then maybe the model understands it in some ways, and it generates another similar image of Mickey Mouse. So, there’s the input, as in like, there’s the user directing infringing activity. And then there’s the model generating potentially something that’s an unauthorized copy as well. And having understood that, like, who has the ability to control these various steps, then that’s where vendors and providers of these models will take steps to try to mitigate the potential legal risks.

[Paul Wood] 18:44

How vast does the amount of data they need have to be? How comprehensive do they need to be for a model to function?

[David Fang] 18:55

Yeah, a lot, to put it very short. To take another step, these new models have been called foundation models. And they’re called foundation models because they’re so generally trained that they have the ability to adapt to a variety of use cases. So just like you and me, we have to learn a wide variety of things. But then when we face a new situation, we use our previous learning to adapt to that new situation, it’s not that we’ve seen that new situation and we have trained exactly on it. So in order to be able to do that, especially for something like language, you have to learn human language. And also like, what’s a question? What’s an answer to a question? In order to do that, machines just learn in a very different way than people, you need billions of examples, largely. A very large amount of examples, and ultimately, like what the model is doing, in its training process, it’s predicting the next word, or the next token, basically, where you’re giving it all these inputs, it’s seeing a first part of the input, and it’s trying to figure out what would most likely follow, and over billions and billions of rounds of training, over and over again, it eventually develops an understanding of the relationship between words. And then using that somehow, and this is, I think, where a lot of the magic happens. And, you know, people aren’t sure exactly why. But for some reason, when you get to some order of magnitude in training, then these models become generally adaptable, right? Where they seem to have understood or have recognized some type of general relationship between what people want, as an input and an output. And they’re able to do that without training on something very specific. So you can see, it’s a pretty complex task. And because of the complexity of the task, you need a lot of data. And that feeds back into the legal issues where to create these truly revolutionary models, you need content from essentially everyone.
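The "predict the next word" training that David describes can be sketched in miniature. What follows is a toy illustration only: it counts which word follows which in three made-up sentences, whereas real large language models use transformer networks trained on billions of examples. The function names and the tiny corpus are assumptions for this sketch.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """For every word, count which words were seen immediately after it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw counts for one word into probabilities for the next word."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # the word never appeared with a follower in training
    return {w: c / total for w, c in followers.items()}

# Hypothetical three-sentence "training set".
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

counts = train_bigram(corpus)
print(next_word_probs(counts, "the"))
print(next_word_probs(counts, "sat"))
```

Even at this scale the sketch shows why volume matters: the model knows nothing about a word it has never seen followed by another word, so coverage depends entirely on how much text went in.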

[Heather Whitney] 21:28

Yeah, if you think about this as just another thing for listeners, imagine you have a computer program that has no knowledge of anything. Nothing, nothing. It’s starting as a blank slate. And you are creating something that people can ask questions. And it can actually produce answers that turn out to be relevant in a lot of cases. You can see how much training data there has to be in order for there to be something that does not understand anything, right, but being able to, by merely predicting the next words, create something like that is incredible, right? When they were creating these models originally, the idea was not, “Oh, I’m going to create this thing like a ChatGPT tool, where I can just answer people’s questions.” The fact that it could do this is, in some sense, an emergent property that came out of what happens at a certain level of being able to predict. So it’s just absolutely massive amounts of stuff. But it is also important for what David was saying. The difficulty here is that we constantly use these sorts of human words like understanding and training and learning and all these things, but what we’re talking about in this context is just a radically different thing than what we mean when we talk about a person learning or training or anything like that, right? These models do not understand the outputs that they are generating, they are merely predicting the next things in a line. And to us, it looks like it has a lot of meaning because we’re looking at something that looks like a person wrote it. But in reality, for the model itself, there’s no meaning there. Right? It is a very interesting thing to think about. But that’s, I think, important. And it’s one of the reasons why sometimes you get these really weird outputs, things that are not true. It does not understand; it is just generating predictions. And so it relies on its training data to understand what the appropriate prediction should be.
And so it is completely dependent on that, in order to sort of understand. What it is going to think is the next thing in this line depends on what it was trained on. If it was trained on just Reddit, imagine that there were billions and billions of pieces of Reddit. Its output and its belief about what would be next can be dark, right? Because it’s Reddit. But if it’s using something different, like Wikipedia, or happy places on the internet, it would have a different view about what it predicted is the next thing. So this is why the training data matters, in terms of both its accuracy, but also in terms of sort of what are the values that are going to be articulated in those outputs, because it is reflecting what it has been trained on.
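Heather’s point that the output reflects the training data can be illustrated with a toy sketch: the same counting procedure, given the same prompt word, predicts different next words depending on which text it was built from. Both miniature corpora and the function name `most_likely_next` are invented for this illustration; they do not represent Reddit, Wikipedia, or any real dataset.

```python
from collections import Counter

def most_likely_next(corpus, word):
    """Return the word that most often follows `word` in the given corpus."""
    followers = Counter()
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            if prev == word:
                followers[nxt] += 1
    if not followers:
        return None  # the word never appeared with a follower
    return followers.most_common(1)[0][0]

# Two hypothetical training sets with different tones.
grim_corpus = [
    "the weather is awful",
    "the weather is awful today",
    "the food is fine",
]
sunny_corpus = [
    "the weather is lovely",
    "the weather is lovely today",
    "the food is fine",
]

# Same procedure, same prompt word, different training data.
print(most_likely_next(grim_corpus, "is"))
print(most_likely_next(sunny_corpus, "is"))
```

Nothing in the procedure prefers one tone over the other; the difference in the prediction comes entirely from the text it was given.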

[David Fang] 24:02

I think to add on to that. One of the initial use cases for ChatGPT, for example, was people trying to replicate search. And in my opinion, search is the wrong way to think about these technologies. Like when we have a Google search, you have a large amount of indexed information, and you’re searching for something that matches, like it’s retrieving information that it can then output to you, versus what these models are doing. Like what Heather was saying is that they’re generating predictions, it’s not like, there’s some fixed stored information that it’s saying, “Okay, this is what you’re looking for, I’m gonna go retrieve it, and then give it to you.” It’s more that the model has developed a sense of probabilities, based off of input, and what it thinks should be the next likely word, the maximal probability, it will output that. And then it keeps doing that over and over again until it generates the response. So, it’s a much different framework in thinking about these things. There’s not like a set world of things. And in some ways, even though it is always tempting to use human analogies because that’s how we relate and see and understand things, it is almost, in a sense, how you and I think about things as well, right? We don’t have in our minds, like a very set thing that we can go necessarily retrieve directly. Sometimes it’s a mix of a lot of things. Right? So in some ways, maybe that’s why the models are able to generate things that we expect humans to, because it’s trained in a way to output what we would expect a human to output.
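The loop David describes, pick the most probable next word, append it, and repeat, can be sketched as follows. This is a simplified illustration: the probability table is hand-written and hypothetical, whereas a real model computes these probabilities from the whole preceding text, and real systems often sample rather than always taking the maximum.

```python
# Hypothetical next-word probabilities. No complete sentence is stored anywhere;
# there are only per-word probabilities for what might come next.
NEXT_WORD_PROBS = {
    "<start>": {"fair": 0.6, "copyright": 0.4},
    "fair": {"use": 0.9, "game": 0.1},
    "use": {"is": 0.7, "<end>": 0.3},
    "is": {"a": 0.8, "not": 0.2},
    "a": {"defense": 0.6, "right": 0.4},
    "defense": {"<end>": 1.0},
}

def generate(table, max_words=10):
    """Build a response one word at a time, always taking the likeliest next word."""
    word, output = "<start>", []
    for _ in range(max_words):
        options = table.get(word)
        if not options:
            break  # no information about what follows this word
        word = max(options, key=options.get)  # the "maximal probability" choice
        if word == "<end>":
            break
        output.append(word)
    return " ".join(output)

print(generate(NEXT_WORD_PROBS))
```

The contrast with search is visible in the data structure: a search index would store whole documents to retrieve, while this table holds only probabilities, and the response is assembled step by step.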

[Paul Wood] 25:58

So I’m gathering from our conversation that these data sets for training are hugely important and require possibly incomprehensible amounts of data. And ideally you want some sort of human input. David, I think you mentioned Common Crawl earlier, looking for stuff on the internet to feed this model. How does that sort of implicate fair use if we’re making use of copyrighted works to help train a model?

[Heather Whitney] 26:26

Yeah, so this is the subject of lots of ongoing litigation. So to explain to the listeners, what does it really mean to have a copyright? What is copyright? It is, you can think of it, if you were in law school, as a bundle of these sticks of different rights that you get to have. And when we talk about rights, it’s more like a right to stop people from doing something versus a right for you to do something. So one of those rights is the right of reproduction, which means that you have the exclusive right to make reproductions of your work. Another one is to prepare derivative works. Derivative works are essentially like spin-offs of the thing that you have currently made. Harry Potter fanfiction is an example of this. There, the fanfiction relies on some of the creative expression from JK Rowling. And then they’ve added their own to make the sort of fan versions of like, what’s going to happen after the end of the books, whatever. So those are two of the ones that are important. There are also ones about sort of transmitting and that sort of stuff, but let’s just focus on these ones. So “fair use” is an affirmative defense that defendants put forward when they have otherwise violated one of the copyright holders’ exclusive rights with respect to their work. And they’re basically saying: “It’s true that I did this thing, but it should be allowed.” And the law has this concept of fair use. There are different policy justifications for this. There are some cases where we actually want to limit what the copyright holder can stop other people from doing, because we think the goods out of that thing are sufficiently big that we should just let it happen. So that’s what fair use is basically doing, it is an affirmative defense when you’ve done something that would otherwise be infringing. And the way that it is looked at is there’s four defining factors.
And then there can be additional factors that courts and juries can look at, to decide whether or not they want to say that a particular use is a fair use. So in other words, we’re going to not hold this person liable as a copyright infringer because we think that it sort of meets these requirements. So when it comes to fair use in the question of training data, the question is basically: should the training of a model, depending on what the model is and depending on what it’s doing, should the reproduction of people’s copyrighted works without permission be permitted in the training process? Because we think that the value, essentially, that’s coming out of creating that model is sufficient that we should basically make an exception. And there’s a couple of things about the massiveness of training data that go into the question about fair use, but one of the things that’s important, and it depends on your understanding of the technology, is: are you creating copies? What is really happening there is you’re creating copies in the training process in order for these models to understand just the relationships among words, right? So in most of these cases, the idea is that the creative expression, the part of a work that really is protected by copyright, is not really what it’s aiming for. It’s really aiming for a huge amount of training data in order to learn across all training data, basically, to be able to predict what should come out next. So that goes into basically what is really being used from these works. Another piece that is relevant, we could go through all of these, but one of the big questions in fair use ends up being the purpose and character of the use, whether or not it is transformative. The Supreme Court’s recent decision in the Warhol case also mentions that, in certain situations, commerciality can also be more important than perhaps it was in the past, particularly when the use is not transformative.
So it’s not transformative, and on top of that, it is commercially used. In this case, most people, certainly in this area, take the position that the purpose and character of the use in the context of creating and training general LLMs (Large Language Models) and other tools is transformative. Because you look at what you’re talking about, right? Like somebody’s random post on Reddit, or someone’s picture that they posted on the internet. And you compare what that thing is with the absolutely radically new thing that you have now created, which is this very sophisticated prediction model. And they say that that is transformative. And there are examples of people throughout other cases where you have these new kinds of technology. And they’re using other people’s copyrighted content as part of that new technology, and saying that it is also transformative. So there are a bunch of different factors that basically come into this. A couple that matter the most are, like I said, transformative-ness and the effects on the market, which we’re going to talk about, if you like, but that’s sort of what’s happening right now in a bunch of litigation: whether or not that fair use affirmative defense is going to work.

[Paul Wood] 31:52

So, taking those lawsuits into consideration, it seems like, if I understand you right, that it might be considered transformative, but in the hypothetical where it’s not, and they’re barred from using copyrighted works, what kind of alternatives are they left with for training data?

[Heather Whitney] 32:10

So there are examples, right? So Adobe’s Firefly was trained, they have said, only on either things that are in the public domain, which means that the copyright is not an issue, because it’s public domain, or works where they actually got affirmative licenses from the copyright holders. Really, in that case, my understanding is that they have a very, very, very broad set of rights to use the images that users give for their stock photo marketplace, which included the right for Adobe to actually use those in order to train. And so it is possible to create a generative AI model that basically works without using works you lack authorization for. Whether or not the quality is going to be the same for that model is something that is discussed. I think the general consensus was that Firefly was not as good as, say, the other options that were out there that were trained on much more data, including data that people probably didn’t have authorization to use. But that is the other option. Another possibility is that moving forward, as the technology continues to advance, it may also be the case that in the future, you just need a lot less training data, because they’ve gotten better at making models more efficiently. So there are a couple of different things that can play into what we will see, I think, in the future in that space.

[David Fang] 33:47

Yeah, I think another way to think about it is, the question isn’t really, “what’s the alternative to copyrighted works,” it’s more, “if this is infringement, you would need a license,” basically, right? You still want to use the same content, you would need a license to it, or you’d have to use things where copyright protection has expired, which is the public domain. Another interesting thing to think about is, as more people use generative AI models to create content, and that content’s on the internet, that content may not be protectable by copyright, because it’s AI generated, or it’s some mesh of it, right? Looking forward, there could be a lot of new content on the internet that is actually AI-generated. And then people might use that to train an AI model, and that will have other ramifications going forward. But it’s something interesting to think about now, as we adopt more of these models to ultimately create the content that may be used to train in the future. What does that mean?

[Heather Whitney] 34:57

And you can also think about the question about this licensing issue, right? If you think about it, it’s actually a very hard collective action problem, and in many cases, fair use is really designed to sort of deal with these kinds of problems. But if you want to train on every random post on the internet, because it’s really helpful for being able to predict, and you can just have basically an unlimited amount of data, the idea of being able to get a license from each person for each random thing that they have posted on the Internet is effectively impossible. You cannot go out there and find that, it’s not a doable thing. And so you need something to step in, if you want people to be able to have the ability to train on that kind of stuff; either you’re going to have to have a fair use sort of exception, or there has to be some very radical change, big change, in how you are able to gather people together, in order for them to sort of license their material, the endless amount of crap that you’ve posted on the Internet. So that is one of the main challenges in this space, is that we each have just a little bit of stuff on the internet relative to everyone else. But having all of it together is very useful for the development of these kinds of models.

[David Fang] 36:11

And I think that also relates to who the parties actually suing in the recent ongoing lawsuits are: they tend to be parties that are either aggregating a lot of individuals, like the Authors Guild, for example, or large copyright holders with very valuable proprietary content sets, like Getty, for example. So there’s definitely that collective action problem. And the people that are raising the issue now are the people who either have enough of this content and revenue to bring these claims, or something like a joint interest where someone’s representing a lot of people.

[Paul Wood] 36:57

David, I want to go back for a second to something you were talking about a couple of minutes ago, about how these outputs from the models can now be out there. I think I came across something related to that concept called “synthetic data.” Is that something you could maybe speak to a little bit?

[David Fang] 37:16

Yeah, so synthetic data is just a way of saying computer-generated data that’s used for training an AI model. Typically, you want to use data that’s collected from the real world, for example, from real-world interactions, and theoretically, what an AI model is doing is understanding the statistical properties within that data set. So what synthetic data is trying to do is, if I don’t necessarily have rights to that data, or there are some privacy considerations, or there’s just not enough of this data, right, we need to fill in a gap. The only way to generate that is to have a computer also understand the statistical representation of what the real-world data is, and then generate new data based off of that, hoping that it will reflect what the real world actually is. So that’s what synthetic data is. It’s basically, when I don’t have enough training data, or I can’t obtain it, how do I generate something to fill that gap? And the interesting thing is there are use cases of these generative AI models creating synthetic data as well, especially with text data, right? Instead of having someone review and write all of these things over and over again, or creating a really complex model to do this, you have a large language model take the role of someone that’s evaluating these outputs, if they’re potentially used as synthetic data to train the model.
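
As a rough illustration of the idea described here, the sketch below estimates the statistical properties of a small "real" data set and then samples new, computer-generated values from that estimate. It is not something the speakers presented: the measurements are made up, the function names `fit_gaussian` and `generate_synthetic` are hypothetical, and a single normal distribution is assumed purely for simplicity. Real synthetic-data pipelines use far richer models.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Summarize the real data by its sample mean and sample standard deviation."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def generate_synthetic(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to the real data.

    Assumption for illustration only: the real data is roughly normally distributed.
    """
    mu, sigma = fit_gaussian(real_values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Made-up "real world" measurements; there are only eight of them.
real = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

# Fill the gap with 1,000 synthetic values that mimic the real data's statistics.
synthetic = generate_synthetic(real, n=1000)

print("real mean:     ", round(statistics.mean(real), 3))
print("synthetic mean:", round(statistics.mean(synthetic), 3))
```

The synthetic values can only echo whatever the fitted summary captured, which is why the hope that they "reflect what the real world actually is" depends entirely on the real data that went in.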

[Heather Whitney] 39:05

Yeah, and I will say one more thing, because I think the synthetic training data thing is super interesting. I think it’s very important, when people talk about synthetic data as a solution, to identify what problems you think the synthetic data is solving. Because one of the issues with synthetic data is, in order for anybody to make fake data, synthetic data, you are already using some ground truth that you have decided on about what the world looks like, or what whatever it is that you want is supposed to look like. And now you’re just sort of making it play out. As an example, right? Say that I’m doing a research experiment, and my research is on public opinion on a certain topic. And I’m lazy. So I go out there, and I get it from interviews with 20 people. But my project is supposed to be understanding what the city of San Francisco thinks about this topic. So what I can do is, I can say: “Oh, yes, well, I already have these 20 examples, so I’m just going to generate basically all of San Francisco based on those 20 examples.” But of course, the idea that those are going to be representative is the problem: the whole point of getting lots of different actual people to give you their information is that it might turn out that people have very different views in different areas. And you would not collect that, you would not see that, if you only looked at a very narrow neighborhood, right? And so that is also the problem with synthetic data: is the sample set that you’re using as your seed in order to create this data already representative of the thing that it ought to be representative of? Because it’s not going to magically cover things that are not already represented in that data. So the privacy thing can make sense, because you can take out certain kinds of personally identifiable information and then create data. But in other areas, about sort of representing the world, it’s not going to do what people sometimes seem to think that it’s going to do.
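
The 20-interview example can be played out in a small simulation. This is only a sketch under invented assumptions: the neighborhoods, the support rates, and the population sizes are all hypothetical and are not data about San Francisco or any real survey. The point it shows is that synthetic respondents generated from a narrow seed sample reproduce the seed, not the city.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical city: the share of residents in each neighborhood who support a proposal.
support_rate = {"north": 0.95, "south": 0.20, "east": 0.50, "west": 0.30}

# Simulate 5,000 residents per neighborhood as (neighborhood, supports?) pairs.
city = [
    (hood, rng.random() < rate)
    for hood, rate in support_rate.items()
    for _ in range(5000)
]
true_support = sum(supports for _, supports in city) / len(city)

# The "lazy" survey: 20 interviews, all taken in a single neighborhood.
north_residents = [supports for hood, supports in city if hood == "north"]
seed_sample = rng.sample(north_residents, 20)
seed_rate = sum(seed_sample) / len(seed_sample)

# "Generate all of the city" from the seed: the synthetic respondents can only
# reflect the statistics of the 20 interviews they were built from.
synthetic = [rng.random() < seed_rate for _ in range(20000)]
synthetic_support = sum(synthetic) / len(synthetic)

print(f"true city-wide support:   {true_support:.2f}")
print(f"support in seed sample:   {seed_rate:.2f}")
print(f"support in synthetic set: {synthetic_support:.2f}")
```

Generating more synthetic respondents does not close the gap; it only makes the estimate of the seed sample's own opinion more precise.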

[Paul Wood] 41:03

So interesting to note that that’s not really going to be a silver bullet solution here. But I was wondering what your guys’ take would be based on the number of models that are already out there that have been trained using copyrighted data. Isn’t it a little too late to solve this problem for litigation, is it too late to sort of put the toothpaste back in the tube? If they decided that you weren’t allowed to use copyrighted materials as part of training data, even if you were to use synthetic data, isn’t that partially coming from already copyrighted material at this point?

[Heather Whitney] 41:38

So there’s two pieces. On the latter piece, which is about using synthetic data instead of real, or organic, data, one of the reasons that you might think that would be safer from a copyright perspective is that you might think there’s a different fair use analysis if I’m taking your content, and then I’m using it to generate synthetic data that’s further removed from your content, and then I’m using that to train the model, right? And so that might be, for a variety of reasons, a better fair use argument than just saying, I’m going to use your exact data in order to directly train the model. So that could have some influence. At the end of the day, I think it just depends on what the relationship is between the synthetic data and the data that you’re talking about, and whether it actually has copyrightable expression in it at all anyway, but that’s like a whole thing. But in terms of putting the genie back in the bottle, all this kind of stuff, I think that the biggest challenge with this whole “putting it back” is the open-source part. Because you already have these Stable Diffusion model weights that are out there that people can use; I have like versions on my computer, right? And so that’s already out in the world, with weights and biases that were basically created because of the training that it did on these materials. And so you’re not ever going to be able to get all those things deleted off of every computer across the universe. But when it comes to the most powerful models going forward, those are things that are not necessarily shared. Not all of them are open source. So it’s easier to stop people at that level. But the open-source thing does present a big challenge, and it cannot be completely solved, I think. So in some sense, yes, there’s no way that you’re going to get back that stuff. But there are plenty of reasons to think that you have the ability, as the government, to stop people from doing things on a go-forward basis, making new stuff. And so it would just take a longer time to make a big difference in what people are using.

[David Fang] 43:49

I think another way to think about it is that the core reason people are bringing these cases now is probably, ultimately, an economic one: we’re trying to figure out what this new market looks like between the people who create the content originally, and are the copyright owners of that content, and the people that need that content to train these new models that are used for a variety of different things. And in some ways, whether the genie can be put back in the bottle doesn’t matter as much for that question. Because maybe there’s some compensation that didn’t occur before, and through this litigation process, we can achieve some level of compensation and then set a new model going forward. Or, just thinking about it from a plaintiff’s perspective, if I ultimately achieve the result that this is not fair use, then I can go out and issue a bunch of injunctions for future models that basically fall into the same situation, to then enforce my rights on a going-forward basis. So, yes, there is probably a lot of harm that’s done already that can’t be put back in the bottle. But you can also see why these cases are so important, right? Because they literally will define what you can or cannot do on a going-forward basis, and the incentives of the different actors within the space.

[Heather Whitney] 45:05

I think that the compensation thing is super interesting and important, and something that people should sort of think through in these positions. A lot of the discussion, and a lot of the outrage in different communities about the use of their materials without their permission, is because they think that they should be compensated when their works are used as training data. That’s the supposition. And so the question that I find needs to be asked is, how much money are you going to get for your materials that are used as training data when you know that it takes billions and billions and billions of pieces of training data in order to train a single model? How much money each person gets for their one contribution is, in any situation, like a fraction of a fraction of a fraction of a fraction of a penny. And so it’s not clear to me that the real thing that people are concerned about, which is that they think that they should be compensated, is actually solved meaningfully by creating any sort of compensation regime, because you’re going to get like no real money from it. And there are already problems that we have today with the kinds of licensing regimes that we have, where people are giving their images over to Getty and that kind of thing, about how little money they get from that licensing. And that’s when people are licensing their stuff for actually using the expression, right? Here we’re talking about using one thing among billions and billions, basically to understand relations between words or whatever, between concepts in them. I just think people who are concerned about compensation for artists, for anybody whose data is used, need to think about what that compensation model really is supposed to look like, and whether or not it is really compensating people in anything but a symbolic way. And if all that we really get out of it is some kind of symbolism, is that really worth it? Because there are costs associated with having a licensing regime or forced licensing, which is that only some players are going to have the money to pay that, right? Which means that you’re going to have fewer entities that are able to create new models, and then you’re going to see that the same kinds of bigger players are able to do it. So for people who are concerned about, or want there to be, a diversity of people who have access to these models, who can create these models, in some sense the compensation model is not going to actually give anybody any meaningful compensation, and it’s going to raise the bar in order for you to get into making stuff. It’s not clear, like, what are the benefits that you’re really getting out of that? So it’s something to think about.
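
The arithmetic behind the "fraction of a penny" point can be sketched in a few lines. Every number below is an invented placeholder, not a figure from the conversation or from any real licensing deal, and an equal split across all works is assumed purely for illustration. The function name `per_work_payout` is hypothetical.

```python
def per_work_payout(pool_dollars, n_works):
    """Equal-split payout per training work, in dollars (illustrative assumption)."""
    return pool_dollars / n_works


# Hypothetical inputs: a $10 million licensing pool spread over 5 billion works.
pool_dollars = 10_000_000
n_works = 5_000_000_000

payout = per_work_payout(pool_dollars, n_works)
print(f"payout per work: ${payout:.4f} ({payout * 100:.2f} cents)")

# A hypothetical individual creator with 100 works in the training set.
creator_works = 100
print(f"payout for {creator_works} works: ${payout * creator_works:.2f}")
```

Changing the placeholder pool or corpus size shifts the result, but the per-work amount stays small whenever the number of works runs into the billions.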

[Paul Wood] 48:13

Yeah, and I think it’s a great point, because I think this is where the incentives and interests of a huge content creator versus an individual artist really differ. Because if you’re on the side of someone who already aggregates a lot of this content, and you can form a brand new revenue stream, and you have the resources to assert these things already, then it makes a lot of sense to try to go out and get this ruling in your favor. But if you’re an individual, you might get this ruling, but then to practically make that become actual dollars in the pocket, you’re probably gonna need to join some very large organization, and yours is a minor contribution in something very large. So, in some ways, maybe a new law or something interesting can come out of this, because one of the important questions that I think is worth asking as we develop these new tools is, what is the ultimate impact on society, right? And if we’re ultimately impacting a lot of individuals who formerly had good jobs, or were making a living off of this, but then suddenly they’re disrupted, that has huge social ramifications. And sometimes it’s the role of the government to think about those ramifications, and to address them in a way that society can get on board with. So you can see how this brings it back to the original point of why this technology is so important: it has the ability to change our fundamental assumptions about who is creating value and where that value aggregates in our society, and that has big impacts across the board.

[Paul Wood] 50:13

I think that’s an absolutely fascinating place to sort of leave our conversation about input for now. I want to go ahead and shift gears for a moment and talk about AI outputs. How do creations from models like ChatGPT and Midjourney challenge our understanding of copyright?

[Heather Whitney] 50:32

Yeah, so it challenges our conception of what it takes to be an author, or at least it requires us to pay attention to what we think it means to be the author of a work, in a way that we maybe haven’t really been thinking about for a very long time. One of the examples that’s often given is the camera, and photography. When cameras first came out, there was a similar response, which was that there’s no human author of these works, these photos are not copyrightable. Because it’s this machine that’s doing all the work, to transfer the thing with the light into an image, and you aren’t really that involved. Essentially, you’re pushing a button, and that’s all that you’re doing. There’s no human author here, right? And it took time for people to think that actually, no, the camera is a tool that is used by people in order for them to create their own kinds of art, or works, basically. And there is some amount of creative expression in it. So that’s what happened with photography. And then we saw over time that the decisions in photography cases became more and more, I would say, lax, in a certain sense. In the very first case, where the Supreme Court said that photographs can be copyrightable, and that the particular photograph at issue, a picture of Oscar Wilde, was protected, they went into detail about all of the creative decisions that were made by the photographer that were expressed in the final work, lots of things: what was being worn, where the drapes were, basically setting up the composition of the image. Those kinds of choices were trying to evoke a certain feeling, and there’s a lot of discussion about human contribution there. So that makes sense, right? Because you’re trying in the early days to justify why copyright can constitutionally extend to photographs.
But over time, when people were less freaked out by photography, more and more cases were basically saying the default expectation was that you are the author of the photos that you make, right? And you are seeing cases where there’s a copyright in works where you’re like, where is the human expression in this work, right? As an example, wildlife photographers often set up motion detectors in order for pictures of animals to be taken at night when they are not there, because who wants to be there in the middle of the night, and who really wants to be there taking a picture of a lion in the middle of the night? So there are lots of reasons why they do this. They set it up, and then the camera takes the image, and the images are often really awesome. And then they register those works and say that they are the authors of those works. But if you think about it, copyright extends to original works of authorship. There has to be some modicum of the author’s expression in that work. The question is, what is the expression that is present in those works, right? The human wasn’t there. Maybe that’s not a big deal, because they set certain kinds of parameters, like they decided what filter was going to be on the image or where they were going to place the camera, but you have to stretch to see what those would be. And the interesting thing about the use of generative AI is that, in some sense, these artists are doing way more than many, many professional photographers are doing in terms of the creative piece, right? It’s a huge amount of work to set up all of this stuff in the middle of wherever you want to take these pictures. But copyright does not protect, at least in theory, the sweat of the brow idea, right?
It’s the idea that you worked really hard, and therefore you get a copyright; that is not a reason, that is not a part of the test. It’s just, is there copyrightable expression in this work? Yes or no. And so going back to the generative AI situation, we have a client, Kris Kashtanova, where we submitted an application for a work, I want to say recently, but now it’s been quite a bit of time, I think it was like in March or something like this, where we went and put forward all of the different settings that Kris was able to set in order to generate the ultimate image. That included putting in a sketch of a picture, and setting a bunch of these parameters using ControlNet, which gives you, unsurprisingly, given the name, more control over the output. So if you think about all of the creative decisions that were made on that front end there, and you compare that to the kinds of creative decisions that photographers are making on the front end of what they’re doing, it’s hard to see much of a difference, right? It’s actually harder to see why photography gets copyright protection than why, I think, a lot of these generative AI images would, not in all cases, but that’s the same with photography, right? You might think that the 8,000 images that you and I take when we push the button on our phones probably shouldn’t be copyrighted. But they probably are. And so you could have a similarly nuanced view about authorship for generative AI images. But the problem is, it’s hard to do that. Because when you look to register a work, historically, the Copyright Office does not look with a lot of detail at what you were doing. They’re not doing a serious case-by-case analysis of each work in the way that you see patents being analyzed. That would maybe be required for what the correct outcome might be from a copyright sort of black letter law perspective. But that’s not what happens with photography.
And that is probably not able to be really done in the generative AI context, either. So that’s sort of where we are in figuring out what we are going to think an author is today, how the decisions in these cases impact decisions that we’ve already made historically using other technology like cameras, and what we think it means to be an author. It’s not very clear.

[Paul Wood] 57:06

Given all the things we’ve talked about today and the rapid progression of this kind of technology, where do you anticipate copyright law heading? How is it going to adapt or transform in the coming years?

[David Fang] 57:20

I think it’s always very tempting, when new technology comes in, to say the law’s old, it needs to change, we need to flip it in some other way. But I think over time, or at least in how copyright has been interpreted in the courts, in the grand scheme of things, it seems like we’ve done a decent job of interpreting copyright law in general as it evolves. And it has core, long-term principles that I think will still be true even with new technology. So I feel like it’s about trying to understand how copyright law, and those same foundational principles, can be applied to a new technology, rather than necessarily writing a new law entirely. Unless the interpretation ultimately comes out to something that’s so harmful that Congress will need to step in and change the law, effectively. But that’s my interpretation of it. The pace of technology is just so fast, it’s hard to even know what the next year of generative AI will look like. So I can’t imagine being able to write a law that’s set to govern the next generation of new technology. So I’d almost rather stick to existing principles to offer some level of predictability rather than blindly casting a net into what we think the future will hold.

[Heather Whitney] 59:04

There’s a couple of things I would say. One is, we have a recent decision in Reuters v. ROSS, which was a case that was first brought, I think, in 2020.² And it just recently had a denial of summary judgment to both sides. In that decision, you got to see some of the early analysis from a court’s perspective of the fair use question with respect to training a model, basically scraping data off of a site and then using it to train a model. And I think it was pretty favorable to fair use. And so if that’s reflective of what’s going to be happening in the future, I think we should expect that fair use will cover a lot of training. So I think that’s probably the direction that we will be going in. And I think that the court did a good job of analyzing the factors in that case. On the output side, I’m very curious to see what happens, because I think it’s a super difficult question. And I think that it’s not really clear what a good solution is, for the reasons I gave before, because of the decisions that we’ve already made in other areas where we use technology to create works, and how that’s going to apply here. In some sense, it seems reasonable to say that if you basically just push a button, you’re not going to be the author of a thing that just comes out when you push a button. And it also seems to be true that when you’re exercising a huge amount of control, making the creative decisions that then are reflected in that work, you should be. So it’s about finding out where that line is going to be. And I think the difficult part is how you draw this line, given the fact that the Copyright Office does not have the resources to do the thorough kind of analysis that you might need in order for that case-by-case determination to really be happening. I think it’s more likely than not, I guess is what I will say, that ultimately copyright protection does extend to works created using generative AI tools, as in the delta between what you put in and the output will, in many cases, be protected. And I think that’s because of the same sort of pragmatic things that motivated the change and the expansion of protection in photography. And I think there’s the fact that you have a lot of very interested parties who have a decent amount of power in the United States and in the media, who have already been using these kinds of technologies in works like movies, where they’ve invested tons of money. And so their interests are very much aligned with those of others who want to make sure that those works are protected by copyright. And so it would be pretty surprising to me if, in the end, the Copyright Office did not move in that direction. But we will have to wait and see.

[Paul Wood] 1:02:13

Thank you, Heather, and thank you, David, for sharing your invaluable insights with us today. This is obviously a very complex and intricate issue and I hope that our discussion today kind of underscores the evolving nature and challenges it presents. For our listeners, we hope this episode has shed light on some of the nuances of this issue and its implications for copyright law. As we look to the future, the intersection of technology and legal frameworks will only become more intertwined. It’s essential to stay informed. Thank you for tuning in to the BTLJ podcast.

[Yunfei Qiang] 1:03:13

Thank you for listening! This episode is brought to you by BTLJ Podcast team-members Paul Wood, Yunfei Qiang, and Wan-Yi Lin. The BTLJ Podcast is brought to you by Podcast Editors, Eric Ahern, Juliette Draper, and Meg O’Neill. Our Executive Producer is BTLJ Senior Online Content Editor, Linda Chang. BTLJ’s Editors in Chief are Will Kasper and Yuhan Wu.

If you enjoyed our podcast, please support us by subscribing and rating us on Apple Podcasts, Spotify, or wherever you listen to your podcasts. If you have any questions, comments, or suggestions, write us at btljpodcast@gmail.com.

This interview was recorded on October 30, 2023. The information presented here does not constitute legal advice. This podcast is intended for academic and entertainment purposes only.