The Licensing Vector: A Fair Approach to Content Use in LLMs

“Despite efforts by LLM providers to avoid reproducing lengthy excerpts from single works, strings of words from ingested works persist in LLMs. This has significant legal implications….”

A spate of recent lawsuits is shining a light on how some generative AI (GenAI) companies are using copyrighted materials, without permission, as a core part of their products. Among the most recent examples is the New York Times Company’s lawsuit against OpenAI, which alleges a variety of copyright-related claims. For their part, some GenAI companies like OpenAI argue that there is no infringement, either because there is no “copying” of protected materials or because the copyright principle of fair use uniformly applies to generative AI activities. These arguments are deeply flawed and gloss over crucial technical and legal issues. They also divert attention from the fact that it is not only possible but practical to be pro-copyright and pro-AI.

Copyright and technology both move society forward. The goal of copyright, as articulated in the U.S. Constitution, is to promote progress. This goal also can be achieved, albeit in different ways, by technological advances, including AI systems. Copyright and technology are not enemies, but instead can work together when there is respect for the copyright laws that encourage creation of the trusted content that technologies require.

Understanding How LLMs Operate

AI is a great example of this relationship. GenAI systems use copies of content like books and articles, many of which are protected by copyright, for training their LLMs. Copyrighted content is pivotal for training because the LLM’s performance on a wide range of linguistic tasks benefits significantly from using these materials. LLMs generally keep local copies of content to expedite the learning process and to provide access to the original dataset for adjustments during the training stage. This content is turned into tokens, which, for text-based LLMs, are smaller representations of words in natural language. Tokenization breaks words down into normalized sequences of characters. Once LLMs map the input text into tokens, they encode the tokens as numbers and convert word sequences into “vectors” referred to as “word embeddings.” A vector is an ordered set of numbers; think of it as a row or column in a table.
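To make this pipeline concrete, below is a minimal sketch of the text-to-token-to-vector steps just described. Everything in it is an illustrative toy: the whitespace tokenizer, the six-word vocabulary, and the random stand-in embedding values are not any real model’s internals, and production LLMs use learned subword tokenizers and embeddings with thousands of dimensions.

```python
import random

sentence = "Rome is the capital of Italy"

# Step 1: tokenize. Here, naive whitespace splitting on lowercased text;
# real LLMs use learned subword tokenizers such as byte-pair encoding.
tokens = sentence.lower().split()  # ['rome', 'is', 'the', 'capital', 'of', 'italy']

# Step 2: encode each token as an integer id from a (toy) vocabulary.
vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [vocab[tok] for tok in tokens]  # [4, 1, 5, 0, 3, 2]

# Step 3: map each id to a vector (a "word embedding"). Real embeddings
# are learned during training; random values stand in for them here.
random.seed(0)
dim = 4
embedding = {i: [random.uniform(-1, 1) for _ in range(dim)] for i in vocab.values()}
vectors = [embedding[i] for i in token_ids]  # one ordered row of numbers per token

print(vectors[0])  # the vector for 'rome': an ordered set of numbers
```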

Word embeddings are important for copyright (and the GenAI lawsuits) because they preserve the relationships between words in the ingested content, forming representations (encodings) of entire sentences, even paragraphs, and, in vector combinations, even entire documents. So, contrary to a prevalent misconception, ingesting text to train LLMs does not deconstruct the copied material the way indexing does for search purposes.
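The contrast with search indexing can be sketched in a few lines. The two-sentence corpus below is invented purely for illustration: an inverted index of the kind built for search discards word order entirely, while ingestion for training keeps ordered spans of text in which the word relationships survive.

```python
# An illustrative contrast between search indexing and LLM-style ingestion.
docs = {
    0: "the quick brown fox jumps over the lazy dog",
    1: "the lazy dog sleeps all day",
}

# A search-style inverted index maps each word to the documents containing
# it. Word ORDER is discarded: the original text cannot be read back out.
inverted_index = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        inverted_index.setdefault(word, set()).add(doc_id)

print(inverted_index["lazy"])  # {0, 1} -- locations only, no word sequence

# Ingestion for LLM training instead works on ordered spans ("chunks"),
# inside which the relationships between neighboring words survive.
def chunks(text, size=4):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

print(chunks(docs[0])[:2])  # ['the quick brown fox', 'quick brown fox jumps']
```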

Instead, text training for LLMs involves “chunking”: breaking the material down into smaller units while retaining the word relationships within those units. This semantic characteristic is key to LLMs, because it lets them capture and store the meaning, as well as the relationships, of sequences of words from natural language. For example, this is how the machine “understands” that the association between “Washington” and “United States” mirrors that of “Rome” and “Italy,” even though those words are lexicographically unrelated.
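The “Washington”/“Rome” example can be reproduced in miniature with vector arithmetic. The two-dimensional vectors below are invented purely for illustration (real embeddings are learned from data and far higher-dimensional), but the offset arithmetic is the same idea.

```python
import math

# Toy 2-D "embeddings", invented for illustration only.
vec = {
    "washington":    [0.9, 0.1],
    "united_states": [0.9, 0.8],
    "rome":          [0.2, 0.1],
    "italy":         [0.2, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# The capital-to-country offset computed from one pair...
offset = [c - w for w, c in zip(vec["washington"], vec["united_states"])]

# ...applied to the other capital lands on (or near) its country.
predicted = [r + o for r, o in zip(vec["rome"], offset)]

print(cosine(predicted, vec["italy"]))  # approx. 1.0: the analogy holds
```

In a trained model, of course, the capital-to-country offset is not hand-placed as it is here; it emerges from the statistics of the ingested text, which is precisely why the quality of that text matters.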

In simple terms, LLMs operate as colossal prediction machines, using training datasets to forecast the “next best word” or other elements, such as musical chords or pixels. It’s like cutting a book into small pieces, each containing a few sentences or paragraphs. These small book pieces are like word embeddings, in that the relationships between the words within those small pieces are maintained. Put differently, despite efforts by LLM providers to avoid reproducing lengthy excerpts from single works, strings of words from ingested works persist in LLMs. This has significant legal implications as both the original and tokenized datasets constitute reproductions, potentially influencing licensing requirements.
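A toy “next best word” machine makes the persistence point concrete. The sketch below substitutes a simple bigram frequency table for a neural network, and a public-domain Dickens sentence stands in for a training corpus, but the effect is the same in kind: greedy prediction can reproduce strings of words from the ingested text verbatim.

```python
from collections import Counter, defaultdict

# A public-domain Dickens sentence stands in for a training corpus.
training_text = "it was the best of times it was the worst of times"
words = training_text.split()

# "Training": count which word follows which (a bigram frequency table).
follows = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

# "Generation": repeatedly emit the most likely next word.
def generate(start, length=9):
    out = [start]
    for _ in range(length):
        candidates = follows[out[-1]].most_common(1)
        if not candidates:
            break
        out.append(candidates[0][0])
    return " ".join(out)

print(generate("it"))
# -> 'it was the best of times it was the best'
# Strings of words from the ingested text persist and reappear verbatim.
```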

How Content Use in LLMs Relates to Copyright

How does this relate to the lawsuits? It relates to the copyright infringement claims because making a “tokenized” dataset that LLMs use to create outputs like texts, images, and music (and, later, copying that dataset itself) implicates copyright rights, including the right of reproduction, and can be infringing because LLMs contain unauthorized copies. Looking at the issues in this light, the legal analysis is more straightforward than many accounts might lead one to believe.

In addition to potential infringement at the “input” or learning stage, some LLM outputs will infringe if, for example, they are substantially similar to copyrighted material or are what U.S. law calls a “derivative work,” a way of transforming, recasting or adapting copyrighted material, for example, as with a movie based on a novel. Only if there is a material transformation that provides benefits sufficiently different from the original work would an output become a fair use beyond the copyright owner’s reach.

Many generative AI proponents argue that copyright’s fair use exception uniformly exempts a vast swath of generative AI functions from liability. Fair use, however, is a highly fact-specific inquiry, making it impossible to claim that all imaginable AI uses of copyrighted materials are fair. Supporters of fair use point to a U.S. Court of Appeals for the Second Circuit opinion that found that Google’s digitization of books to make them searchable online and then provide snippets was a fair use.

While the Google Books case did address mass copying and then-emerging technology, the court also found that the “more the appropriator is using the copied material for new, transformative purposes, the more it serves copyright’s goal of enriching public knowledge and the less likely it is that the appropriation will serve as a substitute for the original or its plausible derivatives.” The case was also followed by a recent U.S. Supreme Court decision, Andy Warhol Foundation v. Goldsmith, where the Court noted that “a court must consider each use within the whole to determine whether the copying is fair.”

Basically, even if the original copying of a book for machine learning (say, for noncommercial research purposes) was fair, its later use in a different context may not be. The Supreme Court further reiterated that commercial use weighs against fair use and emphasized that uses that substitute for the original work weigh against a finding of fair use. It is also important to note that fair use is not a universal standard; only a handful of countries recognize it, and other countries apply different exceptions and limitations that would have to be independently analyzed.

Licensing Is the Way Forward

Responsibly and fairly trained LLMs that use authoritative, trusted content and respect copyright laws and copyright owners will produce better outcomes for everyone. Copies are undoubtedly made in the LLM training process, and copyright laws apply to the copying of protected works. Licensing is the most efficient approach to bringing AI technologies and copyright together. Lawsuits and legislation will take time and likely will not all reach the same conclusion, but licensing can help now by enabling copyright owners and users to agree on how to responsibly use copyrighted works. This includes both direct licenses and voluntary collective licenses, which together can provide a solid foundation for AI systems to continue to innovate.





Join the Discussion


Anon
April 10, 2024 01:03 pm

Sorry, not sorry, but this article is wrong as a matter of both technical and legal aspects.

The question must be broken down into two separate aspects – that of training in the build of an AI engine, and that of use (as at least another party becomes involved in a given prompt, post build).

I would suggest that the author’s employment – that of the CCC – is more than a little intrusive here, notwithstanding the disclaimer that this is a personal opinion of the author.
