Web dev at the end of the world, from Hveragerði, Iceland

An interesting analysis of fair use and generative models

I came across a link to this paper over on Bluesky:

“Generative AI’s Illusory Case for Fair Use by Jacqueline Charlesworth :: SSRN”

Jacqueline Charlesworth, the author, is the former general counsel of the U.S. Copyright Office, so I think it’s reasonable to assume she’s familiar with U.S. copyright law.

It’s interesting to see how the opinions of legal scholars on the applicability of fair use to generative model training have shifted over time as more accurate explanations of how the technology works become more accessible.

For these and other reasons, each of the four factors of section 107 of the Copyright Act weighs against AI’s claim of fair use, especially when considered against the backdrop of a rapidly evolving market for licensed use of training materials.

The conclusion is quite damning.

The fair use case for generative AI rests in part on an inaccurate portrayal of the functioning of AI systems. Contrary to the suggestion that the works on which AI systems are trained are set aside after the training process, in fact they have been algorithmically incorporated into and continue to be exploited by the model.

Converting a work to tokens and then statistically incorporating that token stream into a model in order to capture its expression does not easily fit within the scope of fair use.

This critical distinction between expressive and nonexpressive exploitation sharply differentiates copying to train and develop generative AI models from uses determined to be fair in other technological contexts. Courts in earlier cases have been careful to distinguish between the copying of expressive works to facilitate a functional objective such as searching, indexing or interoperability, which may be deemed fair, and the exploitation of protected expression for its own sake. Examined in this light, the fair use case for mass unauthorized copying by commercial AI entities is revealed as illusory. Appropriation of the world’s literature, art, and music by for-profit companies to generate content from that material—including content that competes with the works so appropriated—is not excused by any precedent of fair use. It is without precedent.

It’s hard to justify unprecedented acts with precedents.

As noted above, the copied materials are converted into standardized formats in order to carry out the training process. This does not negate a finding of infringement, as it is well established that encoding a copyrighted work in a more convenient or usable format is an act of copying that does not itself qualify as a transformative under the criteria for fair use. In an influential case, for example, the court rejected the claim that a service’s conversion of user-purchased music CDs into digital files so the songs could be streamed back to their owners was a fair use of the copyrighted works.

So courts could conclude that training generative models is not fair use without throwing out other forms of programmatic fair use.

As explained above, training materials do not disappear as the model is built; rather, each work is algorithmically ingested, piece by piece—or token by token—into the model. Nor is there any practice of separating copyrightable from uncopyrightable elements during the training process. The tokens themselves are encoded—not just “statistical data” or “information” about them. Of course, this is only logical; the whole point of the training exercise is to capture and map the expressive content of each work for use in the generative process.

Tokenizing and converting the work to better suit training is also not transformative use.

This leaves us with the bare claim that AI copying should be considered transformative because it enables the generative capabilities of AI models. This broad contention is untethered to the use of any particular work or works, but instead boils down to an assertion that mass appropriation of protected works is justified because extensive copying is necessary to build and operate such systems. In effect, then, it amounts to a policy argument that the rights of copyright owners must yield to the presumed social benefits of generative AI technology.

A policy argument is not a legal defence when you’re being sued.

The narrative being promoted by AI companies and their defenders is that licensing content to train and develop AI systems is “impossible.” But AI companies have demonstrated that they are capable of entering into license arrangements when they see value in the licensed content.

AI companies’ own licensing efforts undermine their argument that paying for training data is impossible and that copyright owners aren’t losing out on revenue from the works being appropriated.

Overall, the paper is worth checking out. Most of the language is accessible and doesn’t rely too heavily on legal jargon.
