Key Takeaways
Overview
In May 2025, the U.S. Copyright Office released a “prepublication” version of Part Three of its artificial intelligence (AI) report addressing generative AI training (see our Update on part two here). After examining how copyrighted works are used in the development of generative AI systems, the report discusses whether those uses may present prima facie cases of copyright infringement, the applicability of the fair use defense to AI training, and potential licensing models for AI training data, should licensing be required. The report’s conclusions are discussed in more detail below, but here are the highlights:
- The report finds that some aspects of generative AI likely constitute prima facie copyright infringement, including potentially the AI models themselves (claiming that model weights may in some cases contain copies of training data).
- Regarding the fair use defense, while acknowledging that training AI models on a large and diverse dataset will often be transformative (a key factor in determining fair use), the report concludes that this will depend on how the model is deployed (with models used to generate outputs similar to their training data likely being less transformative).
- “Knowing use” of a “pirated” dataset, while not determinative, is found to weigh against a finding of fair use, whereas guardrails deployed by developers to try to restrict infringing output can enhance transformativeness and favor fair use.
- Models that can generate copies, summaries, or abridgments of the works they were trained on can be a substitute for those works, which can cause market harm that weighs against fair use.
- A new theory of market harm is recognized, based on flooding the market for a particular type of work with AI-generated works, even without a showing that any output is a market substitute for any specific works. Market harm can result from generating works that are stylistically similar to the training data (even though style is not protectable by copyright).
- The report recognizes that questions remain about whether voluntary licensing is feasible and can fully meet the needs of modern AI development but concludes that voluntary licensing should be allowed to develop further for now, and interventions should be considered only if market failures are shown for specific types of works in specific contexts.
It is important to note that the Copyright Office’s position is not legally binding, and a court could disagree with the report. Nonetheless, the report is significant because the Copyright Office determines what gets registered, and its views are often considered persuasive authority. The day after this report was issued, Shira Perlmutter, the Register of Copyrights, was fired, which raises questions about whether significant changes will be made when the report is finalized. However, until case law develops or legislation is passed, the report is one of the only sources of available guidance on these issues, and understanding how the Copyright Office views these issues can be important for anyone engaged in developing, deploying, or using generative AI models.
Prima Facie Infringement
The report first looks at how copyrighted material is used in creating and deploying generative AI to determine whether such use constitutes prima facie copyright infringement. It concludes that creating a training dataset “clearly implicates” the reproduction right because it requires copying, and often modifying, training data (even if the data is deleted soon after its use). Training models also implicate the reproduction right, as copies of the data are made in the training process. Thus, the report concludes that if these actions involve the use of copyrighted material without consent, they likely constitute a prima facie case of copyright infringement.
More controversially, the report takes the position that model weights themselves may “contain copies of works in the training data,” and, if so, copying or distributing the models could also constitute prima facie infringement, and the model could be considered a derivative work. Although the Copyright Office acknowledges that the notion of models memorizing training data is disputed, it states that when a model can generate a verbatim or substantially similar copy of its training data without that data being given as a prompt or other input, the reproduced data must exist in some form in the model’s weights.
The report also finds that the use of copyrighted works in retrieval-augmented generation (RAG) may constitute prima facie infringement. RAG, a technique for enriching model prompts with additional content in order to generate outputs informed by that content, involves making copies of that additional content and inputting those copies as part of an updated prompt. With respect to the outputs of generative AI tools, the report concludes that outputs that are substantially similar to a copyrighted work “likely infringe the reproduction right” and may also infringe the right to prepare derivative works and to publicly display and perform a work, depending on the context.
Fair Use and AI Training
Even if certain uses of generative AI can be shown to create a prima facie case of copyright infringement, defenses may apply, such as the fair use defense. The fair use defense protects certain beneficial uses of a work for purposes such as criticism, commentary, news reporting, teaching, scholarship, or research, and it utilizes a balancing test that looks at four different factors as set forth in the fair use statute.[1] After briefly touching on the sharply divided public comments received by the Copyright Office on fair use, the report applies each fair use factor to AI training.
Factor 1: Purpose and Character of the Use
The first fair use factor, the purpose and character of the use, looks primarily at whether the work is commercial and whether it is transformative (i.e., whether it “adds something new, with a further purpose or different character”). The report acknowledges that in light of the Supreme Court of the United States’ decision in the Andy Warhol case, the various subsidiary uses of a work during the wider training process should be analyzed individually.[2] However, it notes that the specific use at issue should be reviewed in the context of the ultimate use by the developer or user of the AI system.
In analyzing transformativeness, the report initially recognizes that “training a generative AI foundation model on a large and diverse dataset will often be transformative.” However, it notes that the functionality of a model and how it is deployed may narrow a finding of transformativeness. For example, it finds that a model used to generate outputs similar to, or aimed at the same audience as, its copyrighted training data is less transformative, while a model deployed for a nonsubstitutive task, such as content moderation or removing distortion from audio, is more transformative.
The Copyright Office also distinguished some earlier cases that found fair use in connection with making “intermediate” copies to reverse engineer video games for the purpose of making competing games.[3] These cases are widely cited to support the notion that the copies made in the process of training models are fair use, but the Copyright Office found “meaningful distinctions” between the intermediate copying cases (where intermediate copies were made in order to access functional material) and generative AI training (where intermediate copies are made to access expressive material).
The report also expressly rejects two other common arguments for AI training being inherently transformative. First, it rejects the argument that training is inherently a nonexpressive use of copyrighted works, reasoning that models sometimes generate or reproduce expressive content. It also rejects the idea that AI training is transformative because “it is like human learning,” noting that human learning differs in material ways from AI training and, in any event, must still occur within the bounds of copyright and fair use law.
A key focus of the report is the commerciality portion of the first fair use factor, evaluating both whether dataset creation and training are done for commercial purposes and whether such activities are funded from commercial sources. Ultimately, the report decides that commerciality turns on the degree of connection between the alleged infringer’s copying and the furthering of a commercial purpose, and that an actor’s for-profit or nonprofit status is not determinative. Additionally, it states that “knowing use” of a training dataset that consists of pirated or illegally obtained works is relevant to the first fair use factor and, while not determinative, should weigh against a fair use finding.
Factor 2: Nature of the Copyrighted Work
The report spends relatively little time commenting on the second fair use factor, the nature of the copyrighted work, as this factor will be fact-specific. But it notes that a fair use defense will be less likely to succeed when the works used to train the model are more expressive or are unpublished.
Factor 3: Amount and Substantiality
The third fair use factor looks at the amount and substantiality of the portion used in relation to the copyrighted work as a whole (considered both quantitatively and qualitatively). While copying entire works typically weighs against fair use, this may be mitigated when the use is transformative, when copying the entire work is necessary for the technology’s function, and when little or none of the original content is made accessible to the public. The report notes that the use of entire works “appears to be practically necessary for some forms of training for many generative AI models” and highlights that the third factor’s weight can be mitigated if the copied material is not made publicly accessible or if effective safeguards are in place to prevent the output of protected content. Ultimately, the report concludes that where AI models can potentially generate outputs that reproduce protected expression, the third factor will turn in part on whether the developer has adopted adequate safeguards to limit the model from reproducing such expression.
Factor 4: Market Harm
The fourth fair use factor examines “the effect of the use upon the potential market for or value of the copyrighted work,” and the report notes that this is often considered the most important element of fair use. The report concludes that where training enables a model to output verbatim or substantially similar copies that are accessible to users, there is a clear risk of market harm, especially for works or datasets specifically developed for AI training (as unlicensed use could significantly erode the established market for such content). Additionally, the use of RAG systems can result in market substitution if users rely on the AI-generated content instead of accessing the original work.
The report further explains that in addition to the impact on the market for the specific work alleged to have been infringed, the fourth factor should also consider the markets for works of the same kind and the broader market for works by the same creator. Generative AI outputs can imitate the style of particular writers, which may compete with the original authors’ works and diminish their value in the marketplace. Lost revenue in actual or potential licensing markets is also recognized as an element of market harm, which includes not only established licensing channels, but also those that are “traditional, reasonable, or likely to be developed.”
Finally, in evaluating the fourth factor, some courts consider whether the public benefits of unlicensed AI training (such as augmenting human creativity and enabling innovation) might outweigh potential market harm. The report states that “the Copyright Office cannot conclude that unlicensed use of copyrighted works for training offers copyright-related benefits that would change the fair use balance, apart from those already considered.”
Ultimately, the report concludes that copying for AI training poses significant risks to the market for copyrighted works, but the degree of market harm in a particular case will depend on whether licensing for AI training is available or likely to develop. Where such licensing options are available to meet particular AI training needs, unlicensed use will generally weigh against fair use. If, however, barriers to licensing prove insurmountable for certain types of works, the fourth factor may favor fair use, as there will be no functioning market to harm.
Licensing for AI Training
The report examines which forms of licensing can best accommodate the interests of both copyright owners and AI companies, to the extent licensing is required. It finds that voluntary licensing—whether negotiated directly or on a collective basis—is already occurring in the AI sector, but questions remain about whether such voluntary licensing is feasible and can fully meet the needs of modern AI development.
The report finds there was little support among commenters for statutory approaches, noting that compulsory licenses have historically been adopted only where Congress determined that the free market was incapable of supporting effective or efficient voluntary licensing, and that compulsory licenses should be enacted only in exceptional cases. The report concludes that the licensing market for AI training should be allowed to develop further without government intervention, and that targeted interventions should be considered only if market failures are shown for specific types of works in specific contexts.
Similarly, the report rejects an opt-out mechanism that would allow copyright owners to exclude their works from AI training by taking affirmative steps rather than requiring explicit permission before use, finding this approach inconsistent with the current “opt-in” framework of U.S. copyright law, unduly burdensome for rightsholders, and technically unreliable.
Final Report
The prepublication version of the report indicates that a final version will be published soon and that no substantive changes in the analysis or conclusions are expected in the final version. However, the impact that the removal of Shira Perlmutter as Register of Copyrights may have on the final report remains to be seen.
Endnotes
[1] 17 U.S.C. § 107.
[2] See Andy Warhol Found. for the Visual Arts, Inc. v. Goldsmith, 598 U.S. 508, 534 (2023).
[3] See, e.g., Sega Enters. Ltd. v. Accolade, Inc., 977 F.2d 1510, 1518 (9th Cir. 1992), as amended (Jan. 6, 1993); Sony Computer Ent., Inc. v. Connectix Corp., 203 F.3d 596, 608 (9th Cir. 2000).