What Thomson Reuters v. Ross Does and Doesn’t Say About Fair Use and Generative AI

BakerHostetler

The first 24 hours of punditry on Judge Stephanos Bibas’s summary judgment of no fair use in Thomson Reuters v. Ross Intelligence, Inc., Case 1:20-cv-00613-SB (D. Del.), has largely oscillated between predictions that the decision destroys fair use defenses in the pending generative AI copyright litigations and predictions that the decision is entirely irrelevant to those cases. It is neither of those things.

Bibas obviously spent an enormous amount of time and mental energy trying to get the fair use analysis right. Aspects of that analysis will surely be persuasive to subsequent courts assessing the GenAI fair use question. However, other aspects are simply inapt in the GenAI context, or courts might find that Bibas got things right the first time in his 2023 summary judgment decision.

And finally, the decision fails to address, never mind distinguish, the intermediate copying case Assessment Techs. of WI, LLC v. WIREdata, Inc., 350 F.3d 640 (7th Cir. 2003), in which a protected database was copied to extract unprotectable facts. The WIREdata case alone, which has not gotten nearly enough attention in the literature assessing fair use in the GenAI cases, undermines Bibas’s conclusion that intermediate copying applies only to computer source code cases.

The decision is not determinative or directly relevant to the GenAI cases…

The AI training at issue in the Ross case is factually distinct from AI training for foundation models. Ross’s AI training occurred on a “small model” basis, meaning that it trained only on the narrow dataset of Westlaw headnotes and other legal material in order to create a legal research search engine.

Large language models (LLMs) and diffusion models, on the other hand, ingest billions of pieces of content and other data that are immediately tokenized and jumbled up to detect patterns of textual or visual creation to create output that is distinct from the training data (putting aside for the moment problems of memorization). In other words, the foundation models that are primarily at issue in the GenAI cases, when they work correctly, are general purpose machines, while the Ross AI model was specifically trained on legal research services content in order to produce other legal research services.

The sheer scale of the foundation model training makes this recent Ross decision immediately distinguishable on the issue of transformativeness. Bibas seems to have acknowledged as much in his comment that “[b]ecause the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.”

Bibas’s new summary judgment decision was also presaged in his 2023 summary judgment decision, where he noted that Ross’s use “was transformative intermediate copying if Ross’s AI only studied the language patterns in the headnotes to learn how to produce judicial opinion quotes. But if Thomson Reuters is right that Ross used the untransformed text of headnotes to get its AI to replicate and reproduce the creative drafting done by Westlaw’s attorney-editors, then Ross’s comparisons to cases like Sega and Sony are not apt.” This distinction between AI that learns how to produce new content and AI that simply replicates training data will remain important in the GenAI cases.

Bibas also took a narrow approach to the requirement that the copying be “reasonably necessary to achieve the user’s new purpose” as articulated in Warhol v. Goldsmith. Judge Bibas found that Ross “took the headnotes to make it easier to develop a competing legal research tool.” By contrast, there is little dispute that foundation models need vast amounts of data to function and protect against “data drift” and “LLM collapse.” Foundation model developers have used the “entire internet” not because it made creation of foundation models easier, but because foundation models could not exist without it.

Finally, Judge Bibas, in reconsidering his prior order, came to believe that Ross was acting as a direct market substitute for the Westlaw legal research platform. As I have argued in prior posts, Goldsmith does not do much to support the claims of content owners in the GenAI cases, as “the concern [in Goldsmith] is not over supplanting a theoretical demand for a theoretical work, or for a class of works. Rather, the concern is for specific, granular substitution for specific, granular works.” Ross specifically threatened to act as a substitute for Westlaw and its headnotes.

…but it is not entirely irrelevant either

Bibas’s opinion will surely be studied by courts grappling with fair use issues in the GenAI cases, and several aspects of his opinion transcend the factual distinctions highlighted above.

One of those is Bibas’s willingness to assume that Thomson Reuters has a potential market for “data to train legal AIs.” Initially in his 2023 opinion, Bibas found a factual question existed because Ross argued that “Thomson Reuters has never participated—and never would participate—in this market for training data.” In this new opinion, however, Bibas found that Ross had not put on enough evidence to show that the training data market did not potentially exist. This is an enormously complicated question in the GenAI cases. On the one hand, content licensing organizations and start-ups have been scrambling to offer licenses to content-rich datasets, including those offered by Getty Images, the Copyright Clearance Center, RELX, and start-ups like Created by Humans and Calliope Networks. On the other hand, those licenses are for access to large datasets, not to any one individual work, and the vast majority of plaintiffs in the GenAI cases (with the notable exception of Getty Images) hold copyrights in individual works, not datasets. So how the “market” is defined will be tremendously important.

Bibas’s narrow interpretation of the intermediate copying cases of Sony Comput. Ent., Inc. v. Connectix Corp., 203 F.3d 596 (9th Cir. 2000) and Sega Enters. v. Accolade, Inc., 977 F.2d 1510 (9th Cir. 1992) is also potentially relevant to the GenAI cases. Bibas drew two distinctions regarding these cases: first, that they involved computer code, which differs from other kinds of copyrighted content, and second, that the copying was strictly necessary to obtain access to the underlying ideas and functionality behind the code. Future courts addressing GenAI cases may be swayed by these distinctions, though I would suggest they ought not be. First, it’s not clear why the nature of the copyrighted work should have mattered in the Ross case. Bibas noted in his discussion of factor 2 that the works at issue exhibited minimal creativity, so they would appear to be far from the “heart of copyrightability,” as is the case with source code. Second, Bibas seems to have transformed the “reasonably necessary” standard of Goldsmith to one of “strict necessity.” This would seem to be bad copyright policy, as we should encourage fair uses that make use of otherwise unprotectable ideas and functions, not guard them behind a wall of strict necessity.

More fundamentally, however, Bibas’s sole reliance on computer program cases to distinguish intermediate copying misses a very important intermediate copying case, which brings us to…

WIREdata, WIREdata, WIREdata

In Assessment Techs. v. WIREdata, 350 F.3d 640, 644 (7th Cir. 2003), Judge Richard Posner noted in dicta that the extraction of raw data from a database was likely to be fair use. In that case, the defendant wished to obtain access to database software created by the plaintiff to extract data relating to real estate tax assessments. The database software was licensed from the plaintiff to municipalities. The municipalities refused to give the defendant access for fear that the defendant’s access would violate the plaintiff’s copyright. Posner began the decision by noting that the database software owner was attempting to “block access to data that not only are neither copyrightable nor copyrighted, but were not created or obtained by the copyright owner. … It would be appalling if such an attempt could succeed.” Id. at 640.

Although the work at issue involved software, the copyright at issue in the case was “of a compilation format, and the general issue that the appeal presents is the right of the owner of such a copyright to prevent his customers (that is, the copyright licensees) from disclosing the compiled data even if the data are in the public domain.” Thus the questions presented to the Seventh Circuit were “[H]ow are the data to be extracted from the database without infringing the copyright? Or, what is not quite the same question, how can the data be separated from the tables and fields to which they are allocated by [the database program]?” Id. at 643. The court found that the “process of extracting the raw data from the database does not involve copying,” and thus was not an infringement. Id. at 644. However, the court also noted that even if copying were necessary, it would be “intermediate copying” necessary to extract the raw data, and thus a fair use. Id. at 645. The court found that “the only purpose of the copying would be to extract noncopyrighted material, and not to go into competition with [plaintiff] by selling copies of [the computer program].” Id. “[A]ll that matters is that the process of extracting the raw data from the database does not involve copying [data management software], or creating … a derivative work.”

In drawing the distinction between data extraction and infringing copying, Posner drew an analogy to Thomson Reuters’s Westlaw that seems particularly apt in the Ross case, finding that data extraction:

would be like a Westlaw licensee’s copying the text of a federal judicial opinion that he found in the Westlaw opinion database and giving it to someone else. Westlaw’s compilation of federal judicial opinions is copyrighted and copyrightable because it involves discretionary judgments regarding selection and arrangement. But the opinions themselves are in the public domain (federal law forbids assertion of copyright in federal documents, 17 U.S.C. § 105), and so Westlaw cannot prevent its licensees from copying the opinions themselves as distinct from the aspects of the database that are copyrighted.

The 2023 decision described Ross’s use of the Westlaw headnotes in a remarkably similar fashion:

Ross describes its process of transforming the Bulk Memos like this: First, it receives the Bulk Memos in its database. Then, it converts the plain-language entries into numerical data. Next, it feeds that data into its machine-learning algorithm to teach the artificial intelligence about legal language. The idea is that the artificial intelligence will be able to recognize patterns in the question-answer pairs. It can then use those patterns to find answers not just to the exact questions fed into it, but to all sorts of legal questions users might ask.

The logic underpinning Posner’s decision in WIREdata appears directly relevant to both Ross’s AI training and the AI training in the GenAI cases. The ability to detect patterns, and the ability to get at the heart of how language and creativity work, are not the subjects of copyright protection. Only individual expression is. Perhaps Ross’s intent to create a competing product would have been enough to thwart its intermediate copying defense in any event, but Judge Bibas’s apparent belief that intermediate copying is the sole province of computer code cases is simply not correct. Hopefully future courts give more attention to WIREdata and the lessons it teaches.

DISCLAIMER: Because of the generality of this update, the information provided herein may not be applicable in all situations and should not be acted upon without specific legal advice based on particular situations. Attorney Advertising.

© BakerHostetler
