Navigating Social Media Discovery and Generative AI in the OpenAI ChatGPT Litigation

Association of Certified E-Discovery Specialists (ACEDS)

Introduction

Each week on the Case of the Week I choose a recent decision in ediscovery and talk to you about the practical applications of that case and what you need to be thinking about as you conduct discovery of ESI. 

Let’s dive into this week’s case, which comes to us from the very high-profile In re OpenAI ChatGPT Litigation. This decision, dated May 2, 2024, was authored by United States Magistrate Judge Robert Illman. The issue tags for today’s decision are Failure to Produce, Possession Custody Control, and Social Media. This is a relatively short opinion, but an important one for two reasons: it arises in one of the first cases to involve the generative AI tool ChatGPT, an area where we are all anxiously awaiting discovery case law, and it allows for the discovery of the identities behind social media accounts using pseudonyms.

Background

This is the second decision in this matter in our case law database. We are before the Court on a dispute regarding two discovery requests. Plaintiffs here, a group of authors, allege claims for direct copyright infringement and unfair competition against OpenAI following OpenAI’s use of their copyrighted works to train its Large Language Model (“LLM”), ChatGPT.

Crucial to the underlying case is an understanding of how ChatGPT works. ChatGPT is an LLM created, maintained, and sold by OpenAI. An LLM is a type of artificial intelligence (AI) program that can recognize and generate text and conduct other tasks after being trained on huge sets of data. In essence, an LLM is a computer program that has been fed enough examples to be able to recognize and interpret human language or other types of complex data. [1] ChatGPT is an LLM that has been trained on data scraped from the Internet, including thousands or millions of gigabytes’ worth of text. Plaintiffs here object to OpenAI’s use of their copyrighted works as part of the text used to train ChatGPT. At issue here are two versions of that LLM, GPT-3.5 and GPT-4.  

ChatGPT’s current strength is the summarization of content based on a user question or statement asking the LLM to do something, called a prompt. Since its release in March 2023, businesses have integrated ChatGPT into various applications to leverage that functionality to increase productivity. In fact, shortly following its release, eDiscovery Assistant began working on an implementation and as of September 2024, provides AI generated summaries of each of our more than 36,000 decisions leveraging ChatGPT’s summarization capabilities.

Plaintiffs here allege that because a large language model’s output is reliant on the material in its training dataset, “[e]very time it assembles a text output [in response to user queries], the model relies on the information it extracted from its training dataset,” which results in ChatGPT sometimes generating summaries of Plaintiffs’ copyrighted works and benefiting commercially from the use of Plaintiffs’ and Class members’ copyrighted works.

Facts and Analysis

This dispute is before the Court on letter briefs and revolves around two interrogatories propounded by plaintiffs, ROG 12 and ROG 14. 

ROG 12 sought information about the “identity of social-media usernames of certain current and former employees of Defendants who used personal social-media accounts to communicate on the subjects of the litigation.” Plaintiffs sought an order from the Court directing the defendants (1) to find out whether any current employee or board member has used any of their personal social media accounts to discuss anything relevant to this litigation, and, if so, to produce those individuals’ social media usernames; and (2) to produce the social media usernames of past employees or board members to the extent they are known to Defendants. Plaintiffs argued that Defendants’ directors and personnel may operate some of their personal social media accounts under a pseudonym, making it difficult or impossible for Plaintiffs to uncover discussions relevant to this action on those accounts. As support for their argument, plaintiffs pointed to Elon Musk, a co-founder and early investor in OpenAI, who admitted in a recent deposition in a separate action that he operates two Twitter accounts under a pseudonym.

Defendants argued that the request was overbroad and not tailored as plaintiffs suggest, that OpenAI does not systematically collect information about personal social media accounts from its employees and Board members or monitor those accounts in the ordinary course of business, and that the information requested was not in its possession, custody or control. Strangely, OpenAI argued that the social media information was not within its control because plaintiffs “have identified no legal basis for OpenAI to demand its employees and Board members turn over [username] information about their personal [social media] accounts.” I’m pretty sure that whether or not there is a legal basis for providing the information has nothing to do with whether OpenAI has “control” over it as that term is defined under Rule 34 of the Federal Rules of Civil Procedure.

Unsurprisingly, the Court found that actually asking OpenAI’s current directors and employees whether they have engaged in any relevant discussions using their personal social media accounts was not burdensome and was proportional to the needs of the case. Per the Court, if all current directors and employees report that they have engaged in no such discussions on their social media accounts, then defendants should certify that to plaintiffs. If anyone answers yes, the Court ordered defendants to gather and disclose that person’s relevant social media username(s). As to past employees, the Court ordered defendants to produce the social media usernames “of any such persons if Defendants know, or learn, that any that such persons have engaged in discussions on social media that might be relevant to claims or defenses in this case, and the social media username(s) of such persons are known to Defendants.”

ROG 14 sought information about individuals and entities who possess or have possessed stock or ownership interests in OpenAI greater than five percent. Plaintiffs’ basis for the ROG was to identify anyone that may have relevant discovery as well as their ability to “respond to a judgment” if there is one. 

But the Court found that plaintiffs’ request was too speculative and that they failed to show how anyone owning more than 5% of the shares of OpenAI “actually would have (rather than could have) any relevant documents or information, or that they actually (rather than might have) sought to exert influence or voice concerns about OpenAI’s relevant business decisions.” The Court also rejected the idea that the identity of those individuals had a bearing on the company’s ability to pay a judgment: the “identity of a company’s shareholders does not appear to be something that would shed any light on the company’s financial condition, or the company’s ability to respond to a judgment.”

Takeaways

One of the keys to a quick resolution of this motion is that the Court had the parties file letter briefs rather than going through a protracted motion schedule. We are seeing that more and more with courts on discovery motions, and it is an excellent way to keep cases moving forward. Be aware of your court’s motion practice, and be ready to adhere to it. That means your arguments need to be concise and on point, with no fluff, and you need to be able to present them quickly as those procedures have very short turnaround times. 

What struck me as key in this decision was the potential importance of statements made on social media, and that accounts listed under pseudonyms were implicated. That’s a novel issue in discovery. It raises a host of issues regarding the collection, authentication and presentation of social media, as well as tying an individual to a given account. All of those issues need to be considered early on to ensure that the data is handled in a manner that allows it to be admissible and presented effectively to a judge or jury.

We are also still seeing parties making discovery requests that they can’t defend effectively. The request for social media identities here had merit, but the request for names of investors in OpenAI seemed doomed from the start. Speculation is never going to be sufficient to get you discovery. Provide a factual basis for why you need what you are asking for. Plaintiffs did it with Elon Musk on the social media issue, so why not do the same on the request for investor identities? Failure to do so looks more like you don’t have the basis required to demonstrate relevance, just as the Court found here.

Finally, one of the most interesting considerations in this decision is whether or not it implicates generative AI, and whether we finally have a decision that allows us to add an Issue Tag for Gen AI in eDiscovery Assistant. After much internal debate, our answer is not yet. While the crux of this case is absolutely about generative AI in that ChatGPT is potentially using the plaintiffs’ copyrighted works to provide answers to users, this dispute and the Court’s decision do not directly involve the discovery of data from a Gen AI tool. We’ll be keeping our eyes open for future decisions in this matter that may involve the discovery of Gen AI data.
