Organizations of all sizes are asking legal teams to go beyond pure legal risk management and compliance, become more cost-efficient, and manage more cases involving larger data volumes and new and emerging data sources.
The well-timed arrival of generative AI answers this call to do more with less for many legal teams, but all AI tools and the models on which they run require training to be effective. High-quality automation results from high-quality training, and both depend on having AI-ready data for the job at hand. Minimally, this means data must be suitable and adequate to properly train an algorithm, as well as properly sourced and procured.
Many legal teams now embrace workflow audits to identify the areas of greatest potential and prioritize where and when to roll out generative AI, but these audits all too often overlook data readiness assessments. Even disciplines with sky-high potential for generative AI will fail without the AI-ready data and expertise required to effectively and compliantly train the model.
When data is not AI-ready, outcomes can be disastrous. Data pitfalls are already decimating big brand ad campaigns, products, and entire businesses.
Steep fines and “algorithmic disgorgement”
The FTC has built a track record in recent years of policing the use of AI to protect consumers based largely on the concept of ill-gotten data. Although generative AI is much newer, early litigation suggests similar themes will dominate efforts of artists, creators, and other content owners to protect themselves and their craft from a wave of new generative AI tools.
Meta received one of the first multi-billion-dollar data misuse penalties when Facebook was fined $5 billion in 2019 and another $1.3 billion in the European Union for data privacy violations in 2023. Also in 2023, the FTC levied fines for data collection abuses and inadequate disclosures, costing Amazon $25 million for violations related to data used by its Ring doorbells and Alexa assistant and Microsoft $20 million for violations related to children’s Xbox accounts. Fines large and small are just one piece of the negative repercussions.
In a 2021 Yale Journal of Law and Technology article, FTC Commissioner Rebecca Slaughter wrote about the ordered destruction of algorithms. “The premise is simple,” she wrote. “When companies collect data illegally, they should not be able to profit from either the data or any algorithm developed using it.”
Early examples of FTC-ordered algorithm destruction include Weight Watchers and Everalbum. In 2022, Weight Watchers was required to delete all improperly obtained personal information related to children under the age of 13, pay a relatively small fee, and destroy any algorithms derived from the data. Everalbum misused facial recognition, and as part of its settlement with the FTC, the company was required to delete deactivated users’ photos and videos and destroy the algorithms developed with those user photos and videos.
Generative AI prompting similar litigation
Litigation involving generative AI is newer and still playing out, including a copyright case filed against GitHub, OpenAI, and Microsoft related to Copilot. Multiple plaintiffs claim that Copilot, a generative AI code-suggestion tool, “will reproduce publicly shared code in violation of copyright law and software licensing requirements.” The lawsuit alleges that the creation of Copilot relies on software piracy on an unprecedented scale. In another example of ill-gotten data for generative AI, Getty Images sued Stability AI. Getty claims Stability AI illegally used its copyrighted library of images to train Stable Diffusion, a popular AI art tool. The case is set to go to court in the U.K., but images like the following seem to speak volumes. In this case, Stable Diffusion recreated a Getty Images watermark as part of an image.
Caption: From The Verge, an image created by Stable Diffusion showing a recreation of Getty Images’ watermark.
Ensure data provenance for long-term AI viability
Data provenance documents the origin of data, how it was acquired, who owns it, and for what purposes it may be used. Data provenance has long been an indicator of validity when it comes to research data, including medical research. According to the National Library of Medicine, “the purpose of data provenance is to tell researchers the origin, changes to, and details supporting the confidence or validity of research data.”
Likewise, for training algorithms and language models, data provenance helps ensure the validity of data and how effective it may be for training by shedding light on its origins and other details. It also documents where and how the data was acquired and the specific permissions obtained for its collection and use.
All data must be properly sourced with adequate disclosures that are specific enough to cover all uses. In legal departments and businesses in general, many if not most data collection disclosures predate AI and algorithm training. In those situations, disclosures must be revised, re-disclosed, and agreed to again, which can be a substantial undertaking.
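To make this concrete, the sketch below shows one way a provenance record might be captured alongside a training data source. The ProvenanceRecord class and its field names are hypothetical, not drawn from any particular standard or product; the example simply checks whether an intended use was covered by the original disclosure.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical provenance record for a single training data source.
# Field names are illustrative and not drawn from any specific standard.
@dataclass
class ProvenanceRecord:
    source_name: str            # where the data originated
    acquired_on: date           # when it was collected
    acquired_from: str          # vendor, custodian, or system of record
    owner: str                  # who owns the underlying data
    permitted_uses: list[str]   # uses covered by the disclosure or consent
    disclosure_version: str     # which disclosure the data subjects agreed to

    def permits(self, intended_use: str) -> bool:
        """Check whether an intended use (e.g., model training) was disclosed."""
        return intended_use in self.permitted_uses


record = ProvenanceRecord(
    source_name="customer support transcripts",
    acquired_on=date(2021, 6, 1),
    acquired_from="internal CRM export",
    owner="Example Co.",
    permitted_uses=["service improvement", "quality review"],
    disclosure_version="2021-03",
)

# Data collected under a pre-AI disclosure may not cover model training,
# which flags the need to revise and re-disclose before using it.
print(record.permits("model training"))  # False
```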
Garbage in, garbage out
Generative AI models learn whatever we direct them to learn, and the old garbage in, garbage out adage holds true.
Two potential challenges to avoid:
Data quality: Before developing and running an algorithm over a data collection, it is important to understand which types of data are not suitable for training. For example, non-textual data (images, audio, video, and poorly formatted or poorly OCRed documents) may not be suitable for a particular GenAI model and should be set aside (a minimal sketch of this kind of triage follows this list).
Poor training data: Biases or errors in the training data (for example, documents in eDiscovery that have been coded incorrectly) can lead to biased outputs from the model, which is a serious concern when the algorithm is run over a much larger data set. The model won’t perform well, and the risk of inaccuracies is significant.
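Below is a minimal sketch of that pre-training triage step, assuming hypothetical document metadata (file_type, ocr_confidence, and text fields) rather than any particular eDiscovery platform’s schema.

```python
# Minimal sketch of pre-training triage; the document structure
# (file_type, ocr_confidence, text) is hypothetical.
TEXT_TYPES = {"email", "docx", "pdf_text", "txt"}
MIN_OCR_CONFIDENCE = 0.90  # illustrative threshold

def is_training_capable(doc: dict) -> bool:
    """Return True if a document is suitable for text-based model training."""
    if doc["file_type"] not in TEXT_TYPES:
        return False  # images, audio, and video are set aside
    if doc.get("ocr_confidence", 1.0) < MIN_OCR_CONFIDENCE:
        return False  # poorly OCRed documents are set aside
    return bool(doc.get("text", "").strip())

documents = [
    {"file_type": "email", "text": "Please see the attached agreement."},
    {"file_type": "jpeg", "text": ""},
    {"file_type": "pdf_text", "ocr_confidence": 0.42, "text": "l0rem ipsurn"},
]

training_set = [d for d in documents if is_training_capable(d)]
excluded = [d for d in documents if not is_training_capable(d)]
print(len(training_set), "kept,", len(excluded), "set aside")
```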
The essential role of humans
It’s important to highlight the synergy between technology and human expertise. Legal professionals play a crucial role in shaping both the algorithm and the AI-generated results, ensuring that they accurately capture the nuanced aspects of the task.
To avoid “garbage in, garbage out,” an algorithm should be developed with human expertise and input and thoroughly tested through a feedback loop before it is applied across an entire document population, such as with GenAI privilege logging. Even the most AI-ready data sets will not perfectly train algorithms. A continuous feedback loop with human review and expertise, plus rigorous quality control throughout, optimizes the model and, therefore, the results.
With feedback loops in place, AI tools improve in performance and efficiency over time. As more feedback is incorporated, the need to make corrections declines in step.
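The sketch below illustrates the shape of such a feedback loop: reviewers check small samples of model output, corrections are fed back, and the model is applied to the full population only once the sampled error rate is acceptable. All of the functions and data here (classify, review_sample, the simulated accuracy curve) are hypothetical stand-ins, not an actual model or review platform.

```python
import random

random.seed(7)

# Illustrative human-in-the-loop feedback cycle for a task such as GenAI
# privilege logging. classify() and review_sample() are stand-ins for
# whatever model and review workflow a team actually uses; the documents
# and labels are fabricated for the example.

def classify(doc, corrections):
    """Pretend model call whose accuracy improves as corrections accumulate."""
    accuracy = min(0.95, 0.70 + 0.02 * len(corrections))
    if random.random() < accuracy:
        return doc["true_label"]
    return "privileged" if doc["true_label"] == "not_privileged" else "not_privileged"

def review_sample(sample, corrections):
    """Human reviewers check a sample of model output and record corrections."""
    errors = []
    for doc in sample:
        predicted = classify(doc, corrections)
        if predicted != doc["true_label"]:
            errors.append({"doc_id": doc["id"], "correct_label": doc["true_label"]})
    return errors

documents = [
    {"id": i, "true_label": "privileged" if i % 3 == 0 else "not_privileged"}
    for i in range(300)
]
corrections = []

# Iterate on small review samples until the error rate is acceptable,
# and only then run the model across the full document population.
for round_number in range(1, 6):
    sample = random.sample(documents, 30)
    errors = review_sample(sample, corrections)
    error_rate = len(errors) / len(sample)
    print(f"round {round_number}: error rate {error_rate:.0%}")
    corrections.extend(errors)
    if error_rate <= 0.05:
        break
```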
Additional AI-readiness considerations
Having effective and properly sourced data is important, but it’s also paramount to determine what technologies and partners are worthy of putting that data to use with generative AI. In future AI-Readiness papers, we plan to examine how to determine the AI readiness of law firms, legal service providers, promising software, and existing tech stacks.