Adobe is facing a proposed class-action lawsuit alleging that it misused copyrighted material to train its artificial intelligence models. The complaint claims the software giant, known for AI services such as Firefly, used pirated books to develop its SlimLM program, adding another legal challenge to the ongoing disputes over AI training data.
Filed on behalf of Oregon author Elizabeth Lyon, the lawsuit targets SlimLM, Adobe's small language model program. Lyon asserts that pirated versions of numerous books, including her own copyrighted works, were used as training data for SlimLM.
Adobe describes SlimLM as a series of small language models designed for "document assistance tasks on mobile devices." The company stated that SlimLM was pre-trained on SlimPajama-627B, a "deduplicated, multi-corpora, open-source dataset" released by Cerebras in June 2023. However, Lyon's lawsuit, initially reported by Reuters, contends that SlimPajama-627B is a derivative of the RedPajama dataset, which itself incorporated the controversial Books3 collection. This, according to the complaint, means her copyrighted non-fiction guidebooks were part of the training data used for SlimLM.
The "Books3" dataset, a collection of 191,000 books frequently used to train generative AI systems, has been a recurring flashpoint for copyright disputes within the tech industry. The RedPajama dataset, from which SlimPajama is allegedly derived, has also been named in multiple legal actions. In September, a lawsuit accused Apple of using copyrighted material, including data from RedPajama, to train its Apple Intelligence model without consent or compensation. In October, Salesforce was sued over similar allegations regarding its use of RedPajama for training.
Such lawsuits have become increasingly common across the AI industry, where models are often trained on massive datasets that allegedly contain pirated or copyrighted content. In one notable case, Anthropic agreed in September to pay $1.5 billion to a number of authors who had sued the company for using pirated versions of their work to train its Claude chatbot. The settlement underscored the legal risk that companies take on when the provenance of their AI training data is in dispute.








