Authors Sue Salesforce for Using Pirated Books to Train AI Models

Salesforce Inc. faces a new class action lawsuit accusing the software giant of secretly training its artificial intelligence (AI) models on pirated books, then scrubbing evidence of its sources once the practice drew scrutiny.

The complaint, filed in federal court in San Francisco on Wednesday, claims Salesforce used hundreds of thousands of copyrighted books to develop its XGen series of large language models (LLMs), relying on two datasets, RedPajama and The Pile.

According to the lawsuit, Books3, a collection of more than 196,000 books illegally copied from the private tracker Bibliotik, formed the backbone of the company’s AI training.

Authors E. Molly Tanzer and Jennifer Gilmore brought the suit under the Copyright Act, claiming Salesforce “continues” to infringe by maintaining copies of the datasets on its systems.

When Salesforce launched XGen in June 2023, company engineers openly linked to both datasets on GitHub. By September of that year, however, the references vanished. In their place, the company posted vague descriptions of “natural language data” culled from “publicly available sources.”

Hugging Face, the platform hosting Books3, removed the dataset the following month, citing copyright complaints.

The lawsuit alleges the cover-up extended further. Salesforce initially disclosed that it used The Pile to train CodeGen models in 2022, which later became the foundation for its commercialized Agentforce AI platform and the XGen-Sales model released in October 2024. Two months later, the company scrubbed charts and references to “RedPajama-Books” from its documentation, replacing them with vague language about a “mixture of publicly available data.”

By December 2023, Salesforce was claiming its models used a “legally compliant dataset” with no mention of RedPajama.

Legal experts caution that the plaintiffs face an uphill battle: Authors, they contend, must demonstrate actual financial harm, and not merely prove their work was used for training.

Recent court decisions have favored AI companies in comparable cases. Judges ruled that the plaintiffs suing OpenAI and Anthropic failed to prove market harm. One judge criticized Anthropic, however, for maintaining “a permanent library of pirated books.”

The authors’ complaint includes ammunition in the form of statements from Salesforce CEO Marc Benioff. In a January 2024 Bloomberg interview, Benioff acknowledged that AI companies had “ripped off” training data, saying that “all the training data has been stolen.”

The lawsuit seeks class certification covering all U.S. copyright holders whose works were used since October 2022. The authors are demanding statutory damages, destruction of the infringing datasets, disgorgement of profits, a willful infringement declaration, and attorneys’ fees.
