Meta Built AI Empire on Pirated Books, Publishers Allege

Key Points

• Five publishers (Macmillan, McGraw Hill, Elsevier, Hachette, Cengage) plus Scott Turow sue Meta
• Llama models allegedly trained on pirated books from LibGen, Anna's Archive, Sci-Hub
• Publishers call it one of the most massive infringements in history
• Lawsuit targets Meta's alleged knowing use of pirate sources
• Outcome could set precedent for AI training data licensing

References (1)

[1] Five publishers sue Meta over AI training using pirated books — The Verge AI

Meta positioned Llama as the "open" alternative to closed AI systems, but a lawsuit filed Tuesday alleges the company built that openness on pirated books it never licensed. Five major publishers and author Scott Turow are suing Meta, accusing the company of engaging in "one of the most massive infringements of copyrighted materials in history" by training its Llama models on works taken from LibGen, Anna's Archive, and Sci-Hub. The plaintiffs claim Meta "repeatedly copied" their books and journal articles without permission, knowingly sourcing infringing material from notorious pirate repositories rather than negotiating licenses.

The publishers bringing the case (Macmillan, McGraw Hill, Elsevier, Hachette, and Cengage) represent a significant portion of the academic and trade publishing industry. Their counsel is not mincing words: in the plaintiffs' telling, this is not a marginal infringement case but a systematic, industrial-scale theft of intellectual property. Author Scott Turow, known for "Presumed Innocent," joins as an individual plaintiff, giving the lawsuit a face beyond corporate balance sheets.

The stakes extend far beyond this single case. Courts have not yet established a clear precedent for whether training AI models on copyrighted text constitutes fair use, a legal gray area that much of the AI industry has relied on. Meta will likely argue that ingesting books falls under transformative use, a defense other AI developers, including Stability AI, the maker of Stable Diffusion, have raised in their own copyright battles. But the publishers' complaint specifically attacks the "knowingly" element: in their account, Meta did not stumble into pirated data. The company allegedly went directly to repositories known for hosting infringing copies on a massive scale, effectively outsourcing its data acquisition to piracy operations.

This matters because it shifts the legal terrain. Fair use arguments become harder to sustain when a company deliberately sourced material it knew was stolen. The publishers are arguing that Meta's choice of sources demonstrates willful infringement—a claim that could expose the company to statutory damages that would dwarf any licensing fee.

Meta declined to comment on pending litigation, but the company has previously maintained that AI training constitutes fair use. The AI industry has largely operated under an assumption that training data is a solved legal question, or at least a risk worth taking. This lawsuit challenges that assumption directly.

What happens next will likely depend on how the court defines "transformative use" and whether Meta's knowledge of the sources' infringing nature matters. If the publishers win, every AI company that scraped unlicensed data could face similar exposure. If Meta wins, the ruling could be read as legitimizing training on pirated data across the industry. The outcome may determine whether AI companies must now pay for the books that made them possible, or whether the courts will grant them continued free access to human knowledge.