AI news

The copyright infringement lawsuits against AI companies continue to pile up as authors, visual artists and other creators balk at their works being used to train the large language models (LLMs) that are foundational to popular generative AI tools like OpenAI’s ChatGPT and Google’s Gemini.

Most recently, a photographer and three cartoonists filed a lawsuit last week against Google parent company Alphabet, accusing it of using their copyrighted works without permission to train its text-to-image diffusion model Imagen. They claim their works were among billions of images that Google used to train the model.

Days later, MediaNews Group, which owns such newspapers as the Chicago Tribune, the Orlando Sentinel and the New York Daily News, filed a similar suit accusing OpenAI and Microsoft of illegally using millions of their news articles to train multiple AI tools, including ChatGPT and Microsoft’s Copilot.

MediaNews Group is only the latest news organization to sue the two companies, following the New York Times’ lawsuit in late December alleging copyright infringement. According to Reuters, other news sites, such as The Intercept and Raw Story, have filed similar complaints.

“Although OpenAI purported at one time to be a non-profit organization, its recent $90 billion valuation underscores how that is no longer the case,” MediaNews Group wrote in its 98-page complaint. “ChatGPT, along with Microsoft Copilot (formerly known as Bing Chat) has also added hundreds of billions of dollars to Microsoft’s market value. Defendants have created those GenAI products in violation of the law by using important journalism created by the Publishers’ newspapers without any compensation.”

A Brave New World

The lawsuits are only the latest chapter in an ongoing debate about generative AI and copyright law that will continue to play out until standards and norms are established. The LLMs that underpin these popular generative AI products need massive amounts of training data, much of which is pulled from the internet or obtained from firms that compile such datasets for AI companies to use.

OpenAI, Microsoft – which has invested more than $10 billion in OpenAI – Google, Meta and others have argued that the data their LLMs are trained on is publicly available, with some also noting that without such huge datasets, tools like ChatGPT would be impossible to create.

However, news sites, publishing companies, authors, artists and others have said that their work – their words and images – can’t be used this way without their permission and that the AI companies are profiting off their work while the artists receive nothing.

The plaintiffs’ attorneys in the case against Alphabet, Joseph Saveri and Matthew Butterick, told Reuters in a statement that the case was “another instance of a multi-trillion-dollar tech company choosing to train a commercial AI product on the copyrighted works of others without consent, credit or compensation.”

Courts Become Proving Grounds

Many of the battles right now are playing out in federal courts. The lawsuit against Google by photographer Jingna Zhang and cartoonists Sarah Andersen, Hope Larson and Jessica Fink was filed in Northern California, while MediaNews Group filed its complaint in the Southern District of New York.

That said, the federal government also is trying to wrap its arms around the issue, with the Federal Trade Commission (FTC) saying in November 2023 that the issue involves not only copyright law but also competition and consumer protection. The U.S. Copyright Office also is studying the issue.

In their 16-page lawsuit against Alphabet, the artists argue not only that Google is illegally using their works to train Imagen, but also that the model becomes able to produce similar works.

“During training of the model, the training images in the dataset are directly copied in full and then completely ingested by the model, meaning that protected expression from every training image enters the model,” they wrote. “As it copies and ingests billions of training images, the model progressively develops the ability to generate outputs that mimic the protected expression copied from the dataset. … These copyrighted training images were copied multiple times by Google during the training process for Imagen. Because Imagen contains weights that represent a transformation of the protected expression in the training dataset, Imagen is itself an infringing derivative work.”

They added that, “Alphabet, as the corporate parent of Google, also commercially benefits from these acts of massive copyright infringement.”

The Debate Over Content

MediaNews Group argued that AI companies will pay for most of the tools they need to create and run their generative AI products, including computers, expensive specialized chips like Nvidia GPUs, electricity, programmers and facilities to house and run the systems. However, they also need high-quality content, the company said, noting that OpenAI CEO Sam Altman had testified to the House of Lords in the UK that without copyrighted material, the company’s products wouldn’t be commercially viable.

“Despite admitting that they need copyrighted content to produce a commercially viable GenAI product, Defendants contend that they can fuel the creation and operation of these products with the Publishers’ content without permission and without paying for the privilege,” the company stated in the lawsuit. “They are wrong on both counts, as this lawsuit will prove.”

In addition, MediaNews Group pointed out that the internet has already damaged newspapers’ business model by using their content and siphoning off ad revenue. The surviving newspapers have spent billions of dollars to pay reporters and editors to write and package the news for both print and online. The AI companies are then taking that content with impunity to create generative AI tools that undermine the publishing companies.

“This issue is not just a business problem for a handful of newspapers or the newspaper industry at large. It is a critical issue for civic life in America,” the company wrote. “Indeed, local news is the bedrock of democracy and its continued existence is put at risk by Defendants’ actions.”