The thorny issue of generative AI and copyright is being raised again, this time through reports that OpenAI transcribed more than a million hours of YouTube videos to train its GPT-4 large language model (LLM), which would be a violation of the Google-owned business’ terms of service, according to YouTube’s CEO.

According to a report in the New York Times that cited unnamed sources, OpenAI President Greg Brockman and others in the company collected YouTube videos and transcribed them with OpenAI’s Whisper transcription tool to train GPT-4, which underpins the ChatGPT chatbot and other OpenAI services.

In an interview with the Wall Street Journal days before the NYT article was published, YouTube CEO Neal Mohan said that using videos on the service to train AI models would be a “clear violation” of YouTube’s terms of service. Mohan’s statement was in response to a question about whether OpenAI used YouTube videos to train Sora, its hyped video-generation model.

Mohan said he’d seen reports that the videos had been used to train Sora but that he didn’t have first-hand knowledge.

Terms of Service

“We have a clear terms of service,” he said. “When a creator uploads their hard work to our platform, they have certain expectations. One of those expectations is that the terms of service is going to be abided by. Our terms of service does allow for some use of YouTube content, like the title of a video or the channel name or the creator’s name, to be scraped because that’s how you enable the open web, for that content to show up in other search engines or what have you.”

However, the terms of service don’t allow transcripts or video bits to be downloaded. That said, Mohan said that Google – which bought YouTube in 2006 for $1.65 billion – uses the service’s content to train its own Gemini LLM, either within the confines of the terms of service or via individual contracts with creators who upload content to the platform.

In a response to The Verge, OpenAI spokesperson Lindsay Held said the AI vendor uses “numerous sources including publicly available data and partnerships for non-public data.” Held also said OpenAI was considering creating its own synthetic data.

Copyrights and Generative AI

The latest debate highlights the central importance of data to train these LLMs. The NYT article noted that by 2021 OpenAI had run through all the useful data it could find and was turning to such sources as YouTube videos and audiobooks. Demand for training data will only grow, and AI companies are beginning to make deals with companies like Shutterstock that hold vast amounts of data.

Apple this month followed other vendors like Google, Amazon and Meta in reaching an agreement with the stock photography company to license millions of photos, videos and music to train its generative AI models, with Reuters estimating the deal could amount to $25 million to $50 million.

Such agreements come as OpenAI and other AI tech companies are being accused of violating copyright laws by scraping data from protected books, movies, photos and other creative works to train their models. Less than a year after OpenAI kicked off the generative AI explosion with ChatGPT in late November 2022, the company was facing lawsuits in federal court filed by authors like John Grisham and George RR Martin of “Game of Thrones” fame and others saying the company was using their copyrighted works without permission.

NYT Among the Plaintiffs

Among those suing OpenAI and Microsoft – which has invested more than $10 billion in OpenAI – is the New York Times, which filed a suit in December accusing the vendors of violating its intellectual property by training ChatGPT on millions of its news articles. The suit adds that the companies’ use of its content to create competitive AI tools is a threat to the Times’ ability to deliver its services.

“Times journalism is the work of thousands of journalists, whose employment costs hundreds of millions of dollars per year,” the Times said in its complaint. “Defendants have effectively avoided spending the billions of dollars that The Times invested in creating that work by taking it without permission or compensation.”

In a blog post the following month, OpenAI said the lawsuit was “without merit” and argued that training AI models using publicly available material is fair use and that they “view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”

More recently, the estate of George Carlin settled a lawsuit against two podcasters who used generative AI to create and air a one-hour special featuring the AI-generated voice and style of the late comedian. At the same time, federal agencies like the Federal Trade Commission are trying to develop guardrails to enforce copyright laws in the new era of generative AI.

Reports circulated earlier this year that OpenAI was negotiating with publishers to license data for training purposes, though an attorney who represents OpenAI in copyright litigation earlier this month said licensing all copyrighted material used to train models would be impossible given the amount of data the models need.