Troveo this week announced it has expanded the types of licensed content it makes available to organizations that train artificial intelligence (AI) models.

In addition to video, the company now makes five other types of content available: audio, text, agentic workflows, gameplay data and data collected from robots.

At a time when most public data has already been consumed by large language models (LLMs), builders of AI models are now looking for a more diverse range of proprietary content to license, says Troveo CEO Marty Pesis. In the absence of being able to readily access that data, the overall pace at which the next generation of AI models can be built is slowing, he adds. “Data is now a huge bottleneck,” says Pesis.

As a clearinghouse for providing that data, Troveo works with publishers and other providers of content to make proprietary data available to builders of AI models. Thus far, it has paid out more than $20 million to thousands of content owners after building a library of eight million hours of licensed video, says Pesis.

The gap that Troveo is filling exists because builders of AI models typically don’t have a relationship with the countless providers of content that now exist on the Web. Rather than requiring builders of AI models to negotiate with each content provider, Troveo enables them to license data without fear of being sued later on for copyright infringement.

Additionally, Troveo works with content providers to clean data sets before they are made available to builders of AI models. That approach provides the added benefit of encouraging providers of content to work with Troveo to create the material that providers of AI models are most interested in using for training, notes Pesis.

In the meantime, there is now a lot more data being generated using AI. Much of that data is ultimately used to train the next generation of an AI model. However, builders of AI models are looking to license proprietary data to create extensions to so-called frontier models that they can optimize for specific use cases, says Pesis. The data being sought by an enterprise organization, for example, is much more narrowly focused, he adds.

Hopefully, there will come a day when accessing proprietary data will be a lot less contentious than it has previously been. Providers of frontier models, in many cases, have settled several copyright infringement lawsuits, while at the same time striking licensing deals with publishing companies and film studios.

Regardless of how the data is obtained, it’s clear there is now, in general, a greater appreciation for the data engineering workflows that need to be set up to effectively train AI models. The challenge is that many of the providers of the content that might be used to train an AI model don’t typically have much of the expertise needed to ensure that the data they might make available is actually worth being consumed by an AI model.