
The head of The Linux Foundation today called for the creation of open data sets that providers could use to train artificial intelligence (AI) models.
Speaking today at the KubeCon + CloudNativeCon Europe conference, Jim Zemlin, executive director of the Linux Foundation, said that as more data is made available only via licensing fees, there is a growing need for high-quality data sets that are freely available to organizations building and deploying AI models.
In fact, the Overture Maps Foundation, which operates under the auspices of the Linux Foundation and provides open data for builders of mapping and geospatial applications, is a first step toward achieving that larger goal, says Zemlin.
OpenAI and others have been using data collected from across the Web on the assumption that fair use laws would eliminate any need to license much of the data used to train AI models. That issue is now at the root of several pending lawsuits but, in the meantime, more data is being placed behind paywalls to prevent builders of AI models from using it without a license. “A lot of data is disappearing behind walls,” notes Zemlin.
The Linux Foundation is not against organizations licensing commercial data, but there is also a need for open data sets that can be accessed at no cost, he adds.
It’s not clear what would financially motivate organizations to provide free data for training AI models, but the Linux Foundation is exploring the possibility of setting up an independent body that could serve as a steward for such open data sets, says Zemlin.
In the meantime, more organizations are moving toward training large language models (LLMs) on smaller sets of vetted data to increase the overall accuracy of the output generated. Over time, the AI models created this way would provide the foundation for building agents to automate specific tasks. The hope is that multiple agents could then be orchestrated to automate those tasks more reliably.
At the same time, the pace at which more general-purpose LLMs, such as those behind ChatGPT, can be trained will slow as more data becomes inaccessible unless specific licensing agreements are in place. It’s not clear to what degree all those licensing agreements might conspire to increase the total cost of building and maintaining a general-purpose LLM.
One way or another, the total number of LLMs being either accessed via an application programming interface (API) or embedded within an application is about to increase significantly. The reliability of those LLMs, however, will be determined by the quality of the data used to train them, with many content creators potentially being paid to produce material that helps train AI models, which in turn generate additional content.
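For developers, accessing an LLM via an API typically amounts to little more than an authenticated HTTP request. The minimal sketch below assumes an OpenAI-compatible chat completions endpoint; the endpoint URL, model name and environment variable are illustrative placeholders rather than a recommendation of any particular provider.

```python
# Minimal sketch: calling a hosted LLM through an OpenAI-compatible HTTP API.
# The endpoint, model name and OPENAI_API_KEY variable are illustrative placeholders.
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"  # assumed OpenAI-compatible endpoint

def ask_llm(prompt: str) -> str:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-mini",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=30,
    )
    response.raise_for_status()
    # The generated reply sits in the first choice of the returned completion.
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(ask_llm("Summarize the case for open AI training data in one sentence."))
```

Whether that request is answered well, of course, depends entirely on the data the model behind the endpoint was trained on.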
For now, the one profession generating more revenue than ever is the legal one, as lawyers litigate cases that, at this juncture, seem all but destined to be resolved eventually by one or more Supreme Court rulings.