The Open Source Initiative (OSI), a non-profit foundation that advocates on behalf of the open source community, is calling for a clear and defensible definition of how the term "open source" will be applied to artificial intelligence (AI).
Much as exists for other types of software, there needs to be a shared set of principles to define permissionless methods of collaboration between AI practitioners, says Stefano Maffulli, executive director of the OSI.
That definition should be based on the Open Source Definition that the OSI previously defined to ensure licensing terms don’t include stipulations that limit the use of open source code, he adds.
There is no doubt that most AI models are being built using open source components, but there is not enough transparency, says Maffulli. That creates the possibility that licensing issues which could complicate deploying AI models at scale might not be readily apparent, he notes.
More troubling still, the rights to the data used to train an AI model are often unclear because in many cases that data is proprietary, adds Maffulli. As a result, when copies of AI models are made, it is not currently clear whether organizations or developers have rights to that underlying data, he notes.
For example, large language models (LLMs) may have been trained using data from books that are still under copyright. As other LLMs are created using that LLM, all of that data winds up in yet another model. Before too long, multiple LLMs will have been used to create thousands of AI models based on data that was never properly licensed.
That issue becomes even more complicated when medical records that include personal data are used to train an AI model. “There is no clarity,” says Maffulli. “It’s a very new and complex space.”
Microsoft is trying to address some of these concerns by committing to indemnify developers that use its Copilot service to write code against copyright infringement claims, but generative AI tools are only one use case for AI models. General purpose AI models, in contrast, have been trained using data collected from multiple sources without much regard for copyright concerns.
The concern is that it might take years before copyright issues are resolved, so organizations that are using AI models may be assuming a level of risk they do not fully appreciate. In the absence of any effort to govern how AI models are trained and employed, that risk only grows every time an employee uses an AI model to create some type of content.
It’s not clear how rights to data will be respected in the age of AI, but there are already multiple court cases involving copyright violations. At the same time, at least one court has ruled that artwork created using generative AI can’t be copyrighted, but that case doesn’t address any issues arising from the ownership of the original artwork used to train the AI model.
The one thing that is clear is that usage of AI models has already moved well beyond established case law, so the potential for legal jeopardy is now a very real concern.