Mike Vizard: Hello and welcome to the latest edition of the Techstrong.ai video series. I’m your host, Mike Vizard. Today we’re with Stefano Maffulli, who’s the executive director for the Open Source Initiative, and we’re talking about an effort here to define what it means to be open in the age of AI. There’s a lot of conversations going on about this and certainly not a lot of clarity. Stefano, welcome to the show.
Stefano Maffulli: Thank you. Thank you, Michael. Happy to be here.
Mike Vizard: So what exactly is the challenge we’re facing here? Because there are all these AI models out there. Some of them are outright black boxes. Some of them are kind of half open and some of them are completely open, but how do we know what open is?
Stefano Maffulli: That’s exactly it. There is no clear understanding, and every actor in the space is rushing to have some sort of definition and a shared agreement. There is even a law proposal coming out of the European Union that defines what is allowed in the AI and machine learning space, and there is a special provision for free and open source AI, but there is no definition, no clarity. So we are trying to help the community come together and share the values and principles they want to see represented in this space, the same way that open source has been defined, and the same way we maintain that definition of what open source means for the community. We want the same thing for the AI and machine learning space.
Mike Vizard: What is the core issue? Because it seems like some people are building AI models on top of large language models, but I can’t really see what’s going on inside that LLM or what data was used to train it. So is that too opaque? Do they need to tell me exactly how they trained the model so I know that it’s open source, or are they just saying, “Well, we’re making it available to anybody, but we’re not telling you how we made it?”
Stefano Maffulli: Yeah, well, that’s partially what’s happening, definitely. It’s a very new and complex space. The very first thing that was clear to us at the Open Source Initiative is that the legal framework for machine learning is not the same as for software. In software we dealt with basically only copyright, a little bit of patents here and there, but it’s fairly simple. With machine learning and AI there is data, and you mentioned it more than once. Data has a completely different aspect to it. There is some copyright aspect, but there are also privacy and other regulations that affect how data is collected, assembled, distributed, shared, copied and collaborated on. So one of the first challenges we have is to identify what we want to do in the data space.
How do we want to enable collaboration and innovation on that front? The other thing is that for software systems we’ve had the same principles apply to the whole stack, sometimes from the hardware up through the whole operating system. We can easily identify the freedoms for developers to run software, modify it, make copies and share them, and apply them at the kernel level, the driver level, the compiler level, the application level. It’s all the same. When you look at a machine learning system, which is made from datasets, models, weights, and the software for training, inference and testing, you don’t have the same framework. We need to understand what open means in these different contexts and apply these concepts to enable freedom and self-sovereignty, if you want, for developers and users.
Mike Vizard: Is there any way to maintain some sort of sense of central control, because one of the issues you might run into is people will take an LLM, they’ll train it, they’ll add their data to it, and for all intents and purposes, at least to me, it looks like a fork. How do I keep all the benefits of the LLM community together if people are essentially taking every instance of an open source LLM and customizing it to their heart’s content?
Stefano Maffulli: This is somewhat what happens in software, too. There are forks, and ultimately the market and the users decide which ones are more appropriate for wider use cases, and momentum builds behind them. Look at the Kubernetes community: wide and large, with a lot of forks if you want, but most of the collaboration goes in the same direction. With the AI space being so brand new, honestly, I can’t predict where the users and the communities will go. They will take it wherever they want. The most important thing is to have a common understanding of what we need to do in order to achieve the same level of innovation without having to constantly reinvent the wheel, and to enable small players to enter the space rather than having only the large established conglomerates rule it.
We need ways for society to control and understand what’s happening. You mentioned the black box issue. We also need a clear level of transparency in this space.
Mike Vizard: Will there be some expectations about giving back to the community? I think one of the issues we have with open source in general these days is there’s a lot more people consuming it than contributing to it, and that creates all kinds of interesting security and financial issues down the road. How can we maybe learn from some of the things that we have seen in the past and not repeat them here?
Stefano Maffulli: Very interesting question. Honestly, I’m not sure I fully buy into that narrative that large corporations are taking without giving back. I would love to see more analysis and science behind it rather than just gut feelings. But in general, I agree with you that there must be a way to maintain reciprocity: I give you something, and I expect that you do the same with others down the road. This concept of turning copyright on its head was established 40 years ago, actually; it’s going to be the anniversary of the GNU operating system. That copyleft concept has been forgotten in the past 10 to 15 years, I think, and I do believe, I’m expecting, that the community will come up with a solution in the open machine learning space as well, to preserve the freedoms and the self-sovereignty that users have downstream.
Mike Vizard: Do you think ultimately we’ll see multiple types of licenses that define different use cases, and that’s kind of how we’ll eventually wind up navigating all this, or can there be one license for all?
Stefano Maffulli: It’s hard to predict, but I’m assuming there’s going to be a variety of approaches, each privileging one or another business case or offering different incentives. I’m expecting different things specifically because there is no single legal framework in this area, so we’ll have to navigate much more difficult waters.
Mike Vizard: If I go back in history and look at software, early on in any given segment proprietary offerings dominated and eventually gave way to open source platforms as the mystery around the technology faded. Are we going to see the same thing here with AI models? I mean, we’re all amazed by what they do today, but upon further review, it looks like pretty straightforward data science.
Stefano Maffulli: Indeed. Honestly, again, it’s hard to predict, but we are already seeing an emergence of new models that are released with more permissions rather than obstacles. I think there’s going to be a push toward more openness, if only because of competition. It’s one way of accelerating development and getting new products to market quicker, so I think there is an incentive for more openness.
Mike Vizard: So you don’t think we should be worried about a handful of companies dominating the whole category and having a replay of that Web 2.0 experience that we’re all far too familiar with?
Stefano Maffulli: I’m not ready to lift the… I’ll be optimistic on that front. There is still a risk, because there are three main areas that favor large corporations. One is the accumulation of data, and this is probably the most important one. The accumulation of data is right now in the hands of very few corporations, and the lawsuits that many small authors, editors and publishers are bringing against these large corporations seem to have unintended consequences. People are complaining that LLaMA, for example, Facebook or Meta’s language model, has been trained on a dataset that includes a collection of books that are still under copyright. So technically they have copied material that is still under copyright and that they do not have a license for. But the thing is, Meta could have licensed those books. They have the power and the money to go to the publishers, and with a decent amount of money they can get the license.
Now, the same dataset that was used for LLaMA is also used by smaller groups, nonprofits like EleutherAI, and also researchers. These researchers, without access to that dataset containing copyrighted material, will not have the money to buy a license or get permission to train and run their models. So they will be put at a disadvantage. We are in a new space, and the European Union and Japan, for example, have moved in a different direction: they have created a new right inside copyright law, a right to data mining. In Europe, for example, the right to data mining is granted without having to request permission for nonprofit research.
So the regulation is moving in other parts of the world in a different direction than the lawsuits in the United States. We’ll see which approach will prevail.
Mike Vizard: So ultimately, do you think that the AI tail may wind up wagging the proverbial dog because we’ll get all kinds of changes to how content is going to be used and data is going to be used and managed and owned, and we’re going to have a deeper conversation than just about the AI model itself?
Stefano Maffulli: I think so. It’s already happening. The Open Source Initiative has assembled a very large, wide and diverse group of people to help craft this definition of openness in AI, and this group comes from a wide variety of experiences. It ranges from the Mozilla Foundation, Creative Commons and the Internet Archive, organizations that work in the open, to OSS Capital, a venture capital firm, and large and small corporations in the AI space: GitHub, GitLab, Open Weaver, Sourcegraph, organizations already working in the generative AI space, for example. We are already noticing how much complexity there is in this space, with too many laws. With copyright and copyleft we had it easy, because for copyright there is an international agreement, the Berne Convention, and pretty much every copyright law in the world is based on the Berne Convention. The same doesn’t happen with data. Like I said, just think about medical records and how they’re treated differently in different parts of the world.
Mike Vizard: So ultimately, what’s your best advice to organizations? We’ve already seen some of them open an office for managing their open source software relationships around the world, and they’re partnering with different organizations. Does the same thing need to play out here? And who’s going to lead that effort within an organization?
Stefano Maffulli: Most likely the chief data officers, and probably a lot more in the privacy space especially. We’re going to see the need to increase collaboration between privacy officers, CTOs, CIOs and chief security officers; they will have to talk a lot more, because the implications, as I mentioned, are very deep. Once you start putting data inside a model and then release that model, there are techniques to exfiltrate private data from it. So there are levels of regulation that need to evolve, technology that needs to evolve, and threat models that need to evolve and adapt. Ultimately, I think we need to start thinking as a collective society about the benefits, about what we want the technology to do for us, rather than being pushed by large corporations with their own agendas.
Mike Vizard: All right, folks, you heard it here. There are agendas. They are competing and there’s probably still more unknown than known at this point. Stefano, thanks for being on the show.
Stefano Maffulli: Thank you.
Mike Vizard: And thank you all for watching the latest edition of the Techstrong.ai video series. You can find this episode and others on our website, Techstrong.ai. Until then, we’ll see you all next time.