Synopsis: In this episode of the AI Leadership Insights video series, Mike Vizard speaks with Matthieu Jonglez, VP of technology for application and data platforms at Progress, about the difference between retrieval-augmented generation and fine-tuning.
Mike Vizard: Hello and welcome to the latest edition of the Techstrong AI Leadership video series. I’m your host Mike Vizard. Today we’re with Matthieu Jonglez, who is vice president of technology for application and data platforms at Progress. And we’re talking about the difference between retrieval-augmented generation, known as RAG, and fine-tuning. Matthieu, welcome to the show.
Matthieu Jonglez: Thank you.
Mike Vizard: Walk me through this a little bit. People can fine-tune an existing large language model, or they can take their data and use some sort of vector database or some other technique and customize it that way. What’s the difference between those approaches, and when might I use one versus the other?
Matthieu Jonglez: Extremely [inaudible 00:00:55] question indeed. There are several considerations that you need to take into account. One is, of course, the use case, and whether you need the precision of a fine-tuned approach or whether you need the readiness and availability of data. Because we’ve seen it in all of the OpenAI models and Mistral, in all of those models: training takes time. And training is complicated and requires a fair amount of investment. If you need up-to-the-minute data, up-to-the-second data in some use cases, you can’t rely on training. You need to inject data at runtime for it to be interpreted by your model. So that’s already a very big consideration. That said, fine-tuning has some very relevant and verifiable use cases for specialized data, engineering content and so on and so forth. We can explore the use cases if you want, but I think another very important aspect in the decision-making process is security.
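To picture the runtime-injection pattern Jonglez describes, here is a minimal sketch of a RAG loop. The tiny in-memory corpus, the toy retriever and the `llm_complete` stub are illustrative assumptions, not any particular product’s API.

```python
# Minimal sketch of retrieval-augmented generation: fresh records are
# retrieved at query time and injected into the prompt, so the model's
# weights never need retraining. The corpus and llm_complete() are
# illustrative placeholders, not a real product API.

CORPUS = [
    {"id": 1, "text": "Q3 revenue was reported this morning at $42M."},
    {"id": 2, "text": "The 2022 employee handbook covers travel policy."},
]

def retrieve(question: str, top_k: int = 2) -> list[dict]:
    # Toy retriever: rank documents by word overlap with the question.
    words = set(question.lower().split())
    return sorted(
        CORPUS,
        key=lambda d: len(words & set(d["text"].lower().split())),
        reverse=True,
    )[:top_k]

def llm_complete(prompt: str) -> str:
    raise NotImplementedError("stand-in for whatever model API you use")

def answer_with_rag(question: str) -> str:
    # The up-to-the-minute data rides along in the prompt at runtime.
    context = "\n".join(d["text"] for d in retrieve(question))
    return llm_complete(f"Context:\n{context}\n\nQuestion: {question}")
```

Fine-tuning, by contrast, bakes records into the model’s weights, which is exactly why it cannot keep up with data that changes by the minute.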
Is the information that you’re going to use for your training, for your fine-tuning, readily available to all of your employees? Is it knowledge that should be accessed by everyone, more so when it’s for general public consumption? I think there are very key considerations when using private and enterprise data as to which aspects of the data can be consumed by different user groups. So you need the ability, in a RAG-type environment, to apply real-time data security. Which document can be seen by which audience? Which field of that document, or of that record, can be seen? Should it be redacted? All of those are really important decisions that you need to make.
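As a concrete illustration of that real-time security, here is a hedged sketch of record-level and field-level filtering applied to retrieved documents before they ever reach the prompt. The document schema and role names are assumptions for illustration, not a specific Progress API.

```python
# Sketch of applying data security at retrieval time in a RAG pipeline:
# documents the caller may not see are dropped entirely, and restricted
# fields are redacted. Schema and role names are illustrative only.

REDACTED = "[REDACTED]"

def authorize(documents: list[dict], user_roles: set[str]) -> list[dict]:
    visible = []
    for doc in documents:
        # Record-level rule: which audience can see this document at all?
        if not doc["allowed_roles"] & user_roles:
            continue
        # Field-level rule: redact fields this audience may not read.
        fields = {
            name: f["value"] if f["allowed_roles"] & user_roles else REDACTED
            for name, f in doc["fields"].items()
        }
        visible.append({"id": doc["id"], "fields": fields})
    return visible

doc = {
    "id": 7,
    "allowed_roles": {"hr", "finance"},
    "fields": {
        "name":   {"value": "Alice", "allowed_roles": {"hr", "finance"}},
        "salary": {"value": 95000, "allowed_roles": {"finance"}},
    },
}
print(authorize([doc], {"hr"}))  # salary comes back as "[REDACTED]"
```

Because the filtering happens per query, one knowledge base can serve different audiences; a fine-tuned model offers no equivalent lever once the data is in its weights.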
Mike Vizard: I also get the sense that fine-tuning requires a lot more skill than using a vector database to essentially extend an existing LLM.
Matthieu Jonglez: It definitely does. In order to fine-tune a model, you need properly trained AI practitioners, you need a machine learning operations team, and you need to try different approaches, fail a few times and learn along the way. RAG is definitely a lot safer. But let me correct you on one thing here: twice in a row you’ve suggested RAG is vector-based. Actually, we’re seeing multi-phase approaches in RAG strategies come up more and more, with much better results as well. It’s a combination of search and vector calculation, vector similarity. In our product stack, in the POCs we’ve done, in the processes we’ve put in place and the product features we’re building, we’re really combining those two things: a natural, semantic-driven search, knowledge graph search initially, to deal with precision, and vector similarity to deal with recall. There are interesting combinations here that will lead to better performance.
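One common way to fuse the two phases he describes (a generic pattern, not necessarily Progress’s implementation) is reciprocal rank fusion over a precise search ranking and a broad vector ranking. Here `keyword_search` and `vector_search` are hypothetical stand-ins for whatever backends are in play.

```python
# Generic sketch of multi-phase retrieval: a precise lexical or
# knowledge-graph search plus a high-recall vector-similarity search,
# merged with reciprocal rank fusion (RRF). keyword_search() and
# vector_search() are hypothetical stand-ins for real backends.

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Documents ranked highly in either phase accumulate weight.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query: str, top_k: int = 5) -> list[str]:
    precise = keyword_search(query, limit=20)  # precision: exact/graph matches
    broad = vector_search(query, limit=20)     # recall: embedding similarity
    return reciprocal_rank_fusion([precise, broad])[:top_k]
```

The constant `k` damps the influence of lower-ranked documents; 60 is the value commonly used in the RRF literature.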
Mike Vizard: Well, expound on that a little bit, because some folks are saying, I need a dedicated vector database, and other folks are adding a vector capability to their existing database. And now you’re describing search as another mechanism for that. Is this some sort of continuum, or do I not need one approach if I have the other?
Matthieu Jonglez: It’s a form of continuum. And FinOps is definitely a really big topic in cloud environments these days. If you look at the economics of it, implementing this as a dedicated vector database means you end up creating a new data silo. So, you have your information systems, you might have a search engine or a data hub or a data fabric of some type, ideally a knowledge graph even, then you’ve got your vector database and you’ve got your LangChain, your retrieval pipeline, et cetera, et cetera. So, different systems, information being cloned, different security profiles: a very complex environment to manage and architect.
The approach we’ve taken at Progress, given our MarkLogic, Semaphore and Corticon products, is to combine all of those things to offer a data fabric implementation, a data fabric architecture, with a knowledge graph and search capabilities. We are investing in vector capabilities. And the idea is to consider AI, and generative AI in particular, as one additional distribution channel for the data that we manage in a data fabric. If you consider the different governance patterns and implementations, data mesh versus data fabric, distributed data governance, et cetera, that approach makes a lot of sense for an enterprise to grow. A dedicated vector database, bringing data into that context for a lab exercise or for small projects, makes sense. If you want to extend that to the entire enterprise data ecosystem, you need something that scales, something that brings in security, something that is economically viable as well.
Mike Vizard: Who’s in charge of this? And I ask the question because data scientists are hard to find, but as we move to these different techniques for exposing data to the LLM, is this becoming more of a function that can be managed by developers, database administrators and DevOps teams working together? Or do I still need a data scientist for that particular task?
Matthieu Jonglez: Very good question. And rather than talking about developers and data scientists, et cetera, I want to reframe that in terms of data stewards and data citizens. The reality is, given the distributed governance we want to put in place and given the distributed nature of data consumption, we are seeing actors at both ends of the spectrum who care very much about the data, about the interpretation of the data, and we are seeing the data ecosystem move from an application-centric type of architecture to a data-centric architecture.
So, using the data only where it lives is no longer really viable, because you need to think about the knowledge graph, think about an object model, essentially, that spans different systems. Yet the knowledge about what the data means, wherever it’s produced, lies with the data steward of that system. We can push their interpretation into a data fabric. Likewise, if you’re building a generative AI product, or in fact any digital application, you don’t want to spend your time reverse-engineering the data coming from the different systems. You need some assurance of the quality of the data you’re inheriting from a data hub, a data mesh, a data fabric. So that interplay between data consumers and data producers is really changing, and has been changing for probably the last two or three years, since the costs and maintenance effort associated with application-centric design started to really explode.
Mike Vizard: Are we getting more mature in the way we manage data? Because historically, a lot of organizations were not so good at data management, and many of them would not have gotten a Good Housekeeping seal of approval. There’s lots of conflicting data, shall we say? Some of it might even be outright wrong. Is AI forcing us to go back and look at all of those processes?
Matthieu Jonglez: It’s completely changing the way many organizations think about their data, in very good ways, in very positive ways. Data management and data governance were always second-class citizens in many organizations. Now we have practitioners on the ground who are really, really defending their trade, and they have been our advocates for years. But they were always struggling to get their voices heard.
The interesting thing about AI is the fear of missing out among many decision makers. They can’t think about a project without AI these days. And everyone agrees that to get results out of AI systems, you need good data. To get good quality data, you need good governance. And all of those governance projects, all of those data management projects, are getting a lot of scrutiny, a lot of interest, a lot of funding, because of those AI initiatives coming into play. So it’s a really interesting time for those practitioners.
And coming back to your previous question, where does that leave our data scientists? Somewhere in the middle, between our data citizens and our data stewards, because those two groups are really good at consuming data and producing data, but who’s interpreting the data? That’s a very important role here, as a matchmaker and broker of knowledge. I think we’re seeing the role of the data scientist become essentially a knowledge creation role.
Mike Vizard: What titles are associated with being a data steward? Because we see things like chief data officers, we see data engineers, we see database administrators, we’ve seen all kinds of things. So is that changing and if so, how? Or is it just going to always be something of a mismatch?
Matthieu Jonglez: It’s always going to be something of a mismatch. And in reality, the majority of data stewards are actually business analysts, people who understand the data, understand the application, understand how the data came about and can explain that.
Mike Vizard: What will it take, therefore, to democratize AI? Because one of the issues going forward is that people ultimately want to interact with things without so many specialists in between. So can we get to some layer of abstraction around all of this that keeps the quality of the data the same but makes the whole thing more accessible to consumers?
Matthieu Jonglez: We’re already there to a large degree. There are very interesting tools for data quality management, et cetera, et cetera. In the Progress ecosystem, we’ve got a number of tools, MarkLogic, Corticon, et cetera, that are playing important roles in that regard, and have been playing those roles for years. What’s actually changing in the way we see data quality being talked about is the question of data readiness for AI projects, because it’s subtly different. It potentially relies more on knowledge graphs, more on semantic capabilities, than previous generations of search projects or digital initiatives [inaudible 00:12:19].
And that’s actually really interesting, because the complexity, all the challenges, I suppose, of managing that data are not actually linked to AI per se. They’re linked to properly understanding your data, properly leveraging your data. A lot of organizations have been complacent about unstructured or semi-structured data, doing analytics on relational data, which is only a small fraction of the information ecosystem. So, leveraging AI often means leveraging all of your data, and that forces them to revisit: hold on, how much of my data am I really using, and really getting value from?
Mike Vizard: Well, what’s your advice to what I would call the data newbies, who discovered AI and are now discovering all the nuances of data management? A lot of them, to be honest, when you talk about this, it kind of feels like they’ve just arrived from Mars and discovered that data needs to be managed. It’s like this brave new world for them, and they start sharing what they’ve learned, and everybody shakes their head and goes, I think we’ve been there for the last 10 years. How do we bring them along? Because we like the enthusiasm, but sometimes it feels like they are reinventing wheels.
Matthieu Jonglez: It is true. I’m having some really interesting and challenging conversations with customers and prospects, and even partners in some cases, who are in exactly that situation. If you look at an AI project and think about AI for the sake of AI, on its own in isolation, you might be tempted by some of the things we’re talking about: fine-tuning of models, creating dedicated vector database systems, et cetera, et cetera. In other words, recreating an ecosystem on the side of the enterprise ecosystem.
It might be great for a small use case on the side, but if you try to generalize the use of AI, generalize access to the entirety of the enterprise data, you need something, in terms of approach, architecture and systems, that scales, that’s secure, that can actually be governed and managed and that goes through a proper IT lifecycle. Because those applications will need to be deployed and maintained, they’ll need a build book and a run book, they’ll need to be supported. And that’s very different from the lab exercises we often see being done.
The real challenge is moving from an interesting exercise that might demonstrate a small amount of value in a small use case, to something that delivers tangible value in a much wider use case. Do you remember the AI winters a few years back, 10 years back? We’ve gone through two of them. What were the triggers? Hype, over-promising, under-delivery, [inaudible 00:15:32] technology, siloed data, really complicated deployment approaches that required hyper-skilled people, for an end game that often delivered very low-value, mundane scenarios. Okay, I’m exaggerating a little bit here, because there have been some fantastic advances, right, in medical imagery, et cetera, et cetera. But if you go back to some of the chatbots, some of the outcomes compared to the investments and promises were really poor. We should be very mindful of not repeating those mistakes.
We see low-value use cases being used as flagship projects far too often, going nowhere when they move into production. So think about the three hurdles, I suppose, of putting those projects into place. The plumbing hurdle: how do I wire this in a way that is cost-effective, can be managed by my organization and can actually support IT delivery lifecycles? The data quality hurdle: how do I make sure my data is represented without bias, in ways that honestly reflect the data that [inaudible 00:16:45] my source system? And the consumption hurdle: how do I make sure that data can be consumed, exposed and explored in the best possible way, adding value, essentially, to the data experience? If I build a system, AI-based or not, that’s not providing me with any additional value, just an additional cost to maintain, what have I achieved?
Mike Vizard: So what’s that one thing that you see customers doing right now that just makes you shake your head a little bit and go, folks, we need to be smarter than this?
Matthieu Jonglez: Well, since the topic is RAG versus fine-tuning, I’ll use a fine-tuning example, because I know we’ve strayed from the topic a few times here. The one thing that worries me in many of the projects and experiments I’m seeing is the handling of data security. I mentioned it a few times: the role of role-based access control and compartment security in data, for proper governance, proper data access, proper data segregation, et cetera.
When you’re doing fine-tuning, or any form of training in fact, you are, to a large degree, really increasing that risk. And some people think, actually, if I use public data, or semi-public data, or data that everyone in my organization knows, maybe I’m not risking a lot. Well, actually, we are seeing customers starting to face all sorts of lawsuits. We’ve had a few examples of people whose data has been used in incorrect ways, even in publicly trained systems. So that’s one challenge here.
There’s also, and I don’t know if you’ve seen this or followed it with interest, but there was a paper by Cornell University, I think late last year, that demonstrated a repeatable technique for reverse-engineering embeddings. A lot of the data security in model training was assumed to derive from the transformers and the encoders being one-way: going from words and word sequences to numbers. Using that paper and that approach, and there are actually tools now that are democratizing the approach, you can go back from embeddings to virtually the original text, give or take. And that means that the data you’re using for training is no longer safe. That’s an even bigger argument for using RAG. Even though you could argue, and I know I’ve argued a few times in some of my talks, that RAG is a hack, it is actually a really valuable approach to securing your data.
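To make the threat concrete, here is a toy illustration of the principle, not the Cornell method itself (which trains a model to iteratively invert embeddings): anyone with query access to the same embedding model can score candidate texts against a leaked vector. The `embed` function is a hypothetical stand-in for a real embedding API.

```python
import numpy as np

# Toy illustration of why embeddings are not a safe one-way transform.
# With query access to the embedding model, an attacker can rank
# candidate reconstructions by similarity to a leaked vector; published
# inversion techniques automate and refine exactly this search.
# embed() is a hypothetical stand-in for a real embedding API.

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_guess(leaked: np.ndarray, candidates: list[str]) -> str:
    # The candidate whose embedding lands closest to the leaked vector
    # is likely close to the original text.
    return max(candidates, key=lambda text: cosine(embed(text), leaked))
```

This is why embedding stores deserve the same protection as the raw text, and why keeping data behind query-time access controls, as RAG allows, is the safer default.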
Mike Vizard: True that. Folks, you heard it here. AI is an adventure, but just be aware of the fact that maybe you don’t want to create yet another silo off to the side. You might want to just study up a little bit more on data management and all the platforms that are currently available and figure out how to use them most appropriately. Matthieu, thanks for being on the show.
Matthieu Jonglez: Thank you, Mike. Thank you for having me.
Mike Vizard: Thank you all for watching the latest episode of the Techstrong AI Leadership video series. You can find this episode and others on our website. We invite you to check them all out. Until then, we’ll see you next time.