Synopsis: In this Techstrong AI Leadership video, Mike Vizard talks to Airbyte COO John Lafleur about how a partnership with Vectara is simplifying retrieval augmented generation (RAG).
Mike Vizard: Hello and welcome to the latest edition of the TechStrong AI Leadership Insight series. I’m your host, Mike Vizard. Today we’re with John Lafleur, who’s COO of Airbyte, and we’re talking about a partnership you have with a company called Vectara that promises to make it easier to build large language models. We’re going to dive in in a second. John, welcome to the show.
John Lafleur: Thank you very much, Michael. Thank you.
Mike Vizard: All right, walk us through this partnership. What are you guys trying to do? And I’m not sure everybody knows exactly who Airbyte is, but it has a lot to do with the movement of data and the training of LLMs.
John Lafleur: Exactly. So Airbyte is an open source data movement infrastructure. We enable you to move data from any kind of source to your data warehouse, your database, your data lakehouse or, today, vector databases. We recently announced an integration with Vectara as a vector database destination, so you can now sync data from any of our sources, and we have more than 300 of them, into Vectara and use all that data to power your LLM there. Say, for instance, you want to pull data from another database: you can do that now, from any kind of source, and train and use your LLM on Vectara with it.
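To make that flow concrete, here is a minimal sketch of what a source-to-vector-database sync can look like, assuming hypothetical helpers for extracting records, chunking, embedding and upserting. None of these names are Airbyte or Vectara APIs; they only illustrate the shape of the pipeline John describes.

```python
# Illustrative sketch only: the functions below are hypothetical stand-ins,
# not Airbyte or Vectara APIs. The shape is: extract records from a source,
# chunk and embed the text, then upsert the vectors into a vector database
# that a RAG application can query.

from dataclasses import dataclass


@dataclass
class Record:
    doc_id: str
    text: str


def extract_records(source_config: dict) -> list[Record]:
    """Stand-in for a connector reading from any of the 300+ sources."""
    return [Record(doc_id="ticket-1", text="Customer asked about refunds.")]


def chunk(text: str, size: int = 200) -> list[str]:
    """Split long documents so each chunk fits the embedding model's input."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(chunks: list[str]) -> list[list[float]]:
    """Stand-in for an embedding model call; returns one vector per chunk."""
    return [[float(len(c))] for c in chunks]  # dummy vectors


def upsert_to_vector_db(doc_id: str, chunks: list[str], vectors: list[list[float]]) -> None:
    """Stand-in for the destination write (e.g. a Vectara-like index)."""
    for i, (c, v) in enumerate(zip(chunks, vectors)):
        print(f"upsert {doc_id}#{i}: dim={len(v)} text={c!r}")


def sync(source_config: dict) -> None:
    for record in extract_records(source_config):
        chunks = chunk(record.text)
        upsert_to_vector_db(record.doc_id, chunks, embed(chunks))


if __name__ == "__main__":
    sync({"source": "some-crm", "stream": "tickets"})
```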
Mike Vizard: Are you going to support other vector databases? And as far as I understand it, there are a lot of other types of databases that support vector capabilities. So is this a fast-growing category in general?
John Lafleur: Yes, Vectara is not the first one. We have Pinecone, and I think we have five of them today, so this is the sixth, and we are adding to the list. Our goal is really to support all vector databases.
Mike Vizard: Who’s in charge of this in a lot of organizations? Because one of the things about your platform that people seem to appreciate is that I don’t have to be a rocket scientist to move data. So are data scientists using your tool, or is somebody else getting involved?
John Lafleur: So originally it was data engineers who were building and maintaining data pipelines in-house. It would take days to build them and more days across the year just to maintain them, because a pipeline will often break with any API change, any change at the source or at the destination. The goal of Airbyte is to make that super easy: we handle the maintenance, and creating a pipeline takes a few minutes. So the goal is to give a lot of time back to data engineers. Originally it was data engineers, but with this new category of AI work we now see not only data engineers but also what we call AI engineers powering all those pipelines. Their goal is to build and maintain pipelines, similarly to data engineers, but for the AI use case, so toward vector databases.
Mike Vizard: Historically, data management in general didn’t get a whole lot of respect. Is that changing in the AI era? Are people starting to realize, hey, there’s something to this whole garbage in, garbage out thing?
John Lafleur: Well, I think with data warehouses, since 2016 we’ve seen that data is the new gold, in the sense that to survive, a company needs to have the right data at the right time. That’s why you see warehouse adoption growing so much, and the success of Snowflake. And the same applies now: if you want to leverage those warehouses, or now vector databases, for AI, you need data, and that doesn’t work without connectors. Since we launched Airbyte three years ago, we’ve got more than 5,000 companies using us on a daily basis, so we see a lot of traction there. What we’ve also seen is that, in the end, data has three use cases. The first is analytics, which is pretty old; it’s always been the case.
But before, we would do analytics on only a few connectors. Now, with solutions like Airbyte, you can do that with all your data [inaudible 00:04:41], all your tools, so you have all the data for analytics. The second use case is operational, usually for databases or to power your own product. This one is also very old; it’s always been the case. But the new one is AI, and that’s the third use case for data. It is still nascent, in the sense that on our side we’ve seen a lot of companies build and test it, but not yet really release it to their own customers. We hope to see that in 2024.
Mike Vizard: These pipelines, as you described, are becoming more critical in the world of AI. Historically, though, the connectors were perceived to be at least somewhat brittle, and every time there was a change in the back end with the data sources, there would be an issue with the pipeline. Have we made these things more resilient? Because it seems like today it’s a whole lot more dynamic an experience.
John Lafleur: Yeah, very true. And that’s why we call Airbyte data movement infrastructure, because infrastructure needs to be reliable and stable. And to be honest, this is difficult to do. It’s a work in progress, even for Fivetran or Airbyte or other ETL providers, and it always will be.
What we do at Airbyte to address this is build tooling. We have a no-code connector builder that makes maintenance very easy. The goal is also that AI can help with it, building connectors automatically through our tooling. What that tooling does is abstract away everything about the connector except the few things that are very specific to that connector. Today we do that manually, and we offer that tooling to the community, but later on we can imagine AI helping not only with building but also with maintaining. And with that tooling we can be a lot more reactive: when a connector breaks, we can go back to the no-code builder and fix it in a few minutes. Maybe at some point AI will auto-fix a few things, we don’t know yet. We’ll see.
Mike Vizard: So do you think ultimately we might apply AI to the management of the data so we’re essentially using AI to heal the AI?
John Lafleur: I think with great tooling we can go a long way with AI. Yeah, we might be able to say, I need this new connector, nobody has it, let me have AI build a first version that I can fix and fine-tune. That could be something.
Mike Vizard: Or at the very least, find the connector that somebody has because that’s usually half the battle right there.
John Lafleur: Exactly. That’s what we want to do at Airbyte at some point. You create your connector, you have a YAML definition, so why not share that YAML across the community?
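As an illustration of that idea, a declarative connector definition can be shared as a small YAML document that tooling loads and runs. The structure below is a hypothetical sketch, not Airbyte’s actual low-code manifest schema, and the API and field names are made up for the example.

```python
# Hypothetical example of a shareable connector definition. The keys below
# are illustrative only, not Airbyte's real low-code manifest schema.
import yaml  # pip install pyyaml

CONNECTOR_YAML = """
name: source-example-crm
base_url: https://api.example-crm.com/v1
auth:
  type: api_key
  header: X-Api-Key
streams:
  - name: contacts
    path: /contacts
    primary_key: id
  - name: tickets
    path: /tickets
    primary_key: id
"""

definition = yaml.safe_load(CONNECTOR_YAML)

# A community member could inspect or reuse the definition without writing code.
for stream in definition["streams"]:
    print(f"{definition['name']} exposes stream '{stream['name']}' at {stream['path']}")
```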
Mike Vizard: You mentioned ETL. Long time ago, ETL was basically a job for somebody considered an administrator, and then that evolved into data engineers and that appears to be evolving into AI engineers. Are these just the same jobs with different titles and more expensive salaries, or are they fundamentally different?
John Lafleur: So to be clear, data engineers are still here and strong, and will be here and strong for a long time, definitely. And data engineers can do the job of AI engineers. I guess AI engineers are just more specialized: different destinations, and different data preparation, for AI purposes. So it’s not that far away. Maybe data engineers can do the work of AI engineers and vice versa. And I’d say data is becoming so important that there will always need to be a human hand to handle it. Now, the whole point of all the tooling is to make it easier. So instead of having five data engineers do something, maybe two or three will be enough. Same for AI engineers. That’s the whole point of the modern data stack that we see.
Mike Vizard: Have we underestimated the data management challenges associated with AI? I feel like a lot of organizations are now trying to figure out how to operationalize AI and the thing that they’re encountering is a lot of the data management issues that they ignored for decades are coming home to roost.
John Lafleur: Yeah, definitely. What happened is that at the beginning with AI, there were no connectors from unstructured data sources, for instance. What that means is that companies would build them themselves, and some companies will think that once you build it, you’re good to go. Oh no, if only that were true. The maintenance is where the work is: making sure it works reliably for every use case, and your own use cases will change too. So data pipelines are really hard.
So that’s why at Airbyte we are now expanding our data sources to unstructured data sources too. Before, it was only structured ones. Now we want to do unstructured as well, so that you don’t need to build that yourself; you can use our open source or our cloud solution and you’re good to go. We also see usage growing bit by bit on the AI use cases. So we see that as still nascent, but high potential.
Mike Vizard: So for folks who are trying to build these pipelines to feed a vector database that in turn feeds an LLM, what’s your best advice? I mean, you’ve been around these pipelines for a while and what do you see people doing time and again that just makes you shake your head and go, folks, we could be smarter than that?
John Lafleur: Yeah, don’t build it yourself. It is very difficult to build, and there are other solutions for that. That’s why we took an open source approach, so that you still have full control, but at least you don’t start from scratch, and you can also leverage the community that can help you. Don’t do it alone. It is not easy. It will always break. Everybody underestimates the amount of work needed for maintenance. If it were only building, that would be easy.
And on our side, our AI journey has taught us a ton about other types of pipelines, because when you build a pipeline toward a vector database, the vector database is closer to an API than to a database. It’s a publish use case, where you publish to something like an API; it’s not like a database, so it makes it even more difficult. Internally, you might have built an in-house pipeline [inaudible 00:11:50] your database; it’s not the same for vector databases. What that taught us is that now we know how to publish toward a destination that is an API, so the next step for Airbyte is to do reverse ETL too, syncing data from Snowflake to Salesforce, for instance. That will be coming in a few quarters. But publishing to a vector database is very different from a database or warehouse, which are the standard pipelines in a company today.
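To make that contrast concrete, here is a hypothetical sketch of the two destination styles John describes. Neither function below is a real warehouse or vector-database client API; the point is only the shape of the calls, a bulk table load versus record-by-record publishing to an API-style endpoint, which is also the shape a reverse ETL sync to a tool like Salesforce would take.

```python
# Hypothetical sketch contrasting the two destination styles.
# Neither "client" below is a real library; only the call shapes matter.

def load_to_warehouse(rows: list[dict]) -> None:
    """Warehouse-style destination: stage a batch, then run one bulk load."""
    staged = [tuple(r.values()) for r in rows]
    print(f"COPY INTO analytics.events FROM staged batch of {len(staged)} rows")


def publish_to_vector_db(rows: list[dict]) -> None:
    """API-style destination: each record is embedded and upserted one by one,
    more like publishing to a SaaS endpoint than loading a table."""
    for r in rows:
        vector = [float(len(r["text"]))]  # stand-in for an embedding call
        print(f"POST /v1/upsert id={r['id']} dim={len(vector)}")


rows = [{"id": "a1", "text": "refund policy"}, {"id": "a2", "text": "shipping times"}]
load_to_warehouse(rows)
publish_to_vector_db(rows)
```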
Mike Vizard: All right folks. Well, you heard it here. Data’s the new oil, but the issue is that we don’t really have any way to get the oil to the actual refinery. Those things are called pipelines, and we need to build them, and hopefully you can just reuse pipelines versus constructing new ones, because that’s as expensive and time-consuming as can be. Hey, John, thanks for being on the show.
John Lafleur: Thank you very much, Michael.
Mike Vizard: All right. Thank you all for watching the latest episode of the TechStrong AI series. You can find this episode and others on our website. We invite you to check them all out. Until then, we’ll see you next time.