Synopsis: In this AI Leadership Insights video interview, Mike Vizard speaks with Jonathan Ellis, CTO for DataStax, about vector databases and AI.
Mike Vizard: Hello and welcome to the latest edition of the Techstrong.ai video series. I’m your host, Mike Vizard. Today, we’re with Jonathan Ellis, who’s CTO for DataStax. We’re talking about databases, vector databases, AI, Cassandra, and how it all comes together. Jonathan, welcome to the show.
Jonathan Ellis: Thanks for having me on, Mike.
Mike Vizard: It seems like everywhere you turn, everybody’s talking about AI, but I think they’re kind of overlooking this whole vector database category and what’s involved in that and what it takes to update and train some of these AI models. So I know that Cassandra has added some vector capabilities and it was kind of a rush, but walk us through why this matters and what are the challenges with setting this all up?
Jonathan Ellis: Totally. So the big excitement here is that since GPT came on the scene last year, which is when people really started noticing it, and especially with GPT-4 at the beginning of this year, there’s been a realization that I can apply AI to my business problems by turning those problems into text that I can give to GPT and asking it to use its general problem-solving abilities to tackle my data.
In 2021 or 2022, your approach to solving AI problems for your business would have been: hire a team of PhDs, hire another team of data scientists to clean your data and make sure you have something robust you can train a model on, have the PhDs build the model for you, and then go find a big compute farm that will rent you a bunch of H100s to train it on. With GPT, you don’t need to do that. All you need to do is turn your problem into text, and then you can say, “Hey, GPT, what are the outliers in this dataset?” and it can solve that kind of problem for you.
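As a rough illustration of what “turning your problem into text” can look like, here is a minimal sketch using the OpenAI Python client. The model name, the sample data and the prompt wording are illustrative assumptions, not anything discussed in the interview.

```python
# Minimal sketch: hand a small business dataset to an LLM as plain text
# and ask a general question about it. Assumes OPENAI_API_KEY is set;
# the model name and the data are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

daily_orders = {"Mon": 120, "Tue": 118, "Wed": 640, "Thu": 125, "Fri": 131}

prompt = (
    "Here are daily order counts: "
    + ", ".join(f"{day}: {count}" for day, count in daily_orders.items())
    + ". Which days look like outliers, and why?"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```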
And so then the question is, why is vector search important? It’s because, left alone, if you just give the large language model a problem in isolation, it tends to hallucinate and make things up when it doesn’t know the answer. So your job as an engineer building a modern AI application oriented around these large language models is to figure out how to give the model the right context, so that it has the answer in front of it and all it has to do is put the pieces together. When you do that, you have a very high success rate of getting useful answers back instead of hallucinations.
So the connection to vector search is: given a big haystack of information, say, all of the shows that Jonathan has ever watched on Netflix, you can use vector search to ask, what are the most similar shows to Cowboy Bebop, which Jonathan just clicked on? And then I can show those in a “most similar” category on my set-top box.
And so Netflix, of course, isn’t doing that with GPT. Netflix has the army of PhDs; they’ve done it the old-school way. But now I can do that in a weekend. I literally did do that in a weekend. I built a browser plugin that lets me search my history of sites that I’ve visited. So not just URLs: I threw the entire text of every page that I visited on the web into a vector database, and then I can say, “Hey, I don’t remember that article I read the other day about fixing the crash in the Java 21 JIT.” I can type that into the search box, and the vector search will pull out the most similar articles.
So that’s why you see Cassandra and everyone else so interested in adding vector search to their database: so that you can retrieve that most relevant data and then, in turn, give it to the LLM and solve problems in a weekend or a month instead of a year to multiple years.
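As a rough sketch of the retrieval pattern described above (embed each page’s text, then find the pages most similar to a natural-language query), here is a minimal example using the sentence-transformers library and brute-force cosine similarity in NumPy. The model choice, the sample pages and the in-memory “database” are illustrative assumptions; a real application would store the vectors in a vector-capable database.

```python
# Minimal sketch: embed page texts and retrieve the most similar ones
# for a query. Uses sentence-transformers (model choice is illustrative)
# and a plain NumPy array in place of a real vector database.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")

pages = [
    ("https://example.com/java-jit-crash", "Debugging a crash in the Java 21 JIT compiler ..."),
    ("https://example.com/cassandra-tuning", "Tuning Cassandra compaction for time series workloads ..."),
    ("https://example.com/sourdough", "A beginner's guide to baking sourdough bread ..."),
]

# E5 models expect "passage: " / "query: " prefixes; normalized vectors
# make cosine similarity a simple dot product.
page_vecs = model.encode(["passage: " + text for _, text in pages],
                         normalize_embeddings=True)

query = "that article about fixing the crash in the Java 21 JIT"
query_vec = model.encode("query: " + query, normalize_embeddings=True)

scores = page_vecs @ query_vec
for idx in np.argsort(scores)[::-1]:
    print(f"{scores[idx]:.3f}  {pages[idx][0]}")
```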
Mike Vizard: You have seen people touting what they’re calling vector databases, and Cassandra has added that capability almost as a data type that it supports. What’s the difference between having a native one versus one that is an extension of a platform like Cassandra?
Jonathan Ellis: I think this is something that you see over and over again in the industry where somebody says, “Hey, this new thing is useful. I’m going to build a product that does that new useful thing.” And then you have existing products that say, “Oh, hey, we can add that on as a feature on top of what we already do.” And now, we do multiple useful things.
So in the case of vector search, we’ve launched this as Astra Vector, in our cloud, as a service. If all you want to do is vector search, that’s great, we can support that. But what we see more often, in fact 100% of the time, I would say, is that you don’t need just vector search to build an application, even an AI-oriented application. You also need the ability to fetch my history of interactions, which is not a vector query; that’s just a time series query.
And so if you’re doing this with Cassandra, you can do both. Whereas if you start with a vector database that only does the vector part, now you need to go and find something else that can do your time series, your CRUD operations, and so forth.
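To make the “both kinds of query on one database” point concrete, here is a minimal sketch against a Cassandra 5 / Astra-style cluster with vector search enabled, using the Python cassandra-driver. The keyspace, table and column names and the contact point are hypothetical, and a real deployment would handle authentication, schema management and error handling properly.

```python
# Minimal sketch: one Cassandra table serving both a time-series query
# (a user's recent visits) and an ANN vector-similarity query.
# Keyspace, table and contact point are illustrative; vector support
# requires Cassandra 5 / Astra and a driver release with the vector type.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")

session.execute("""
    CREATE TABLE IF NOT EXISTS page_visits (
        user_id text,
        visited_at timestamp,
        url text,
        embedding vector<float, 384>,
        PRIMARY KEY (user_id, visited_at)
    ) WITH CLUSTERING ORDER BY (visited_at DESC)
""")
session.execute("""
    CREATE CUSTOM INDEX IF NOT EXISTS page_visits_embedding_idx
    ON page_visits (embedding) USING 'StorageAttachedIndex'
""")

# Time-series query: the user's most recent visits, no vectors involved.
recent = session.execute(
    "SELECT visited_at, url FROM page_visits WHERE user_id = %s LIMIT 20",
    ["jonathan"],
)

# Vector query: pages whose embeddings are nearest to a query embedding
# (query_vec would come from the same model used when the rows were written).
query_vec = [0.0] * 384  # placeholder; supply a real embedding here
nearest = session.execute(
    "SELECT url FROM page_visits ORDER BY embedding ANN OF %s LIMIT 5",
    [query_vec],
)

for row in nearest:
    print(row.url)
```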
Mike Vizard: How hard is it to set up a vector database? I think people have it in their heads, “This is complicated stuff and I need a specialist.” But is this something that the average DBA can handle? I mean, where are we on the spectrum here?
Jonathan Ellis: Yeah, it’s absolutely something the average DBA can handle. I work at the lowest level at DataStax, on the vector search engine itself. But our product line includes the hosted service; it includes DataStax Enterprise, which you deploy on your own infrastructure. And then we recently open sourced a new embeddable vector search engine called JVector, which is what’s under the hood for those other products. But if I were JetBrains and I wanted to do a vector search against the code in my project on my local machine, then JVector would be a good fit for that.
Mike Vizard: Today, we have MLOps folks, we have DataOps folks, and a couple of DBAs running around, and then there are DevOps teams that are rolling out the models. That’s kind of another software artifact, at least in theory. How do you see all this coming together? It seems like there are a lot of different teams. Do they all need to work hand in glove, or is there some way to centralize how we approach all this?
Jonathan Ellis: That’s a good question, because I think you’ve got a couple of different scenarios. One of them is, “I’m a large company and I already have my machine learning team. Maybe I give this new generative AI to that machine learning team and say, ‘Okay, you guys are in charge of this,’” so I’m trying to fit it into my existing model. But the other scenario is, like I said, that one of the really cool things about this is it enables building AI with very small teams that don’t have those machine learning specialists, in which case it’s going to be more of a line-of-business kind of thing than a centralized machine learning team.
I think there’s room for both. I think you’re probably going to see more exciting innovation coming from people who don’t have that background in machine learning and maybe don’t have to unlearn some of the existing approaches and just try new stuff and see what works.
Mike Vizard: Where are people getting tripped up? We’ve seen enough people start working on this. What do you know now, I guess, that you wish you knew a couple of years ago?
Jonathan Ellis: I guess some of the things that I learned that weren’t necessarily obvious, one of them is which embedding model to start with in general. You can use OpenAI’s embedding service to generate your embedding vectors. You can use GCP’s, or you can use probably 100 different open source models on Hugging Face. So it’s a little bit bewildering: okay, but which one’s actually going to give me the best results?
And for general-purpose text embeddings, it’s E5-small-v2. Well, the E5 v2 family, so it’s called E5, and then they did an update, E5 v2. Within that family, they have a small, a base and a large model. And you can actually measure how good these embeddings are at placing similar things close to each other in the vector space. The E5 base and large models outperform OpenAI’s, and the small comes very close to matching OpenAI’s. And that’s important because the small E5 is a quarter the size of OpenAI’s embeddings.
And every operation that you do with these vectors, they’re relatively large compared to your other pieces of data. An integer is four bytes; an E5-small-v2 vector is 384 float32 values, so it’s about 1,500 bytes. And as those embeddings get larger, your application gets slower. So having an embedding vector that’s four times smaller means your operations are going to be roughly four times faster. That’s one of the things that I’ve learned from the school of hard knocks, if you will.
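As a quick way to check the sizes mentioned here, this is a minimal sketch using the sentence-transformers library. The Hugging Face model ID is the published E5-small-v2 checkpoint; the sample sentences are just an illustration.

```python
# Minimal sketch: inspect the size of an E5-small-v2 embedding and compare
# it to a 1536-dimensional float32 embedding (the dimensionality of
# OpenAI's text-embedding-ada-002). Sentences are illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")

# E5 models expect "passage: " / "query: " prefixes; normalized vectors
# let a dot product serve as cosine similarity.
passage = model.encode("passage: Cowboy Bebop is a space-western anime series.",
                       normalize_embeddings=True)
query = model.encode("query: shows similar to Cowboy Bebop",
                     normalize_embeddings=True)

print(passage.shape[0], "dimensions,", passage.nbytes, "bytes per vector")  # 384 dims, 1536 bytes
print(1536 * 4, "bytes for a 1536-dimensional float32 embedding")           # 6144 bytes
print("cosine similarity:", float(np.dot(passage, query)))
```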
Mike Vizard: What do you think is going to happen in terms of… Am I going to have a lot of smaller databases talking to a lot of different LLMs? Because some will be general purpose, some will be domain-specific, and I’m going to have to find some way to federate the management of this. How does that look as we go forward?
Jonathan Ellis: So I think the way this looks, and you’re starting to see this in some places already, is that you’re going to start with probably OpenAI GPT-4. That’s the smartest model right now, but it’s also one of the most expensive.
And so for your experiments, you do your prototype with GPT-4, and if it works, then okay, now you’ve got some validation and you can invest more time into trying to productionize it. And part of productionizing it means, “I’m going to have to get my costs down. I probably don’t need that most capable, most expensive model.” And so you can say, “Maybe I’ll try GPT-3.5 Turbo, that’s eight times cheaper than GPT-4. Maybe that’s going to be good enough.”
But maybe that’s still too expensive, in which case you can take an open source model. Llama-7B and Mistral-3B are a couple of recent ones that are very capable but much smaller. And you can fine-tune those to work on your problem domain specifically. It’s okay if they get worse. Fine-tuning means I’m taking an existing model that somebody else has already trained, and I’m spending 1% of that training effort to teach it about my specific data.
And often, that means it will get worse at some of those other sets of data that you don’t care about. And that’s okay; what you care about is making it better at this one specific problem. You can’t make a very small model like Mistral-3B or Llama-7B good at everything, but you can often make it good at one very specific problem. And that can bring your cost way, way down.
And what people are also doing to accelerate that, remember what I said, one of the things you’re seeing is that people without PhDs in AI are tackling this. So they’re not looking to build that classic machine learning data pipeline to tackle it. What they do is they say, “Hey, GPT-4, I’m going to have you generate some training data from my business data that I can use to fine-tune that open source model.” And that’s a very, very effective way to leverage the more expensive model to create a less expensive one.
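Here is a minimal sketch of that pattern: asking a stronger model to draft question-and-answer pairs from your own documents, saved in a form that could later feed a fine-tuning job for a smaller open source model. The model name, the prompt wording and the output schema are illustrative assumptions, and real use would involve reviewing and curating the generated examples.

```python
# Minimal sketch: use a strong model to draft fine-tuning examples
# (question/answer pairs) from business documents and save them as JSONL.
# Assumes OPENAI_API_KEY is set; the prompt and schema are illustrative,
# and a production version would validate the model's JSON output.
import json
from openai import OpenAI

client = OpenAI()

documents = [
    "Our returns policy allows refunds within 30 days with a receipt.",
    "Premium support contracts include a four-hour response SLA.",
]

with open("finetune_data.jsonl", "w") as out:
    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": (
                    "Write three question-and-answer pairs a customer might "
                    "ask about the following policy, as a JSON list of objects "
                    f"with 'question' and 'answer' fields:\n\n{doc}"
                ),
            }],
        )
        for pair in json.loads(response.choices[0].message.content):
            out.write(json.dumps(pair) + "\n")
```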
Mike Vizard: How long do you think it will be before AI is truly pervasive, before it’s just embedded in every application we have and it’s just a capability we take for granted?
Jonathan Ellis: I think Silicon Valley is collectively moving in that direction as fast as it can. And I look forward to it. As an engineer, I’ve been using GitHub Copilot, I’ve been using GPT-4, I’ve been using Sourcegraph Cody. These are all AI tools that make me more effective at doing my job. And that’s really exciting, not just as an engineer but as a citizen of the world, because one of the obstacles to digitizing the world and making people more productive has been that programmers are really expensive, so you only apply programmers to the very most important problems.
But now, if I have AI making my programmers twice as productive, three times as productive, I can start solving some of those next-tier-down problems that wouldn’t have been cost-effective to solve with software in the past. So I think this is going to unlock a lot of potential in every industry.
Mike Vizard: All right, folks. Well, you heard it here. We’re on the path. We’re on the journey. It’s just a matter of how quickly we get there and maybe getting the skills required to figure out what a vector database does in connection with an LLM. But someday soon, this is all going to be standard issue equipment. Jonathan, thanks for being on the show.
Jonathan Ellis: Absolutely. Thanks for having me.
Mike Vizard: And thank you all for watching the latest episode of Techstrong.ai. You can find this episode on our website along with our others. Until then, we’ll see you next time.