Mike Vizard: Hello and welcome to the latest edition of the Techstrong.ai video series. I’m your host, Mike Vizard. Today we’re with Mike Finley, who’s CTO for AnswerRocket. And we’re talking about how to convert all that structured data, the stuff that makes the business run, into something that can be consumed by a large language model so we can maybe get a smarter generative AI platform together. Mike, welcome to the show.
Mike Finley: Hey, good. Thanks for having me.
Mike Vizard: What exactly is the challenge that we’re looking at here? Because we clearly have seen unstructured data be used to train these large language models, and yet we do have a massive amount of structured data that’s sitting in databases and all kinds of formats. So what do we need to do to kind of make that data part of the greater generative AI ecosystem?
Mike Finley: Right. No, that’s a great question. And I was listening to a piece earlier today where they were talking about how some of the biggest challenges in causal relationships and data science are being solved with state-of-the-art results by language models.
So clearly, the language models have this ability to understand data in a columnar format, but you’ve got to bridge over to them to get them to the point where they even know what the stuff is, and what refers to what, and what meaning it has.
So if you think about the way that language models were trained, with these kinds of, we’ve all heard it before, guess-the-next-word or fill-in-the-blank games. If you try to play that game with data, if I just give you a row of a thousand numbers and say there’s one missing, what value should it have? It’s almost like an SAT question; it doesn’t have a lot of meaning.
Or if you try to say, yep, here’s 20 numbers, what’s the next one that comes at the end? Again, it’s just lacking a lot of context.
In fact we know that data, structured data, is a digitized version of a business. That’s what it’s doing. It is capturing all these different points in time, and sampling, and measuring all these things where we put sensors in. That’s what data is.
Well, by definition, when we’re capturing all that, we’re losing a lot of information too. We’re losing the color of what makes that information useful. We might know that the truck arrived at 7:00 PM, we don’t know that it was one hour late because there was traffic. So we’re missing the story behind the data.
Now, we try to overcome that by capturing more and more data. In fact, I was laughing because a customer the other day was telling me that he had a KPI, which was how much data could he accumulate. So it’s like data has its own purpose. And so we try to capture more and more, trying to eliminate that Nyquist problem, where you only know the signal if you capture enough data to fully identify it.
So we keep trying to capture more, but that just makes the problem even bigger for the language model. So back to your question, how do we overcome this? And the simple version of it is we hydrate the story.
So one row of data might be socks, 12, 59, 4. And that means that on the 4th of June, I sold four socks for $59. Well, tell the story that way. The language model needs to get the story in that form. At the crudest level, turning structured to unstructured is just hydrating that information.
So it’s not just a string of bits that are coming together, but it’s actually the story that’s represented by those bits.
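At its crudest, that hydration step is just a template over each row. Here is a minimal sketch in Python; the row layout (product, day, price, units) is a hypothetical example chosen to match the socks story above, not a real schema:

```python
# A minimal sketch of "hydrating" one structured row into a sentence a
# language model can read. The field names are illustrative assumptions.
def hydrate_row(row: dict) -> str:
    """Turn one fact-table row into a plain-English statement."""
    return (f"On June {row['day']}, we sold {row['units']} units of "
            f"{row['product']} for ${row['price']}.")

row = {"product": "socks", "day": 4, "price": 59, "units": 4}
print(hydrate_row(row))  # On June 4, we sold 4 units of socks for $59.
```

As the next exchange points out, doing this to every row overflows the prompt, so in practice this template is only the crudest starting point.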
Now the problem is, if you then try to say, oh, fabulous, I’ve got a hundred million rows, let me hydrate every one of those, you’re going to overflow the prompt right away. So it’s kind of a silly proposition to say, well, I’m just going to fill in every row of my database like a mad lib, making a sentence out of every row, and the language model will love it.
So instead, you take it one step further, which is to say, right, what I need to do is have the specific analysis that tells the story of that data to the language model. What are the outlier values? Where are the anomalies? What are the most important drivers? Which measures are causing the changes in other measures? It’s what we call the physics of a business, right?
So it’s one thing to say, here’s a bunch of information about where each car was in a race. It’s another thing to say, the blue car passed the red car at turn nine, and the green car won the race.
So instead of taking all that data one row at a time, which is how, for our own convenience, we’ve sampled what the business is doing and shoved it into a big database for analysis, you’ve got to take it back out, analyze it, tell the story that’s in that data, and then give that to the language model.
Now, then the language model is going to collaborate with you on answering questions from that story. So the database all by itself is not going to be able to do anything more than what you’ve taught it, essentially, with classic programs. It’s going to know how to do things like loss prevention, it’s going to know how to do things like trending and forecasting. It’s going to know how to do things like clustering. All these traditional, statistical, data-sciencey kind of things.
We do those things, we take the output of those as an intermediate step, and then we feed the results that they provide into a language model in a structure that tells a story. And the result of that is, now I’ve got a significant amount of data compressed down into a story which becomes part of what the language model knows, and it can answer questions, be interactive, and make predictions and forecasts off of it.
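That intermediate step, running a traditional statistical analysis first and feeding only its story to the model, can be sketched with a simple outlier pass. The z-score threshold and the weekly sales figures below are made up for illustration:

```python
# Sketch: compress raw values into a short "story" sentence by reporting
# only the outliers, rather than hydrating every row. Threshold and data
# are illustrative assumptions.
from statistics import mean, stdev

def outlier_story(label: str, values: list[float], z: float = 2.0) -> str:
    m, s = mean(values), stdev(values)
    outliers = [(i, v) for i, v in enumerate(values) if abs(v - m) > z * s]
    if not outliers:
        return f"{label} showed no unusual values."
    points = ", ".join(f"week {i + 1} ({v:g})" for i, v in outliers)
    return f"{label} averaged {m:g}; unusual values at {points}."

weekly_sales = [100, 98, 103, 101, 240, 99, 102]
print(outlier_story("Weekly sock sales", weekly_sales))
```

Seven data points collapse into one sentence the prompt can afford, which is the compression Finley is describing.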
So I’ll give you a classic kind of problem in this space: the idea that prices went up and sales went down, but a language model doesn’t know which one caused the other. It just sees that the price was this and sales were that, the price changed and the sales went down. It doesn’t know that the price going up is what caused the sales to go down.
So again, this is kind of in that category of where we hydrate the information. We tell the language model the physics of this data. The physics are that the causal relationship is price causes sales, not the other way around. The causal relationship is that weather causes supply chain problems, not the other way around.
So we give it that correlation information. We give it enough of a description, not just in a column name of saying, oh, this one’s priced and that one’s units or something. We give it enough of a description of what actually happened. The elasticity of this particular product changed from last year to this year because there is a change in consumer trends or whatever those things are.
We hydrate that information so that it truly is telling a story, and the language model is able to then use that as part of its prompt, as part of its job to be able to respond.
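One way to picture handing the model that “physics” is to state the known causal directions up front in the prompt. A minimal sketch; the relationships and wording are made up, not any particular product’s prompt format:

```python
# Sketch: spell out causal direction in a prompt preamble so the model
# doesn't have to guess it from correlated columns. All facts illustrative.
CAUSAL_FACTS = [
    ("price", "unit sales", "raising price tends to lower unit sales"),
    ("weather", "supply chain delays", "severe weather causes delays, not the reverse"),
]

def physics_preamble(facts) -> str:
    lines = ["Known causal relationships in this data:"]
    lines += [f"- {cause} -> {effect}: {note}" for cause, effect, note in facts]
    return "\n".join(lines)

print(physics_preamble(CAUSAL_FACTS))
```

The preamble then gets prepended to whatever question the user asks, so the answer respects the stated direction of causation.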
Mike Vizard: So how do we do that at scale? To your point, I have all these rows and columns and things that are all an individual story. How do I convert that into something that doesn’t overwhelm the prompt?
Mike Finley: Right, exactly. And the best answer is, if you look at the really strong disciplines that we’ve built up over the last few years around machine learning operations, the ability to say, I’m going to forecast every SKU across my shelves every day, and figure out where the SKUs are going.
Or, I’m going to segment my customers once a year and come up with a marketing strategy. Or I’m going to run a marketing mix model. All these inputs that we’ve had, all these reports that we’ve been generating to give ourselves the ability to run our business, those things have to become automated. The things that humans were doing before to produce these kinds of analyses need to get automated, to essentially produce one intermediate step that then feeds into the language model.
So it’s almost like saying all the work that we’ve been doing until now, with tons of analysts and legions of people that are crunching through numbers, doing all the data prep, all the joining, all the cleaning, that stuff has to get automated so that the language model can consume the output of it. It can consume all the detection of those unusual circumstances, all the SWOT-type analysis.
So say your goal is to identify threats, for example. Well then what you need to be able to do is look at the data you have, find where you have weaknesses, and project those into the future. Those become your threats.
So how do you get a language model to help you do that? Well, you simply identify things in the past that were unsuccessful, that you would consider weaknesses, and you turn those into a story. Which then you say to the language model, by the way, here’s some things that went wrong in the past. What should I do next in the current situation?
So it’s almost like the old job of labeling in the machine learning world, where we had to go through and say … I remember when we did 50,000 images from the Humane Society website. That’s a cat or a dog. We did 50,000 of them, trying to train a machine learning model to know the difference between cats and dogs, right?
Well, fortunately we overcame that. Deep learning gave us a solution that does that part automatically. That teaches itself that there’s two groups of things, it doesn’t know that one’s cats and one’s dogs. That’s our word. It knows that these things are all like each other and these things are all like each other, and we happen to be the one saying, hey, it’s a cat and it’s a dog.
So again, it’s these massive data treatments, using things like deep nets, classification, and online learning, that are able to crank through huge amounts of data, applying traditional machine learning. All that gets boiled up into the things that become the inputs to the language model.
So it’s not a competing strategy. It’s actually a very complementary strategy. To say, yep, we have some pretty intense tools that we grew up with. Take fraud detection. We’ve done a tremendous amount of work over the last 20 years in automating fraud detection, and none of that goes to waste. All of that deep knowledge about how to detect fraud gets applied before the language model ever even sees the data.
And now we tell it, hey, we’re seeing increased fraud from these regions. We’re seeing increased fraud from these demographics. We’re seeing increased fraud from these product lines, from these merchant types. Whatever those changes are, that story gets fed to the language model in those terms, the same way that you would feed it to a colleague.
If a colleague were asking for a report, they would start with that summarized information that’s provided in a meaningful way, and then they’re going to move it forward down the field, right? They’re going to take it from there and move it into an even more useful form that’s actionable for the business.
Mike Vizard: Is there a certain amount of irony in all this, because we created the structured data to create the shorthand for managing the business. And now we want to rehydrate that data to create the long form of the business to make it consumable by an LLM. So have we come full circle?
Mike Finley: That’s a great point. And you could argue that we have gone too far with the collection of data. If a business manager who understands how operations are going could just summarize that for a language model today, could we skip the whole process of saving away the data and all that? And the answer is no, you can’t.
The reason is because the language model, by definition, hallucinates. It is a fill in the blank machine. It only works in the world by filling in the thing that comes next. It doesn’t know anything else other than how to fill in a blank. It doesn’t know how to not answer a question. So that means that for everything that it says, we’ve got to be able to reference back to that fact data and say where it came from.
So we do have to rehydrate that story, but we’ve got to rehydrate it kind of like our high school term paper, with the little footnote numbers next to the facts, in a way that we can get back to the SQL query that gets back to the sensor that provided the data to begin with.
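That “term paper with footnotes” idea can be sketched as generated claims that each carry a numbered reference back to the query that produced them. Everything here, the claims and the elided SQL alike, is illustrative:

```python
# Sketch: attach a footnote to every claim so each sentence traces back to
# the query that produced it. Claims and queries are made-up placeholders.
def cite(claim: str, source_query: str, footnotes: list) -> str:
    footnotes.append(source_query)
    return f"{claim} [{len(footnotes)}]"

notes: list = []
story = " ".join([
    cite("Sales fell 8% in June.", "SELECT ... FROM sales WHERE month = 6", notes),
    cite("Prices rose 5% in the same period.", "SELECT ... FROM prices WHERE month = 6", notes),
])
print(story)
for i, q in enumerate(notes, 1):
    print(f"[{i}] {q}")
```

Completing that chain, claim to footnote to query to sensor, is what lets the model’s output be checked against ground truth.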
And if we can complete that chain, then suddenly the language model’s results are not just superhuman in terms of the sheer capacity, but they’re also highly accurate because they have those references back to ground truth, that are provided by every one of these steps along the value chain.
So to me, it’s kind of funny. Very similar to the way these language models, even their own creators aren’t sure exactly how they work, it’s kind of funny that every step we’ve taken along the way to get where we are is part of how we can use this latest tool that we’ve created. We need everything we’ve ever done and more to keep going.
Mike Vizard: The interesting thing about this is, you could argue that the structured data is more reliable and more accurate because we took more time and care to create it, or at least we supposedly did. And a lot of the unstructured data is clearly not nearly as accurate because it’s everything from word docs, and spreadsheets, and emails, and whatever else we could find.
So do we need to go through this whole exercise to make the LLMs hallucinate less? And I would just point out, when you and I are wrong, they call it lying. But when the machine is wrong, they call it hallucination.
Mike Finley: Right, right.
Mike Vizard: What is the relationship between the ultimate truth here and the accuracy of the LLM?
Mike Finley: Yeah, no, that’s a fantastic question. And I agree. I’ve often thought that same thing.
Look, treat an LLM like you would a colleague. Give them the best facts you can and check their results. We learned to trust our calculators a long time ago. Nobody’s checked the results of a calculator for a long time. There will be a time when nobody will check the results of a language model. But for now, we’ve got to make sure that we got it right.
So why does the language model speak so well, right? You’ve never seen a language model stick the wrong word in a sentence, or drop a preposition, or make some grammatical error. Why is that?
Well, because it was trained on such a volume of data. For every hundred times it was trained on a correct sentence, there might’ve been one or two where somebody typed it incorrectly. But there were so many examples, there was a training bias toward correct grammar. So it doesn’t get the grammar wrong.
Now, let’s suppose you have in your business, in your unstructured dark data, you’ve got memos and transcripts of meetings and whatever, all kinds of information. And in only one place does it ever say, this competitor is going to beat us because they’re bringing in a new product, right?
Well, the language model has only got that one example to learn from. It’s only going to be as good as that one example. That fact is going to have to get indexed. This is where these vector databases, and LlamaIndex, and these kinds of tools come in. They allow us to say, yep, I’m going to have all that information at the fingertips of the language model. So if somebody says something about the competitor beating us with a new product, I’m going to find that right away.
Now, on the other hand, if there are 18 different places where a piece of knowledge gets referenced, where 17 out of 18 people say one thing and there’s one example where somebody says something different, the language model, again, is going to use that same kind of training bias to say, yep, those 17 examples are correct, and that one example is probably not, right?
So that’s how it’s doing it. That’s how it’s getting through this challenge of knowing what to do with all this unstructured data. Especially when there’s such huge volumes of it, it’s often incomplete. It hasn’t been scrubbed. Because like you said, you can trust the database. You can’t trust your hard drive with whatever’s in your mail history.
Mike Vizard: Do you think the compliance folks are going to sort this all out and come to understand that? And they’ll be coming knocking one day, asking some difficult questions?
Mike Finley: Oh, it’s really interesting. I was in Silicon Valley last week and there are companies who are signing up to provide this role, right? And it makes total sense. The idea that there would be a company who would say, our job is to make sure that that language model is not getting any information it shouldn’t, because it’s in the cloud. That it’s not hallucinating anything back to you because it’s not a fact.
So this is going to become a discipline unto itself. And yeah, I think anybody that’s getting ISO-certified, or anybody that’s getting a SOC 2 certification, there are going to be requirements. Annual pen test, have your hallucinations checked, and get your oil changed. It’s going to be part of how life works.
Because by definition, we are essentially automating the workforce in many regards. And by the same token that we wouldn’t trust one of our employees to go off and do rogue things without other employees checking along the way, we’re going to be doing that for machine learning models.
Mike Vizard: It’s like having a junior employee that you constantly have to check their work and see what’s going on.
What is your best advice to organizations then? I mean, what should they be doing? How should they be approaching this? Because right now I’d say we’re all a little LLM happy. Everybody’s building an LLM for everything and anything they can imagine.
Mike Finley: That’s right. Well, first of all, I would say being LLM happy is the right thing to be. I mean, let’s be clear, this is the biggest thing that’s happened in tech in my lifetime. And if I were a hundred years old or 200 years old, I think I would be saying the same thing, right? This is huge. We’ve discovered alien life. So be LLM happy. Companies who are not are going to get left behind.
So that being said, what to do about it? It’s not just having one intern. It is a hundred thousand interns. It’s as many interns as you can turn on. It’s AI on tap, or interns on tap. And so what can you do and should you do? Simple answer, first thing that’s really easy is, all of that unstructured data that’s been dark, that’s buried on people’s hard drives and in PowerPoints and all that?
Start getting that out. And again, talk about this ecosystem. There’s new providers every day. And some of those new providers are emerging to do this service for you. Get it out of wherever it’s hidden, get it indexed. So we’ve been working on ETL for structured data for years. We’ve developed all these data pipelines and ingestion tools and all this stuff. We’re going to have to have some of those for the unstructured data too.
Now, we don’t actually make a copy of all that data. We just index it. And indexing just means we flash it in front of the language model and say, give me your thoughts. And so we take a hundred million documents off of people’s hard drives and wherever they’re sitting, and we show those briefly to the language model, we get its thoughts and we put those in a vector database.
And now that information becomes immediately findable, right? All of a sudden that information is available. That’s a good thing for companies to go do.
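The index-don’t-copy idea can be sketched with a toy retriever. A real system would use language-model embeddings and a vector database like the tools mentioned earlier; the bag-of-words vectors and document snippets here are stand-ins:

```python
# Toy sketch of "index, don't copy": embed each document once, keep only
# the vector, and retrieve by cosine similarity. Real systems would use
# LLM embeddings and a vector database; this bag-of-words is a stand-in.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = {
    "memo-17": "competitor launching a new product next quarter",
    "notes-03": "help desk ticket volume trending down",
}
index = {doc_id: embed(text) for doc_id, text in docs.items()}

query = embed("competitor new product launch")
best = max(index, key=lambda doc_id: cosine(query, index[doc_id]))
print(best)  # memo-17
```

The point is that only the small vectors and document IDs live in the index; the hundred million source documents stay where they are.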
And by the way, you can turn that into … In fact, we did this. We have an intern who turned our dark data into a productivity application for the entire company. Simply by saying, yep, everybody throw your stuff into this bucket. Every file-sharing tool that we’ve ever had out there whose job was to get information to move between teams, all that content is now in one place.
And the help desk can see the stuff that the sales guys see, and the sales guys can pull in new documents from the customer. It’s very, very easy to do. So that’s a good thing for everybody to do.
And once you have that done, once you have all that dark data indexed, so now it’s available, you can start curating it. You can throw away things that are irrelevant, and update to the most recent documents. But don’t create a huge bureaucracy around that. That could become a thing with a mind of its own.
Just get it in there, to the point where the language model can see it. And then start taking your BI tools. Wherever you do BI today, build the interface bridge that takes that BI content, either the reports and dashboards themselves or the data they come from. Get that structured content and start cross-referencing it with the index from the unstructured data. And really quickly, you can be up and running in a productivity environment.
In fact, I’ve said this about AI in general from really all the work that I’ve done in AI for the last 15 years. Don’t make a giant project with a six-month waterfall deliverable that takes hundreds of millions in resources. Do start quickly. Get something going in a couple of weeks and iterate on that. Because this technology lends itself to that really well.
There are no global experts, except maybe the folks that have made these language models, and there’s only a handful of them in the world. Everybody else is new to prompt engineering, to prompts that are able to call functions, to language models, to vector databases. This is really new stuff for everybody.
So start small, iterate, build up people on your own team that are really smart about it, and leverage every asset you’ve got already. Your existing databases, your existing content, then the existing portals that you have, the infrastructure that you’ve built. You know how to run your business. Just run it better with this tech.
Mike Vizard: Do I need all these data engineers and Data Ops folks, or can I just do this as a mere mortal?
Mike Finley: Look, my career has been built around selling technology. And it’s inevitable, every time a technology purchase is made, the people making that purchase want to know the savings that’s going to come from human labor because of that piece of technology. That’s been true for 30 years, whether I was making cash registers or every other thing that I’ve built along the way, including the AI stuff.
The fact is, all those employees are still there at all those businesses. And what happens is, the productivity goes up, the benefit to the business goes up. So those employees become even more important. So do you need all those engineers doing whatever repetitive tasks they’re doing now, scrubbing data, connecting data, looking for lost records, filling in missing dates?
You may not need them doing that job. But if they know your business and they know your data, and they’re really good people that you can trust, up-skill them. Train them into this new model and take advantage of them, because you’re going to need every one of them to beat your competitors.
Mike Vizard: All right, folks, you heard it here. It’s not so much about replacing people as much as it is letting them do things at scale that previously might’ve been unimaginable. Hey, Mike, thanks for being on the show.
Mike Finley: Thanks for having me.
Mike Vizard: Thank you all for watching the latest episode of the Techstrong.ai video series. You can find this and other episodes on our website. We invite you to check them all out. Thanks for being with us. And until then, we’ll see you next time.