Synopsis: In this AI Leadership Insights video interview, Mike Vizard speaks with Anupam Datta, co-founder, president and chief scientist of TruEra, about testing AI applications built on LLMs.

Mike Vizard: Hello and welcome to the latest edition of the Techstrong.ai video series. I’m your host, Mike Vizard. Today we’re with Anupam Datta, who’s president and chief scientist for TruEra. They are a provider of a platform that helps you test all those applications you want to build with large language models. And we’re going to jump into, well, where is all that testing happening? And maybe we are not doing enough of it. Hey, Anupam, welcome to the show.

Anupam Datta: Thank you, Mike. It’s a pleasure to be here.

Mike Vizard: I feel like this is not a new story. I feel like we’ve been having conversations about testing applications forever, and either we run out of time or the people who need to run the tests don’t know how to build them, so nothing gets tested. And then we deal with all this stuff on the back end and we increase technical debt. Here we are building new applications again with large language models, and it seems like a similar conversation. So what’s different here, and how can we come up with a different outcome?

Anupam Datta: Yeah, that’s a great question, Mike. One thing that’s different with applications built on large language models is that we are now talking about generative AI applications, as opposed to more traditional software applications where the failure modes are in some ways more code-based and simpler to catch. With generative AI, a big chunk of the source of these problems goes all the way back to the data, so the link between what’s causing a problem and what’s not is much harder to establish. So maybe it’s useful for me to give you a couple of examples of the kinds of failure modes that we are observing, and some ways of testing for them and improving models and applications.
So imagine that you’re having a conversation with ChatGPT or some chatbot like that, which is getting heavy adoption, as you know. One kind of failure mode is language-related. When you ask a question in English, you will usually get the response back in English; a lot of the training data on which these models were trained is English, and so they do a good job of recognizing that language. But what we often hear is that if the questions are asked in other languages, sometimes the response still comes back in English. This is an example of a language mismatch problem, where users ask questions in one language but get the response back in a different one. One of the things you can do during testing to address this is to have language match tests built in, which you can do in several different ways. In particular, there are simpler machine learning models that can check whether two chunks of text are in the same language, and if they’re not, we can flag it. And then there are ways of telling the large language model to always respond in the same language in which the question was asked, using a form of prompt engineering, which can be the way to resolve this kind of issue.
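A minimal sketch of the kind of language match test described here, assuming the open-source langdetect library; any language identification model could play the same role, and the prompt text is illustrative rather than anything prescribed in the interview.

```python
# A minimal sketch of a language-match test, assuming the langdetect library
# (pip install langdetect); any language-ID model would work in its place.
from langdetect import detect

def language_match(question: str, response: str) -> bool:
    """Flag responses that come back in a different language than the question."""
    return detect(question) == detect(response)

if not language_match("¿Cómo reinicio mi router?",           # Spanish question
                      "Unplug the router for ten seconds."):  # English response
    print("Language mismatch: flag this response for review.")

# Prompt-engineering fix mentioned in the interview: instruct the model up front.
SYSTEM_PROMPT = "Always respond in the same language the question was asked in."
```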
So that’s one example of where failure modes arise. The other, which is a more general and interesting but also ubiquitous class of issues, is hallucinations. Large language models can make stuff up. If you think about how they’re trained, unlike more traditional machine learning models, they are trained to be generative, meaning they see lots of examples of text from all over the internet during the training process. The goal of that process is to produce reasonable-looking text in English, let’s say, or some other language, based on all the examples they have seen. In particular, that might mean that for general kinds of questions you get reasonable answers, but for very specific questions you might see the large language model make stuff up. So I’ll give you a concrete example of this.
One of my co-founders asked one of these chat models who the founders of TruEra are. There are three founders; ChatGPT got my name correct, and the others were just made up. They were plausible-sounding names, and actually real people, but not the founders of TruEra. It also said that TruEra was acquired by SAP in 2019, which is again a made-up fact. And this is ubiquitous, because we are not that well known as a company yet. It did get the Google founders right, because it has seen enough examples of who the founders of Google are. But when you start getting into more specific types of questions, it can make stuff up, like in this example.
And in fact, in lots of enterprise use cases, the questions being asked of chatbots and question answering systems tend to be more specific. At a telecom company, let’s say the chatbot is supporting a customer service agent; the customers who have questions want to solve a very specific thing: I have this issue, my network has gone down, what should I do to troubleshoot? Those are the areas where LLMs can fall short. So those are two examples to kick off the conversation, Mike.

Mike Vizard: Aren’t we building these applications too quickly then? And do we need to take a moment to think through how this whole process is going to work?

Anupam Datta: Absolutely, absolutely. Our point of view is that evaluation and testing needs to become an integral part of the development of these kinds of applications. You build your large language models, and then, as you use them as a building block in larger applications like chatbots, question answering or summarization, you need to set up a test harness with an extensible battery of tests customized to your use case, run them carefully, find the blind spots and failure modes, and debug and improve your application before you move it into production. And then once it’s in production, you need to have monitoring tools in place to do ongoing evaluation and maintain the quality of these applications. That’s one part of the tech stack that’s being called observability for LLMs and generative AI in particular. It’s starting to come together now; it’s still early days in this area, and at TruEra this is an area of focus for us.
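A hedged sketch of the kind of pre-production test harness described here. The function names, the evaluation prompts and the app_answer callable are hypothetical placeholders standing in for a real chatbot and real evaluation models, not anything specific to TruEra’s platform.

```python
# A sketch of a pre-production test harness: run a battery of checks over
# example prompts and block promotion if any of them fail. app_answer() and
# the individual tests are hypothetical placeholders.
from typing import Callable

def run_battery(prompts: list[str],
                app_answer: Callable[[str], str],
                tests: dict[str, Callable[[str, str], bool]]) -> dict[str, int]:
    """Run every test over every prompt/response pair and count failures per test."""
    failures = {name: 0 for name in tests}
    for prompt in prompts:
        response = app_answer(prompt)
        for name, test in tests.items():
            if not test(prompt, response):
                failures[name] += 1
    return failures

# Usage sketch: gate the release on a clean run.
# failures = run_battery(eval_prompts, my_chatbot, {"language_match": language_match})
# assert all(count == 0 for count in failures.values()), failures
```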
But if you think about it, Mike, you asked about the difference from traditional software engineering and applications. This kind of observability has over time become a standard part of the tech stack for traditional software. Application performance monitoring companies like AppDynamics and Datadog are multi-billion dollar companies serving that need for traditional software. The analog for generative AI, and AI more broadly, this observability layer of the tech stack, is just coming together as we speak, and more attention to it will be important as we move forward in this space.

Mike Vizard: Is this part of that whole MLOps workflow? I mean, in the DevOps world, we have testing and observability. So are we looking to apply the same thing as part of that MLOps workflow?

Anupam Datta: Absolutely. Absolutely. MLOps is the analog of DevOps for the machine learning platform and operational workflows. Observability is analogous but also different because of the data dependence of machine learning, which you don’t see in traditional software. But I would go a little beyond that to say that MLOps has started to mature for traditional machine learning models, which are often discriminative models, meaning they do tasks around classification and scoring: is a transaction fraudulent or not, how likely is it to be fraud? Those are discriminative tasks, separating out instances that are fraudulent from those that are not. As we go from that class of machine learning models to generative models, like multi-turn chat models or models that generate text, marketing copy, sales outreach and so on, this area of the tech stack is starting to be called LLMOps, as an augmentation to MLOps. But it’s also significantly different, for some of the reasons I mentioned earlier. Maybe I can add a little bit more there.
The way that we are building LLM applications is dramatically different from how we were building machine learning applications with the previous generation of models. In some ways it has gotten easier, because traditionally for MLOps, training models was a huge part of the process, and that was hard; it took machine learning experts and a good amount of tooling to do it well. Now what we are seeing is the emergence of pretrained models. OpenAI has built a whole bunch of models, and you can simply access them over APIs and build out applications rapidly, in the course of days or weeks, as opposed to the six months it took before. So there’s going to be increased proliferation. The rate at which useful applications are being built has dramatically accelerated. Software engineers, full-stack engineers, can put together applications in days or weeks that would previously have taken machine learning experts working together with software engineers months to produce.
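To illustrate the point about building on pretrained models over an API, here is a minimal sketch using the OpenAI Python client; the model name and prompts are assumptions, and any hosted LLM API would illustrate the same idea.

```python
# A minimal sketch of calling a hosted, pretrained chat model over an API.
# The model name is an assumption; swap in whichever hosted model you use.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(question: str) -> str:
    """Send a single question to a pretrained chat model and return its reply."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Always respond in the same language the question was asked in."},
            {"role": "user", "content": question},
        ],
    )
    return completion.choices[0].message.content
```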
So in some ways, what has happened is a democratization of application building powered by large language models. If you look around, you’ll see regular hackathons where, over the course of a day, eight hours, or a week or two, really significant applications are getting built. And these are powerful applications. One takeaway I will leave your audience with is that pretty much every knowledge worker will very soon have incredibly powerful copilots to assist them in their tasks, whether it’s an educator tutoring students, someone in sales writing outreach emails, someone writing marketing copy, or folks working in law and medicine. As this proliferation happens over the next couple of years, it’ll become increasingly important to put in place this LLMOps layer for observability, to ensure that applications are carefully evaluated before they get pushed into production and then continue to be evaluated on an ongoing basis.

Mike Vizard: Do you think ultimately there’ll be a backlash against some of these applications because we failed to test them? There’ll be this kind of pushback from folks who will say, “Well, AI doesn’t work because X, Y and Z,” but it really just comes down to the fact that we didn’t test something.

Anupam Datta: Yeah. I think that’s a very real risk that you’re highlighting, Mike. To some extent you can see it already; there are lots of examples in the news of incorrect answers and hallucinations. So part of it is about testing, much like you’re pointing out, and about observability more generally in production. The other area, which I think is also incredibly important, is education, so that users of the system have the right view of how to use these applications effectively: what they’re good for, what they’re not good for, where the blind spots are. On the education side, you may have seen the story from some time back where a lawyer used ChatGPT to produce a document that he then went on to submit in court, and it was full of made-up material.
And the opposing counsel discovered that the cases being cited just did not exist; they were completely made up. That’s a good example of someone who used the technology without understanding its limitations: when you start getting into specific types of information like case history, just relying on a large language model is not a great way to do it. For those kinds of problems, you’ll want large language models augmented with knowledge bases, which become a source of truth. The application uses the large language model for tasks that generalize well, like summarization, but the retrieval gets done against a known source of truth that is traceable, so that when you produce the final answer you know which source documents the response was pulled from.
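A hedged sketch of the retrieval-augmented architecture described here: retrieve relevant chunks from a vector store, have the model answer only from those chunks, and return the source documents alongside the answer. The embed, vector_store.search and llm_complete calls are hypothetical placeholders for an embedding model, a vector database client and an LLM client.

```python
# A sketch of retrieval augmented with traceability. embed(), vector_store.search()
# and llm_complete() are hypothetical placeholders for real components.
def answer_with_sources(question: str, vector_store, llm_complete, embed, k: int = 4):
    """Answer only from retrieved documents, and return their sources for traceability."""
    # 1. Retrieve the k most relevant chunks from the known source of truth.
    hits = vector_store.search(embed(question), top_k=k)
    context = "\n\n".join(hit.text for hit in hits)

    # 2. Constrain the model to the retrieved context.
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    answer = llm_complete(prompt)

    # 3. Return the answer together with the traceable source documents.
    return answer, [hit.source for hit in hits]
```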
That element of traceability is not present in large language models in isolation, but there are architectures for applications that combine large language models with databases, in particular vector databases, where you can provide that kind of traceability. So I would say you’re absolutely right that testing and observability need to become part of the tech stack in a very significant way, otherwise there will be backlash, some of which we are seeing already. But we also need broader education in this area. Starting from elementary school, human interaction with artificial intelligence systems needs to be woven into our education system. My daughter is eight, and I remember her interacting with ChatGPT when it first came out; I went by her classroom and gave a talk. It’s quite dramatic: everything they’re doing, from math and writing to reasoning through various problems, these are all areas where these kinds of technologies are starting to get quite good.
And so I think one thing we will want to do is view this technology as a copilot in a lot of our learning activities, right from elementary school, and then have educational programs incorporate that across the board, from school to college to adult education, so that we are well equipped for a world where we will have these incredibly powerful assistants that can also sometimes be dramatically wrong, and we understand how to use them effectively, ignoring them when they make mistakes, as copilots, not as equals.

Mike Vizard: All right, folks. Well, you heard it here. At the end of the day, if we don’t test, you will be proceeding at your own peril. Anupam, thanks for being on the show.

Anupam Datta: Thank you, Mike. It was a pleasure. I appreciate it.

Mike Vizard: Thank you all for watching the latest episode of Techstrong.ai. You can find this and other episodes on our website. We invite you to check them all out. Until then, we’ll see you next time.