AI Leadership Insights: Testing Large Language Models

Synopsis: In this Techstrong AI Leadership Insights video, CircleCI CTO Rob Zuber explains why there is a need for a free short online course on how to automate evaluations of large language models (LLMs).

Mike Vizard: Hello and welcome to the latest edition of the Techstrong AI Leadership Insights series. I’m your host, Mike Vizard. Today we’re with Rob Zuber, who’s CTO for CircleCI, and we’re talking about a course that they have created in conjunction with deeplearning.ai to help people figure out how to test all these large language models that are floating around out there. Rob, welcome to the show.

Rob Zuber: Thanks for having me. Happy to be here.

Mike Vizard: What is the challenge exactly with testing these things? I mean, you would think they’re just another type of software artifact. We should be able to test them easily, but I’m assuming there’s a lot of nuances here.

Rob Zuber: Yeah, you’re right, and you’re not, I guess in a sense. That’s what’s particularly interesting about it. And the course, the reason we wanted to put it together was really that I think that we have all the capabilities. But people haven’t really figured out how to put them together to achieve what we’ve been able to achieve with the software artifacts that we’ve been building for a while.
And so the real standout item is non-determinism. I think that’s the easiest way to describe it. But basically from a software perspective, over the many years that we’ve been working on automated testing, et cetera, we’ve focused on always getting the same result. And then as we’ve introduced LLMs and in particular using LLMs to drive software capabilities. So I’m building a product and I want to have some generative AI based capabilities within that product. The results can vary, and that’s expected. That’s part of what makes generative AI so interesting. So give it that sort of human feel in a sense is that it’s not always exactly the same thing, right? I’m generating some content and it’s shaped a little bit differently each time, but there are boundaries to what is good.
So you want to make sure it’s within the bounds, but that’s much more difficult than what we’ve done in the past, right? It’s not a one plus one equals two problem. It’s, this is generally the shape of what we would expect a good answer to look like. The tone is what we want it to be. It’s not making up… The term is hallucinations, but not making up things that don’t actually exist, explaining things that don’t actually exist. And there’s lots of great stories out in the world about where that’s gone wrong for folks.
So keeping that in the bounds, but keeping that sort of natural feel and being able to take on cases and conditions that you hadn’t planned for. I mean, that’s again, some of the things that AI makes so… Just new things that are made possible by AI is that you don’t have to know the exact answer out of the gate because these LLMs can sort that out for you.

Mike Vizard: What are the metrics that I’m looking for here? ‘Cause as you described it, some of these responses, measuring them is a little, shall we say, I’ll use a technical term, mushy. How do we know whether or not the thing is within acceptable parameters in a way that we can quantify?

Rob Zuber: Yeah, so there’s kind of different levels. I mean, the simpler kind of rules-based checks are more similar to things we’ve done in the past. Do these words exist? We would always expect that to exist under these conditions or whatever, but that’s limited. And then what feels challenging to think about, I think, and was tough for me to get my head around at the beginning, is the next step is actually using LLMs to test the output of LLMs, right? So it’s called model graded checks basically. And you… Or model graded evals, but you take the output and then effectively feed it back and say, “Is this within the bounds of what you would expect, sort of thing. Does this look good?” And there are different levels that you can make that work at. I mean, you could obviously use different LLMs, which will give you slightly different grading.
But it’s not like the LLM has a personality. It’s not saying, “Well, that’s my work. So of course I think it’s good.” I mean, it doesn’t know that it created that artifact, right? Although it has been shown by some researchers that there tends to be bias, which is not particularly surprising. Like grading the work of… If one model grades the work of itself, it’s going to grade it higher than the work of other models because that looks closer to what it was trained on or whatever.
So you could go to other models to check. You can give specific parameters to check against. There’s lots of different ways of thinking about that. But you’re ultimately always rating a score, right? You’re saying, “This looks about 80% good, 90% good versus it’s binary, good or bad.” And then looking at thresholds, is it over the sort of percentage that we expected from a quality rating?
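Editor’s note: the two levels Rob describes, a rules-based check and a model-graded eval scored against a threshold, could be sketched in Python roughly as follows. This is a minimal illustration, not CircleCI’s or the course’s implementation. The function names, the 0.8 threshold, and `stub_grader` (a placeholder standing in for a real LLM call) are all assumptions made for the example.

```python
from typing import Callable, Sequence


def rule_check(output: str, required_terms: Sequence[str]) -> bool:
    """Cheap rules-based check: every required term must appear in the output."""
    lowered = output.lower()
    return all(term.lower() in lowered for term in required_terms)


def model_graded_check(
    output: str,
    grader: Callable[[str], float],
    threshold: float = 0.8,  # assumed pass mark, not taken from the interview
) -> bool:
    """Model-graded check: pass if the grader's 0.0-1.0 score meets the threshold."""
    score = grader(output)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"grader returned out-of-range score: {score}")
    return score >= threshold


def stub_grader(output: str) -> float:
    """Hypothetical stand-in for an LLM grader; it scores on length alone."""
    return min(len(output) / 100.0, 1.0)
```

In real use, `stub_grader` would be replaced by a function that sends the output to a grading model and parses a numeric score from its reply. Passing the grader in as a parameter makes it easy to swap in a different model, which is one way to address the self-grading bias mentioned above.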
And then trends, are we getting better or worse over time? Oh, it looks like it’s degrading. And that’s not only something you would do in a testing scenario, but then that’s also something that folks are doing from a monitoring perspective, right? Over time, as I’m seeing the responses that I’m putting out to my customers, is it changing over time? Is that something that we could go back and look at?
But ultimately, again, it’s at first it’s accuracy and relevancy. Those are the top obvious things. But then there are other bounds like tone, is it offensive? Is it choosing words that we wouldn’t want to have associated with our brand? All the way to just sort of steering completely off course, which again, we’ve also seen some interesting media examples of.
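Editor’s note: the “are we getting better or worse over time” question can be approximated by comparing a recent window of quality scores with the window before it. The sketch below is one simple way to do that. The window size and tolerance are assumed values, and `is_degrading` is a hypothetical name, not something from the course.

```python
from statistics import mean
from typing import Sequence


def is_degrading(
    scores: Sequence[float],
    window: int = 3,          # assumed number of runs per comparison window
    tolerance: float = 0.05,  # assumed drop that counts as a real regression
) -> bool:
    """Return True if the latest window's mean score fell more than
    `tolerance` below the mean of the window immediately before it."""
    if len(scores) < 2 * window:
        return False  # not enough history to judge a trend
    previous = mean(scores[-2 * window:-window])
    recent = mean(scores[-window:])
    return previous - recent > tolerance
```

The same function could run over scores collected in CI or over scores sampled from production responses, which matches the point that this applies to both testing and monitoring.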

Mike Vizard: Where does this testing take place? Is it shifting left as it has in earlier cycles for testing, and we’re going to see data science teams do this? Or is there an ML ops platform engineering team that’s taken this over? Where are we in this curve?

Rob Zuber: Yeah, it’s a great question. I mean, it’s in a lot of different places, and not everyone has gone through this process of automating this and shifting it to the left. I think… When I think about what data science teams are doing, that’s primarily going to be around the models themselves as they’re building them.
What we were trying to help folks with and what we are trying to help folks with is now you’re at the point where you’re building the model into a product, leveraging the model to create some specific capabilities in a product. So to give a concrete example, we actually built something which is an error summarizer. So something breaks in your build on CircleCI and you can click on a button and get a summary of all the logs. This is what happened, and this is what you should do to fix it. And so we would want to say, “Does this look like it? Does it make sense? Is it readable? Is it the right tone?” I mean, all those things we would want to be checking for, right?
So testing that, and then that’s something that you can test unsurprisingly in your CI pipeline. But there are… Again, there’s the simple sort of rules checks, and then the model graded stuff. What’s interesting about the model graded checks is they tend to be slower, and they tend to be expensive because you actually have to make a call to an LLM and pay for those tokens. Basically, you’re using up credits as you do that testing. And so at a high volume, if you’re doing that at the rate that you would do it every time you change something on your laptop or whatever, it’s going to be expensive.
And so, one of the things we talked about in the course was breaking that out to do the cheaper checks on every change. The more comprehensive stuff, maybe only when you merge to your main line and you’re looking at something that would go to production. Or there’s different ways you could do it, but optimizing around sort of costs and performance.
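Editor’s note: the split Rob describes, cheap checks on every change and the costlier model-graded checks only on the main line, comes down to a small selection step at the start of a pipeline. The sketch below assumes the branch name is supplied by the CI system. The suite names and the function `select_eval_suites` are hypothetical.

```python
from typing import List


def select_eval_suites(branch: str, main_branch: str = "main") -> List[str]:
    """Pick which eval suites to run for a given branch."""
    suites = ["rules"]  # cheap rules-based checks run on every change
    if branch == main_branch:
        # Model-graded evals are slower and cost tokens, so reserve them
        # for changes that are headed to production.
        suites.append("model_graded")
    return suites
```

For example, `select_eval_suites("feature/error-summary")` returns only the rules suite, while `select_eval_suites("main")` returns both. Other splits are possible, such as running the expensive suite on a schedule rather than on merge.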
At this point, I think people are just trying to get their head around how to test. But we feel like that’s the next level where it is a little bit different because you’re not just running code against code, you’re actually calling out to other services and paying for the use of those services.

Mike Vizard: Do you think we’ll get to a point where auditors are using this same capability to evaluate AI models for bias and whatever it may be? It’s not just at the front end, but it could be regulators at the back end.

Rob Zuber: I think we’ll see these approaches spread. Absolutely. I think it depends, obviously where regulators end up in this, what it is that they’re looking at and looking for. But there are some well-established practices at this point to do this kind of testing. And again, in that case, going back to the bias that comes from testing with the same model, you would say, “Oh, okay, well, we want to make sure this new model that’s coming out is good, we’ll test it with the… We’ll run it through a different one to see the response that we get or the grading that we get out of that one.”
And it might be… You could certainly create models that are designed specifically for that kind of evaluation or making sure that things are within bounds. I mean, I think we’re at the very, very early stages of exploration of how to use all these tools. I mean, we’re kind of a year and a few months into the general population and most technical folks even really understanding what LLMs are or how we can apply them. Because there was just a big shift in availability, and so we’ll continue to evolve on this.
But to your point, regulation, how we’re thinking about acceptable use I think is another path that we’re going to go down. This is a tool that those folks have to use for that as well.

Mike Vizard: Where are we in terms of the relationship between what we’re calling ML ops, which is what the data science folks are using to build the models and traditional DevOps or DevSecOps where we have used that to deploy software artifacts? Do these things need to converge and what does that look like?

Rob Zuber: Yeah, I mean, I do think this is a convergence story over time, which is using all of the capabilities that we’ve evolved through DevOps, DevSecOps over the years, plus what we’ve learned in the data science teams. And up until now, it feels like we’ve operated a little bit separately. I mean, there’s different toolkits, different knowledge and understanding, and I don’t know… Part of it is, we will absolutely leverage each other’s tools. The approach of evals in software as we’re building software is taking things that were already known by data scientists, ML Ops folks, and bringing that into more standard CI/CD delivery pipelines.
And then on the delivery side, we have tools beyond that, beyond just automated testing. Think about controlled release, canaries, blue-green deployments, those sorts of things. I mean, if you’re making significant changes to a model, you want a tool to manage the risk of putting those changes out to your customers. Doing that in incremental roll-outs, having the ability to roll back quickly. Those kinds of capabilities we’ve invested a lot in over the years from a DevOps perspective, because we know that’s how you manage risk effectively. And so applying that to how we roll out models I think will be very valuable for all of us as we try to move faster and innovate.
But we don’t necessarily know what the outcomes are going to be like. You’re never going to be right about everything. I think that’s one of the core lessons of DevOps. So you build tools to manage the risk of being wrong, and that allows you to experiment and move quickly without putting yourself at risk. And so bringing that to how we roll out models and ML tools, I think is a good overlap.
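Editor’s note: an incremental roll-out of a new model version can be sketched as deterministic bucketing, where each user is hashed into one of 100 buckets and only buckets below the current roll-out percentage see the new model. This is a generic illustration of the canary idea, not a description of any CircleCI feature. The function name and the use of a user ID as the hashing key are assumptions.

```python
import hashlib


def routes_to_new_model(user_id: str, rollout_percent: int) -> bool:
    """Deterministically decide whether this user gets the new model.

    The same user always lands in the same bucket, so raising the
    percentage only adds users, and setting it to 0 rolls everyone back.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return bucket < rollout_percent
```

Rolling back is then a configuration change (set the percentage to 0) rather than a redeploy, which is the kind of fast risk management described above.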

Mike Vizard: I think it’s not much of a secret that a lot of the software that we put into production environments is, shall we say, tested inconsistently. In the age of AI, can we afford that? It seems like the risk levels are much higher. And so will testing become much more of a highly required process?

Rob Zuber: Yeah, I mean, I think it’s always a risk calculation, and I think to your point, not everyone is great at calculating the risk from a pure software perspective. I do think there’s a lot more attention being paid from an AI perspective. Because it can be easy to believe that you know how it’s going to perform, but the boundaries are much less clear. I think it’s probably easier to simulate the real world conditions of traditional software because there are only so many potential inputs and outputs, and that’s much broader, particularly as you get to the world of LLMs.
So I do think there’ll be attention paid to this. I think the people who are doing it well are making choices to add more testing. But I think a lot of it is early, people don’t necessarily have the tools, which is part of why we wanted to do this and help get more knowledge out there about how you can achieve this.
And I do think it’s likely that we’ll see some bad scenarios in terms of poorly managed, poorly tested roll-outs that will lead to regulatory concerns and then more enforcement of what people do. But it’ll be interesting to see how folks… Or what that looks like, like really, how do you describe what is required in terms of testing? I mean, I think all of us would love to see… Well, I don’t know if I could speak for everybody. But what I would like to see is folks who are building these things, taking that responsibility ’cause it really is going to be the best thing for their business and avoids a scenario where it feels like we have to have…
I go to highly regulated industries and the process of approval to put something out is not what I think about when I think about great software delivery. If I go to the other end of the spectrum and think about healthcare, I want it to take a really long time with a lot of oversight to put a new pharmaceutical into the market. But I think we have the opportunity to do a really good job ourselves as tech companies delivering this stuff and do the right work so that we don’t end up in that kind of scenario.

Mike Vizard: I feel like it took us a long time to get to automated testing in the traditional software development. Is that going to be a shorter cycle here or are we going to learn from the past?

Rob Zuber: I think so because we have most of the tools. We have the ability to do automated testing. We know how to do CI/CD pipelines and controlled delivery. We have the risk mitigation tools on the rollout. And we have the tools from the data science side of evals and certain model graded evals and those sorts of things to evaluate this sort of stuff. We just have to put them together and use them.
And I think, I believe we’ve learned the lesson of how much value that brings to your software delivery. That it’s very positive to be able to move quickly and with high quality, and we’re going to want to apply that here. I think the people who apply it well will move quickly and continue to deliver with high quality and not end up with very negative outcomes for their brand or whatever that might be. So I think the lessons have been learned, and it’s a matter of piecing together the bits to do it well. I don’t think we’re sort of back to the beginning and questioning whether this is how you deliver software, if that makes sense.

Mike Vizard: Are we going to wind up using LLMs to test LLMs, and then how will we test the LLM to test the LLMs? Follow my circular thinking?

Rob Zuber: Yeah. Yeah. Like I said, it was a little hard for me to get my head around LLMs testing LLMs at the beginning. Because I feel like if it was truly a human and I said, “Did you do your work properly?” They would always say, “Yes.” But that’s not really what’s happening. In particular, again, you could use a different LLM, a different model to process the output and give a grade to it.
And I think there’s plenty of opportunity in there. I think we will have some cases somewhere along the way where we don’t get the result we expected, whatever. But ultimately, I think in the short to medium term, what we want to be doing is grading and then checking, sort of trust, but verify. There will be human monitoring. We pay attention to our own software. We have fully automated testing for our software at CircleCI, but we still use it. We still go look at it and say, “Hey, this doesn’t look quite right. Let’s fix that thing.”
And then I would say the other big thing, depending on the industry, we’ve talked a lot about highly regulated industries. Highly regulated industries are not going to be the first to just trust the output of an LLM that was checked by an LLM sort of thing. So there’s an expression, “Human in the loop,” that’s used quite regularly, which is I am getting the output of the LLM. And then I am using that to provide the ultimate output to the consumer kind of thing.
So I know some folks that are working in the insurance industry and they’re using generative AI to process needs, whether it’s claims or applications, depends on where you are in the industry. But to summarize and generate the data that they need that then they look at and say, “Okay, this looks right, or this doesn’t look right,” sort of thing. So it’s speeding up the process, but not the ultimate decision maker. So there is still a check in the whole process of a human who’s highly qualified, but humans also need sleep, whatever. There’s still opportunity in those complex scenarios for humans to make errors.
So you’re getting benefit, you’re getting a boost, you’re moving faster, probably more accurate in some cases, but you still have a check that you’re not going totally out of bounds. So I think it depends on the scenario, on the risk profile of that particular scenario, et cetera. But I think all the tools are there and we’ll see as we continue to do this, we’ll find the rough spots and we’ll work them out. I mean, that’s what we do in every industry and every piece of technology.
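Editor’s note: a “human in the loop” gate can be expressed as a release rule in which the model’s output is only a draft until a qualified reviewer signs off. The sketch below combines that with an automated grader score. The `Draft` type, the field names, and the 0.8 threshold are all hypothetical, and pairing the human decision with a score threshold is an assumption of this example rather than something described in the interview.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Draft:
    summary: str                           # LLM-generated text awaiting review
    grader_score: float                    # 0.0-1.0 score from an automated eval
    human_approved: Optional[bool] = None  # None until a reviewer decides


def can_release(draft: Draft, threshold: float = 0.8) -> bool:
    """Release only when the automated score passes AND a human approved."""
    return draft.grader_score >= threshold and draft.human_approved is True
```

A draft with a high score but no review yet is held back, so the LLM speeds up the work without becoming the final decision maker.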

Mike Vizard: So is this the year of AI or is this the year where we figure out maybe how to operationalize it, but it’s really going to be like 2025 before we see this pervasively applied?

Rob Zuber: I think we’re going to continue to see innovation and ramp in this space. Some of it will be operationalizing, some of it will be finding new use cases and unlocking some new things. And some of it will be realizing that what we thought was going to be really exciting is kind of not that interesting or whatever. We will test a lot of new things. I think that’s a cycle that we’ve seen. We’ve seen kind of entire companies come and go in the last 12 months around this space. Oh, this is totally innovative and interesting. Actually, that’s not really that interesting. It doesn’t help, or it turns out that’s just a thing that the LLM can do on its own. It’s not a business. On top of that… There’s been a lot of that sort of churn over the last 12 months.
And I think that’s what makes this space exciting, is people try things. It’s fun. We learn new things and then we take that and we build on it. And so I think we’ll see more concrete applications and more operationalization. We know how to do this well. We know how to be confident in the results that we’re getting, which will accelerate us as we go. I don’t know that we’ll see… I mean, I wouldn’t have predicted the last one, so it’s hard to say. I don’t know that we’ll see a whole net new space, but we’ll see fine-tuning, new use cases, some use cases terminating.
The other thing that I think actually we’ll start to see… I don’t know if this is the year of it, is generative AI and LLMs sort of opened everyone’s eyes to, wait a second, there are some amazing things that I can achieve with AI. This is a real thing, we’ve been talking about it forever, right? It’s been since the fifties or whatever. Last year, people really got their head around what was possible. Part of that was you could just type into a chatbot and get real… You didn’t have to understand the depths of the tooling, the technology to really see the potential.
But now people will start to see, “Oh, this is an amazing high powered general application of AI. But for my thing, there’s actually this other simpler way of going about doing it.” And I think people are starting to re-explore the full space and figure out which techniques and tools from the space of AI are going to be really valuable for them in the applications they’re building.

Mike Vizard: All right, folks. There’s going to be lots of large language models, but there’s one rule you should always remember, measure twice, cut once. Hey, Rob, thanks for being on the show.

Rob Zuber: Thanks for having me. It’s a pleasure as always.

Mike Vizard: And thank you all for watching the latest episode of the Techstrong AI Leadership series. You can find this episode and others on our website. We invite you to check them all out. Until then, we’ll see you next time.
