Testlio today launched an automation service designed specifically for human testers tasked with testing artificial intelligence (AI) agents.

Announced at the Money20/20 Europe conference, the AI Agent Testing service from Testlio enables human testers to use AI to test and validate whether AI agents are performing tasks within defined parameters.

That approach ensures that humans are validating AI agents that are prone to hallucinations or may have found a way to bypass whatever guardrails have been put in place, says Testlio CEO Summer Weisberg.

That testing capability will prove crucial as organizations rush to deploy AI agents that, in many cases, are not being thoroughly tested before being deployed, she adds. As a consequence, many of those organizations are about to encounter a host of issues ranging from security breaches to excessive consumption of tokens, notes Weisberg.

Other issues organizations are likely to encounter include incompatibilities with specific devices, as well as agentic workflows that, for one reason or another, access the wrong data.

At the core of the Testlio service is a community of application testing professionals who are contracted on a per-project basis. Each AI agent is then given a confidence score using LeoPulse, a proprietary AI-based framework developed by Testlio. Existing Testlio customers include PayPal, NBCUniversal, Strava, Adyen, Dlocal, Uber, and Solidgate.

There is little doubt that, over time, AI agents will be relied on more to test other AI agents, but there will always be a need for humans to be in that loop, says Weisberg. “The magic will be figuring out what testing elements are best performed by AI or a human,” she says.

The number of incidents involving AI agents that are attributable to testing oversights is likely to rise sharply in the months ahead. Each organization will need to determine to what degree to integrate the testing of AI agents with its existing software development workflows or, alternatively, create a separate dedicated function. Regardless of approach, failing to test AI agents that are capable of performing a wide range of autonomous tasks, including deleting a production database, is an invitation to disaster. As such, each organization is going to need to be able to define the level of trust it has in any given AI agent, notes Weisberg.

Hopefully, the pace at which testing can be conducted will align with the rate at which AI agents are being built and deployed. There is an understandable rush to deploy AI agents for fear of falling behind rivals. In reality, however, every organization is encountering the same issues operationalizing AI agents, so the degree to which any one of them is falling behind is debatable.

The one thing that is certain is that there will soon be thousands of AI agents strewn across the enterprise. The challenge now is finding a way to ensure that AI agents, much like any other application, consistently meet quality assurance (QA) goals.