Testlio today extended the reach of its crowdsourcing platform to include the ability to test large language models (LLMs).
The Testlio platform provides access to a network of more than 80,000 professional software testers that it has vetted. Those testers are members of the Testlio Academy and, at the very least, have completed a foundational “Introduction to Testing AI-Powered Systems” course.
The Testlio platform was originally created to provide an artificial intelligence (AI) service for evaluating applications as they are being developed. Now the reach of that platform is being extended to also evaluate the LLMs that are the foundation upon which many new applications are based, says Dean Hickman-Smith, chief revenue officer for Testlio.
That approach makes it possible to validate AI model behavior in real-world conditions across languages, devices, and regions to better detect and mitigate hallucinations, bias and compliance issues, he added.
In effect, organizations can now use a network of professional human software testers as a red team that discovers prompt injections, jailbreaks, and other vulnerabilities before they reach a production environment, said Hickman-Smith.
Organizations can also use the Testlio platform to continuously monitor LLM performance to identify drift, regression, and degradation issues, he notes.
Beyond core model testing, the Testlio platform assesses latency, response formatting, contextual accuracy, and integration reliability.
Those testing results can then be fed back to the providers of LLMs to create a reinforcement learning loop that ultimately improves the quality of the output, says Hickman-Smith.
Earlier this year, Testlio launched LeoAI Engine and LeoMatch, a set of proprietary technologies that accelerate test orchestration and talent pairing. They were trained using more than 2.6 million test cases involving more than 600,000 devices.
It’s not clear to what degree organizations are testing the code generated by AI coding tools, but the volume of code is clearly overwhelming many existing testing workflows, said Hickman-Smith. “It’s outpacing the ability to validate,” he adds.
Early testing of LLMs conducted by members of the Testlio crowdsourcing network indicates there are substantial LLM issues, with 79% of bugs classified as medium or high severity and 82% of issues involving hallucinations or misinformation, particularly in chatbot and retrieval-augmented generation (RAG) systems. Those tests suggest that LLMs are blending facts with fabricated details at a significantly higher rate than most organizations fully appreciate.
As more organizations start to realize the extent to which AI coding tools are creating vulnerabilities along with inefficient code that is costly to run, a pressing need to revisit how tests are conducted is becoming apparent. Despite those concerns, it’s not likely that developers are going to abandon those tools, but a role for humans who have the expertise required to test LLMs is emerging. The issue then becomes ensuring that the base of IT professionals capable of conducting those tests continues to expand. Otherwise, it’s only a matter of time before all the code being generated by AI coding tools becomes too much of something that might not necessarily be a good thing after all.

