As groundbreaking a technology as generative artificial intelligence (AI) is, there are clearly trust issues. The suggestions surfaced by large language models (LLMs) are prone to everything from simple mistakes to outright hallucinations. Data science teams need tools to identify the root cause of those errors long before any potential harm is inflicted.
Today, deepset is moving to address that challenge by adding a Groundedness Observability Dashboard to deepset Cloud that measures the quality of the output being generated by an LLM. deepset Cloud is a platform for accessing multiple LLMs, including models originally developed by, for example, OpenAI and Anthropic, with the overall goal of making it simpler to build AI applications using reusable components and templates.
The Groundedness Observability Dashboard extends the platform with the ability to create scores that measure the precision and fidelity of generated responses against the source documents that were employed to train the AI model. Those results can then be used as a guide for fine-tuning the LLM or the prompts being used to expose the LLM to additional data.
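deepset has not published the exact formula behind those scores, but the general idea can be illustrated with a minimal, hypothetical sketch: split a generated answer into sentences and check how many of them can be matched back to the source documents the response drew on. The function names and the word-overlap threshold below are illustrative assumptions, not deepset's implementation.

```python
# Hypothetical groundedness score: NOT deepset's actual metric, just an
# illustration of checking how much of a generated answer is supported by
# the source documents it was grounded on.
import re


def _tokens(text: str) -> set[str]:
    """Lowercase word tokens for a rough overlap comparison."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness_score(answer: str, source_documents: list[str]) -> float:
    """Fraction of answer sentences that share most of their words with
    at least one source document (a crude proxy for 'supported')."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    doc_tokens = [_tokens(d) for d in source_documents]
    supported = 0
    for sentence in sentences:
        sent_tokens = _tokens(sentence)
        if not sent_tokens:
            continue
        # Treat a sentence as grounded if >= 70% of its words appear in a doc.
        overlap = max(
            (len(sent_tokens & d) / len(sent_tokens) for d in doc_tokens),
            default=0.0,
        )
        if overlap >= 0.7:
            supported += 1
    return supported / len(sentences)


if __name__ == "__main__":
    docs = ["The warranty covers parts and labor for two years after purchase."]
    answer = (
        "The warranty covers parts and labor for two years. "
        "It also includes free shipping."
    )
    print(f"groundedness: {groundedness_score(answer, docs):.2f}")  # ~0.50
```

A low score like the one in this toy example flags that part of the answer has no support in the underlying documents, which is the kind of signal a dashboard can surface before a model reaches production.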
Those metrics can also be used to identify inefficient processes that are driving up LLM costs, says Mathis Lucka, head of product for deepset. “You can optimize costs,” he adds.
Finally, deepset has also added annotation capabilities to make it easier to review source material as part of any effort to fact-check output.
The deepset tools are part of a larger effort to create a trust layer that ensures LLM accuracy, says Lucka. If organizations don’t have confidence in the output of an LLM because it produces too many errors or outright hallucinations, that will ultimately slow enterprise adoption, he notes.
There is, of course, a world of difference between using an LLM to generate, for example, a more compelling email that an individual end user should review before sending and embedding generative AI platforms into business workflows. The tolerance for errors and hallucinations in workflows that impact customers is near zero. The challenge enterprise IT organizations face today in determining the root cause of an issue is considerable, given all the data sources used to train an AI model. The tools from deepset give data science teams a means to triage problems, hopefully before an AI model is deployed in a production environment. That’s critical because the cost of retraining AI models after they have been deployed is considerable.
It’s still early days as far as operationalizing AI within organizations, but it’s apparent there is a need to observe these models in a way that makes it simpler to identify problematic data sources. Most large enterprises have mountains of conflicting data that can easily result in an LLM making suboptimal recommendations. The issue is that many of those mistakes, without tools to identify them, might be too subtle to catch until after a significant amount of damage has been inflicted.
One way or another, auditors are eventually going to require data science teams to document how any LLM was trained. The sooner the tools required to document those processes are in place, the more comfortable everyone concerned about the accuracy of AI platforms is going to be.