Are You Measuring the Quality of Your AI Agent’s Web Search Results?

Every serious engineering team running AI agents in production has a monitoring stack. You’re tracking latency, watching error rates, and keeping an eye on token spend because the bill has a way of surprising people. You also probably have alerts on output quality and hallucination rates, and maybe some custom evals tuned to your specific use cases.

But here’s a question worth sitting with: do you know what percentage of what your agent actually reads is relevant to the question it’s trying to answer?

For most teams, the honest answer is no. It’s not because they don’t care, but because nothing in the standard tool set surfaces it. And based on a recent analysis of 250 real-world retail web search queries, spanning prices, availability, ratings, discounts, and product specs, the number is worse than most people would guess. Across all query types, less than 0.4% of the content agents retrieved was actually relevant to the questions they were asked to answer. For price queries specifically, agents were pulling pages averaging more than 9,000 characters to extract answers that were typically four to six characters long.

A dollar amount. That’s it. What’s clear is this isn’t a model problem. It’s a pipeline problem, and it probably belongs on your dashboard.

There’s An Observability Gap Nobody’s Talking About

There’s a pattern here that should feel familiar to anyone who’s been in DevOps long enough. In the early days of distributed systems, teams monitored whether services were running. Things like latency, error budgets and downstream dependencies came later, once the field matured enough to understand what “running well” actually meant in practice. Right now, the retrieval layer in most agentic pipelines is at that earlier stage. Teams know if retrieval is failing, but they don’t necessarily know if it’s working well.

The metric worth introducing here is simple: the ratio of useful content to total content retrieved. Some call it “signal-to-noise” at the retrieval layer. For a given query, what’s the minimum string that correctly answers it, and what fraction of everything the agent reads does that represent? Run that across a representative sample of your query types, and you have a baseline. The caveat? Most teams that do this exercise aren’t happy with what they find.
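As a rough illustration of that ratio, here is a minimal sketch in Python. The function name `signal_ratio` and the sample page are assumptions made for this example, not part of any particular tool; the only inputs are the minimal correct answer and the full text the agent read.

```python
def signal_ratio(answer: str, retrieved: str) -> float:
    """Share of retrieved characters accounted for by the minimal correct answer."""
    if not retrieved:
        return 0.0  # nothing was retrieved, so there is no signal to measure
    return len(answer) / len(retrieved)


# Hypothetical example: a six-character price inside a 9,000-character page.
page = "x" * 8994 + "$19.99"
ratio = signal_ratio("$19.99", page)
print(f"signal: {ratio:.4%}  noise: {1 - ratio:.4%}")
```

Character counts are used here because they are cheap to log; the same ratio could be computed over tokens if that is what your billing tracks.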

The reason this failure stays invisible is that it rarely causes outright errors. The agent still returns something, and often something plausible. The problems are subtler: costs that are higher than they should be, accuracy that’s inconsistent in ways that are hard to reproduce, and edge cases where the model confidently returns the wrong answer and nobody can figure out why.

That last one is the most serious of these consequences.

Two Problems From One Root Cause

Take a look at a product page for almost any consumer electronics item on a major retail site. That page contains the current price, of course, but it also contains a struck-through “list price,” prices for different configurations and storage tiers, bundle pricing, sponsored product prices sitting above the fold, and prices that appear in customer reviews (“I paid $X last year and feel ripped off”).

The correct answer is on the page, but so are a dozen plausible-but-wrong answers.
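To make the ambiguity concrete, the sketch below scans a made-up snippet of page text for anything that looks like a dollar amount. The text, the product, and every price in it are invented for illustration; the regular expression is a simple assumption about how prices are formatted, not a robust price parser.

```python
import re

# Matches amounts like $89.99 or $1,299 (a simplifying assumption about format).
PRICE_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?")

# Hypothetical page text mixing the current price with look-alike prices.
page_text = (
    "Sponsored: similar headphones $89.99. "
    "List price $249.99. Now $199.99. "
    "Bundle with case $229.99. "
    "Review: I paid $279.00 last year and feel ripped off."
)

candidates = PRICE_RE.findall(page_text)
print(len(candidates), "price-like strings:", candidates)
```

Only one of those candidates is the answer to "what does this product cost right now," and nothing in the raw text marks which one.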

When you give an LLM 30,000 characters of content and ask it to identify a specific product’s price, you’re not testing the model’s reasoning. You’re testing its ability to pick the right needle out of a haystack full of very similar-looking needles. Research on how large language models behave in long contexts is pretty consistent on this point: accuracy degrades as context grows and as the density of relevant-seeming but incorrect information increases. The model is smart enough, but it’s failing because the input structurally confuses it.

Now layer in cost. At roughly four characters per token, a 9,000-character page is about 2,250 tokens. At 1,000 queries a day, you’re processing roughly 8.5 million characters of content that contributes nothing to any answer. It’s simply bad business: a sustained noise rate above 95% on a given query type means the retrieval strategy for that query type needs to be rethought, not “tuned.”
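The arithmetic can be sketched in a few lines. The page size, query volume, and characters-per-token figure come from the paragraph above; applying a 95% noise rate to the daily volume is an assumption made here for illustration, and the constant names are hypothetical.

```python
CHARS_PER_TOKEN = 4       # rough rule of thumb, varies by tokenizer
PAGE_CHARS = 9_000        # average page size for price queries
QUERIES_PER_DAY = 1_000
NOISE_PERCENT = 95        # assumed noise rate, used only for this estimate

tokens_per_page = PAGE_CHARS // CHARS_PER_TOKEN
daily_chars = PAGE_CHARS * QUERIES_PER_DAY
wasted_chars = daily_chars * NOISE_PERCENT // 100
wasted_tokens = wasted_chars // CHARS_PER_TOKEN

print(f"tokens per page:      {tokens_per_page:,}")
print(f"characters per day:   {daily_chars:,}")
print(f"wasted chars per day: {wasted_chars:,}")
print(f"wasted tokens per day: {wasted_tokens:,}")
```

Multiplying the wasted-token figure by your model’s input price gives a daily cost for content that never contributes to an answer.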

The natural instinct is to use a better model, increase context length and improve the prompts. But those things don’t fix the underlying issue. If the overwhelming majority of what your agent is reading is irrelevant, you’re asking the model to compensate for an architecture that’s not designed with accuracy in mind.

Fixing it starts with measurement. Define what a correct answer should actually look like for each query type your system handles (not the document it lives in, the answer itself). Log the length of the retrieved content alongside it, then compute the ratio. Do this across enough queries and you’ll have a baseline for each query type, which matters because the numbers vary significantly depending on what’s being asked. Structure-heavy tasks like checking a yes/no availability flag tend to be noisier than open-ended description queries, even though they may feel simpler on the surface.
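One way to turn those steps into a per-query-type baseline is sketched below. The record layout (query type, correct answer, retrieved text), the function name `noise_baseline`, and the sample log entries are all assumptions for illustration; the 95% threshold is the one mentioned earlier in this article.

```python
from collections import defaultdict
from statistics import mean


def noise_baseline(records):
    """Average noise rate per query type.

    records: iterable of (query_type, answer, retrieved_text) tuples.
    """
    rates_by_type = defaultdict(list)
    for query_type, answer, retrieved in records:
        if retrieved:  # skip empty retrievals; those are outright failures
            rates_by_type[query_type].append(1 - len(answer) / len(retrieved))
    return {qt: mean(rates) for qt, rates in rates_by_type.items()}


# Hypothetical log entries; real ones would come from your agent's traces.
records = [
    ("price", "$19.99", "x" * 9_000),
    ("price", "$5.49", "x" * 12_000),
    ("availability", "yes", "x" * 30_000),
    ("description", "y" * 400, "x" * 4_000),
]

baseline = noise_baseline(records)
flagged = sorted(qt for qt, rate in baseline.items() if rate > 0.95)
for qt, rate in sorted(baseline.items()):
    print(f"{qt:<14} noise {rate:.2%}")
print("needs rethinking:", flagged)
```

Emitting these averages as a metric per query type is what makes the number something you can put on a dashboard and alert on.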

Once you have the baseline, you have something to act on. Query types with consistently high noise rates are candidates for “targeted extraction,” pulling only the relevant section of a page, or going directly to a structured data source rather than the full document. Again, that’s an architectural decision, not a prompting decision, so it belongs earlier in the design process than most teams currently tend to place it.
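As one possible shape for targeted extraction, the sketch below reads a price from schema.org `Product` markup embedded as JSON-LD, using only the Python standard library. It assumes the page actually carries such a block with a single `offers` object; many retail pages do, but not all, and real markup may nest products under `@graph` or list several offers. The class and function names and the sample HTML are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collects parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buf = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # ignore malformed blocks rather than fail the whole page


def extract_offer_price(html):
    """Return the Product offer price as a string, or None if not found."""
    parser = JsonLdCollector()
    parser.feed(html)
    for block in parser.blocks:
        if isinstance(block, dict) and block.get("@type") == "Product":
            offers = block.get("offers")
            if isinstance(offers, dict) and "price" in offers:
                return str(offers["price"])
    return None


# Hypothetical page: look-alike prices in the body, one structured offer.
html = """<html><body>
<p>List price $249.99, sponsored item $89.99</p>
<script type="application/ld+json">
{"@type": "Product", "name": "Example Headphones",
 "offers": {"@type": "Offer", "price": "199.99", "priceCurrency": "USD"}}
</script>
</body></html>"""

print(extract_offer_price(html))
```

When the structured block is present, the model never has to see the struck-through or sponsored prices at all; when it is absent, the function returns `None` and the pipeline can fall back to a narrower section of the page.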

The teams building AI agents that will hold up over time are treating this as a data-quality infrastructure problem rather than chasing the most sophisticated models. Choosing the best model for your specific needs still matters, of course, but the teams that are winning with AI agents tackle the data-quality infrastructure problem from the beginning and build the observability to prove it.
