Vast amounts of unstructured enterprise data were always a problem worth solving. However, for years, the tools available to solve it required more engineering effort than most teams could justify.

Retrieval-augmented generation (RAG) swept through enterprise AI conversations with an astonishingly simple promise. Connect your documents to a language model, let the model retrieve what it needs before it answers, and the gap between a generic AI assistant and one that knows your business closes itself. The tooling ecosystem that grew up around that promise moved fast enough that most teams were running their first RAG pipeline before they had a clear picture of what making it production-ready would actually require.

The gap between what teams expected to build and what they actually built is where a significant amount of engineering time and organizational credibility has quietly disappeared. The problem is not that RAG does not work; it is that most teams treat it as one problem when it is structurally two, and they build a solution to one while leaving the other largely unaddressed. The consequences surface not in the demo but three months into production, when retrieval quality degrades, edge cases accumulate and the AI that impressed stakeholders in the presentation starts giving answers that nobody on the engineering team can defend.

The Expectation That Created the Gap 

Carlos Rolo, a dedicated open-source community leader with contributions to Cassandra, OpenSearch and Cadence, and an engineering expert at Instaclustr, a NetApp company, describes the dynamic that sets most RAG projects up for a harder road than they anticipated. Organizations walk in with PDFs, Word documents, SharePoint content and the reasonable expectation that a RAG pipeline will ingest it all and make it queryable. What that data actually contains, however, is where the expectation meets the engineering: tables with ambiguous commas, images and charts that extraction tools cannot reliably interpret, inconsistent schemas and legacy content that predates any formatting standard the current team would recognize. None of it is what the vendor pitch prepared the team for, because the vendor pitch was about the retrieval and generation layers rather than the data preparation layer that both of those depend on entirely.

“I have seen very robust RAG pipelines fall over because a table had a comma in it, which is immediately understandable by us humans, but it completely breaks down the RAG pipeline,” Rolo explains. “Then you cannot make any retrieval.” The failure mode is mundane and the fix is not complicated once it is identified, but it represents a category of problem that teams consistently arrive unprepared for. 
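
A minimal sketch of the kind of failure Rolo describes, assuming the table reached the pipeline as CSV text. The sample rows and helper names are hypothetical. Splitting on every comma misreads a quoted cell, while Python's standard csv module keeps it intact, and a simple width check flags rows that no longer line up with the header.

```python
import csv
import io

# Hypothetical table export: the "revenue" cell contains commas inside quotes.
RAW_TABLE = 'region,revenue,notes\nEMEA,"1,250,000",on track\n'


def naive_rows(text):
    """Split on every comma, ignoring quoting (the fragile approach)."""
    return [line.split(",") for line in text.strip().splitlines()]


def parsed_rows(text):
    """Parse with the csv module, which honors quoted fields."""
    return list(csv.reader(io.StringIO(text)))


def misaligned_rows(rows):
    """Return indexes of rows whose cell count differs from the header row."""
    width = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != width]
```

Running a check like misaligned_rows at ingestion time turns a silent retrieval failure into an explicit data-quality error that can be fixed before anything is embedded.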

Mayank Bhola, co-founder and head of products at TestMu AI, sees this expectation gap play out consistently across enterprise teams. “The teams that struggle most with RAG in production are rarely struggling with the model layer. They come in with strong instincts about which LLM to use and how to prompt it, but they have not asked the harder question, which is whether the data those models are going to retrieve is actually ready to be retrieved. That readiness problem does not announce itself in the demo because demos are curated. It is exposed weeks later into production when the edge cases the demo never touched start hitting the retrieval layer and the system has no mechanism to handle them.” 

The open-source community has recognized the gap. Projects such as Docling from IBM and MarkItDown from Microsoft are addressing the document processing problem specifically because the problem turned out to be both harder and more universal than the first wave of RAG tooling acknowledged. 

The Governance Problem Nobody Budgets For 

The governance dimension of data readiness is the part of the problem Rolo identifies as most consistently underestimated, and the one that creates the most friction in enterprise deployments where the data being ingested is not just messy but sensitive. Once data is in the RAG pipeline, the questions of who can access it, who owns it and whether the retrieval system can enforce access controls at the document level rather than just the query level become operational requirements rather than compliance considerations. 

“Is this being seen by someone’s eyes that should not be seeing it?” Rolo asks, framing the question that organizations running cloud-based AI systems over enterprise content have to answer before they can call the pipeline production-ready. Yet in the rush to move fast, it is the question that gets answered last rather than first.
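
One way to picture document-level enforcement is a filter that runs before any chunk reaches the prompt. The sketch below assumes each chunk carries an allowed_groups set copied from its source document at ingestion; the Chunk type, field names and group labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    allowed_groups: frozenset  # hypothetical ACL copied from the source document
    score: float  # relevance score assigned by the retriever


def authorized_top_k(chunks, user_groups, k):
    """Drop chunks the user may not read, then keep the k best-scoring ones."""
    groups = frozenset(user_groups)
    visible = [c for c in chunks if c.allowed_groups & groups]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:k]
```

Filtering before truncating to k is the design choice that matters here: a restricted chunk can neither take up a slot in the results nor leak into the model's context.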

Alex Merced, head of developer relations at Dremio, frames the governance problem in terms that go beyond access controls and into the semantic layer that sits underneath them. “Most enterprises do not actually have an embedding problem. They have a context problem. Their data is fragmented across systems, poorly described, inconsistently modeled and governed through processes that were never designed for autonomous or semiautonomous agents.” 

Merced argues that, without a governed semantic layer, AI systems will retrieve information that is ambiguous, incomplete, stale or outright unauthorized regardless of how strong the underlying vector search is. “Better embeddings simply make the wrong answers faster.” The teams that avoid this outcome treat RAG as a data platform challenge first rather than an LLM feature, starting from open lakehouse foundations that unify data access, enforce governance and provide a shared semantic understanding of business concepts before the retrieval layer is even built.

Search First, Then Context 

The architectural insight that Rolo arrived at through building and debugging RAG pipelines is the one most teams discover too late and have to rebuild around. RAG is a two-step problem, and the two steps have different failure modes that require different solutions. The first step is search: finding the right chunks of content from the knowledge base to pass to the model. The second step is context: giving the model those chunks in an order and format that lets it reason about them effectively. Most implementations treat the second step as the hard problem and underinvest in the first, then spend months debugging answer quality when the real failure is happening upstream, in search.

“I improved my retrieval massively because I was missing searches and then my content was miserable for the LLM,” Rolo recalls, describing the moment he stepped back from vector-based retrieval and reintroduced keyword search into the pipeline. The instinct in the current tooling landscape pushes toward vectors for everything because vector search is what the current generation of AI infrastructure is built around and because the marketing around it positions semantic similarity as a universal improvement over keyword matching. However, keyword search for precise term retrieval and vector search for semantic similarity are solving different retrieval problems, and the hybrid approach that uses each where it is strongest outperforms either used exclusively. 
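
One common way to combine keyword and vector results is reciprocal rank fusion (RRF). The sketch below illustrates that technique as an assumption; it is not a description of Rolo's pipeline. The document IDs are invented, and k=60 is a conventional default constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in
    (rank starts at 1), so documents ranked well by several retrievers
    rise to the top of the fused list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_sku", "doc_faq", "doc_policy"]
vector_hits = ["doc_faq", "doc_guide", "doc_sku"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that keyword scores and vector similarities are not on comparable scales.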

Merced sees the same over-correction playing out across enterprise teams from the architecture side. “Teams routinely over-optimize the vector layer, tuning embeddings, swapping vector databases or chasing marginal recall improvements, while underestimating issues like data governance, semantic consistency, hybrid retrieval strategies and end-to-end latency constraints.” Open interfaces and standards such as the Model Context Protocol (MCP), he argues, allow AI agents to reason over trusted data, combine symbolic and vector-based retrieval and operate across tools with predictable performance. The result is speed, safety and flexibility without locking teams into brittle pipelines or over-specialized infrastructure that cannot evolve as models and use cases change.

The database decision that supports this architecture deserves more deliberate treatment than it typically receives. Rolo’s practical advice for organizations navigating a landscape where vector databases, graph databases and traditional relational and document stores are all competing for the same budget line is grounded in the operational reality of what teams are actually able to maintain. “The best advice in the database world is the database you already have,” he argues. “If you already have a database and that database supports vectors, start there. Do not reinvent the wheel.” Postgres with pgvector, OpenSearch with ML Commons, Cassandra with vector search — these are not compromises. They are mature infrastructure choices that preserve the operational knowledge a team has already built and avoid the organizational cost of adopting and learning a new system while simultaneously trying to build a reliable AI product. 

What Happens When You Give the Model Too Much 

The context side of the two-step problem has its own failure mode that has become more visible as teams have pushed toward models with larger context windows and discovered that larger capacity does not translate linearly into better performance. Rolo is direct about the pattern that has emerged from teams that tried to solve retrieval quality problems by sending more content to the model. “Having a big context window with low quality, even if somewhere within the low-quality context there is the context you need, is not a good technique,” he points out. Models that receive too much content lose track of what is relevant in the middle, a failure mode known as ‘context rot’, and the model that could reason clearly about three precise chunks becomes unreliable when those same three chunks are buried inside 30 that were only marginally related. 

The implication for retrieval architecture is that precision matters more than recall at the context stage. The goal is not to retrieve everything that might be relevant but to retrieve exactly what is needed and nothing that is not. The model’s ability to reason about the content degrades as the signal-to-noise ratio of the context falls, and different models handle this differently. Rolo underscores that teams need to benchmark their specific model against their specific data to understand where the degradation begins, rather than assuming a general rule applies universally. “Each LLM performs differently,” he notes. “It is not that LLM A is better than LLM B or C. It is about benchmarking our own LLMs to understand how they manage context.”
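
The precision and recall trade-off can be made concrete with a few lines of code. This is a sketch that assumes a hand-labeled set of relevant chunk IDs exists for each test query; the function name and IDs are hypothetical.

```python
def precision_recall_at_k(retrieved, relevant, k):
    """Return (precision, recall) for the first k retrieved chunk IDs."""
    top_k = retrieved[:k]
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Three relevant chunks followed by 27 marginal ones, as in the example above.
relevant = {"c1", "c2", "c3"}
retrieved = ["c1", "c2", "c3"] + [f"noise{i}" for i in range(27)]
tight = precision_recall_at_k(retrieved, relevant, 3)
loose = precision_recall_at_k(retrieved, relevant, 30)
```

Both settings find every relevant chunk, so recall is identical. Only precision shows that the second context is nine-tenths noise, which is why it is the metric to watch when assembling context.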

The synchronization problem adds another dimension that the tooling ecosystem has not yet settled on a standard answer for. Keeping a vector index synchronized with a high-transaction operational database is a harder engineering problem than the initial pipeline build, and it compounds over time as the production data distribution evolves away from the data the index was built on. “Graph is very slow,” Rolo observes of graph-based indexes, “and if you recompute the graph, there goes your data velocity.” On-the-fly embedding computation avoids the stale index problem but trades it for latency. The honest assessment of where the field is right now is that the synchronization problem is unsolved, and the teams building the most reliable RAG pipelines are the ones that have acknowledged that limitation and designed their systems around explicit refresh cycles and retrieval quality monitoring rather than assuming the pipeline will maintain its own accuracy over time.
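
An explicit refresh cycle can start as a comparison between the source of truth and the index. The sketch below assumes the index stores a content hash next to each embedding; the function names and dictionary shapes are hypothetical, and the actual embedding and deletion calls are left out.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's current text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(source_docs, indexed_hashes):
    """Compare the source of truth with the index and list the work to do.

    source_docs: {doc_id: current text} read from the operational database.
    indexed_hashes: {doc_id: hash stored when the doc was last embedded}.
    Returns (IDs to embed or re-embed, IDs to delete from the index).
    """
    to_embed = sorted(
        doc_id
        for doc_id, text in source_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(set(indexed_hashes) - set(source_docs))
    return to_embed, to_delete
```

This does not solve synchronization. It only makes the staleness measurable on each cycle, so the size of the backlog becomes a number a team can monitor.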

What to Build Around 

The RAG projects that have delivered durable value share a set of architectural commitments that are less about technology choices and more about sequencing and ownership. Data readiness is treated as a prerequisite rather than a parallel track. The retrieval layer is treated as a first-class engineering concern with its own quality benchmarks, monitoring and improvement cycle. The context assembly layer is built around what the specific model can handle reliably rather than what the model theoretically supports at maximum capacity. And governance is addressed at the design stage rather than the compliance review stage.

Bhola of TestMu AI frames the operational consequence of getting this sequencing wrong in terms that engineering leaders will recognize from their own post-deployment experience. Speaking from his experience building KaneAI, TestMu AI’s end-to-end testing agent, he says: “What we have consistently seen at TestMu AI is that the teams who treat RAG as a one-time build rather than an ongoing engineering discipline are the ones who end up spending more time in fire-fighting mode than they saved by moving fast on the initial implementation. The retrieval layer degrades quietly, the data distribution shifts and the model gets updated. None of those changes trigger an alert because the system is still functioning. It is just functioning worse than it was three months ago, and nobody has instrumented for that.” The teams that avoid that outcome are the ones that establish dedicated ownership of the retrieval pipeline, run retrieval quality benchmarks on a regular cadence and treat a shift in the data distribution as a signal to re-evaluate the pipeline rather than a background condition to monitor and ignore.
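
Instrumenting for quiet degradation can begin with a scheduled benchmark and a threshold. The sketch below assumes a fixed set of test queries with known relevant document IDs and a recorded baseline score; the search callable, the sample queries and the 0.05 tolerance are hypothetical placeholders.

```python
def mean_recall_at_k(search, benchmark, k=5):
    """Average recall@k of `search` over (query, relevant_ids) pairs."""
    total = 0.0
    for query, relevant in benchmark:
        retrieved = set(search(query)[:k])
        total += len(retrieved & set(relevant)) / len(relevant)
    return total / len(benchmark)


def has_regressed(current, baseline, tolerance=0.05):
    """True when the metric dropped more than `tolerance` below baseline."""
    return (baseline - current) > tolerance


# Hypothetical benchmark and a stand-in for the real retrieval call.
benchmark = [("reset password", ["kb-1"]), ("vpn setup", ["kb-2", "kb-3"])]


def fake_search(query):
    results = {"reset password": ["kb-1", "kb-9"], "vpn setup": ["kb-2", "kb-8"]}
    return results[query]


score = mean_recall_at_k(fake_search, benchmark, k=2)
```

Run on a regular cadence, a check like has_regressed gives the system the alert that Bhola says is missing: the pipeline still answers, but the score shows it is answering worse.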

“The biggest risk you have is not using AI,” Rolo concludes, and the observation cuts against the narrative that the complexity of production RAG is an argument for caution. The engineering problems are real and the path through them is demanding. However, the teams that have navigated them have built AI capabilities that are genuinely useful at scale, and the ones that treated the complexity as a reason to wait are still waiting. The gap between a RAG system that works in a demo and one that works in production is an engineering gap, not a technology gap, and the discipline to close it is available to any team willing to take both steps of the problem seriously rather than just the second one.