Vendors make bold claims about the capabilities of multi-agent AI systems, but many enterprises are a long way from deploying these systems, and further still from achieving success. According to Gartner, only 8% of organizations have AI agents deployed in production. Success rates for these enterprises plummet when agent workloads scale, and compound reliability drops below 50% after 13 sequential steps. Even that figure assumes optimistic reliability at each individual step.
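
To see where a figure like 13 steps can come from, here is a minimal sketch that assumes every step succeeds independently with the same probability. The 95% per-step value is an illustrative assumption, not a number taken from Gartner or from this article, and the function names are hypothetical.

```python
def compound_reliability(per_step: float, steps: int) -> float:
    """End-to-end success probability for `steps` sequential steps,
    assuming each step succeeds independently with probability `per_step`."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return per_step ** steps


def steps_until_below(per_step: float, threshold: float = 0.5) -> int:
    """Smallest number of steps at which compound reliability
    falls below `threshold`."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    steps = 1
    while compound_reliability(per_step, steps) >= threshold:
        steps += 1
    return steps


if __name__ == "__main__":
    assumed_per_step = 0.95  # hypothetical per-step reliability
    for n in (1, 5, 13, 14, 20):
        print(n, round(compound_reliability(assumed_per_step, n), 3))
    print("first step count below 50%:", steps_until_below(assumed_per_step))
```

Under that assumed 95% figure, a 13-step workflow succeeds end to end about 51% of the time and a 14-step workflow about 49%. Real workflows rarely have independent, identical steps, so treat this as intuition rather than a forecast.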

Enterprises encounter several traps in their pursuit of a multi-agent future. Two of them, agent interoperability and the technical hurdles around data access, governance, and workflow, are driving a new kind of data infrastructure called agentlakes.

What Are Agentlakes?

Agentlakes, as defined by Forrester, are composable architectures used to manage, orchestrate, and enable multi-agent systems. An agentlake is not an agent framework. It is the shared data, control, and governance substrate agents depend on to operate at scale. But are agentlakes in use today?

This new concept has several components:

  • Memory and state management used to share context and recall previous interactions. This is commonly handled by vector or graph databases, or by something simpler like a key-value store.
  • Communication fabric to support agent interactions using message brokers and event streaming systems, as well as the MCP (Model Context Protocol) and A2A (agent-to-agent) protocols.
  • Governance, implemented with a combination of policy engines and audit logging.
  • Orchestration, including both task routers and agent lifecycle management.

While agentlakes promise to address interoperability, governance, and orchestration challenges in multi-agent systems, most are designed around data volumes and access patterns common to business applications. When these systems fail, they tend to fail quietly, through increased latency, rising query costs, or gradual degradation in agent accuracy.

Designing Telemetry Agentlakes

Telemetry changes the equation. It arrives continuously, at machine scale, and in volumes that can reach hundreds of terabytes per day. In this environment, agentlakes designed for business data fail catastrophically rather than quietly. Costs spike unpredictably, agents lose visibility into critical signals, and incidents are missed rather than merely delayed. Supporting autonomous agents over telemetry data therefore requires a fundamentally different kind of agentlake, one built to operate safely, economically, and reliably at extreme scale.

First, a telemetry agentlake needs effective data tiering, since you can’t cost-effectively store hundreds of terabytes per day in a database. At a minimum, that means two tiers:

  • Hot tier, like Cribl Lakehouse, Apache Druid, or ClickHouse. This is where the agents live, querying directly for anomaly detection.
  • Cold tier, like a traditional data lake or object storage. Deep forensics and long-term trend analysis take place within this tier.
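
One way to picture the split is a small router that decides which tier, or tiers, a query has to touch based on the time range it covers. This is a sketch under stated assumptions: the seven-day hot retention window, the `TimeRange` type, and `route_query` are all hypothetical and not part of any product named above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retention window for the hot tier; tune to your own costs.
HOT_RETENTION = timedelta(days=7)


@dataclass(frozen=True)
class TimeRange:
    start: datetime
    end: datetime


def route_query(time_range: TimeRange, now: datetime) -> list[str]:
    """Return the tiers a query over `time_range` must read from.

    Data newer than `now - HOT_RETENTION` is assumed to live in the hot
    tier; anything older is assumed to have aged out to the cold tier.
    """
    if time_range.end <= time_range.start:
        raise ValueError("time range must have positive duration")
    hot_boundary = now - HOT_RETENTION
    tiers: list[str] = []
    if time_range.end > hot_boundary:
        tiers.append("hot")
    if time_range.start < hot_boundary:
        tiers.append("cold")
    return tiers


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    last_hour = TimeRange(now - timedelta(hours=1), now)
    last_quarter = TimeRange(now - timedelta(days=90), now)
    print(route_query(last_hour, now))     # anomaly detection window
    print(route_query(last_quarter, now))  # long-term trend analysis
```

An anomaly-detection agent that only looks at the last hour never leaves the hot tier, while a forensic query spanning months is forced to declare that it will touch cold storage, which is a natural place to apply extra cost checks.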

Of course, data ingestion is a key part of a telemetry agentlake. In addition to the core capabilities of a telemetry pipeline, feeding a telemetry agentlake means parsing and vectorizing data at ingest, extracting key entities before the data reaches your hot or cold tiers.
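
As a rough illustration of what parsing and vectorizing at ingest could look like, the sketch below parses a log line, pulls out a few entities, and attaches a toy vector. The log format, the field names, and the hashing-trick vector are all assumptions made for the example. A real pipeline would use its own parsers and an actual embedding model.

```python
import hashlib
import re
from typing import Optional

# Assumed log layout: "<timestamp> <LEVEL> <service>: <message>"
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<service>[\w-]+): (?P<message>.*)"
)
IP_PATTERN = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def hashed_vector(text: str, dims: int = 8) -> list[float]:
    """Toy bag-of-words vector built with the hashing trick.

    A stand-in for a real embedding model, used only to keep the
    example self-contained.
    """
    vec = [0.0] * dims
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    return vec


def enrich(line: str) -> Optional[dict]:
    """Parse one log line and attach extracted entities and a vector.

    Returns None when the line does not match the assumed layout.
    """
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    record: dict = match.groupdict()
    record["ips"] = IP_PATTERN.findall(record["message"])
    record["vector"] = hashed_vector(record["message"])
    return record


if __name__ == "__main__":
    sample = "2025-01-15T10:00:00Z ERROR auth-service: login failed from 10.0.0.7"
    print(enrich(sample))
```

Because the service, level, and IP addresses are extracted once at ingest, downstream agents can filter on those fields instead of re-parsing raw text in either tier.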

Next up, a federated RAG component must support on-the-fly indexing and metadata-first retrieval:

  • On-the-fly indexing allows agents to detect a signal of interest, then trigger an indexing of the relevant period of logs into a temporary vector space to perform analysis.
  • Metadata-first retrieval allows agents to narrow the query space before searching, which reduces both query cost and the amount of data to analyze.
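
The two ideas combine naturally: filter on metadata first, then build a throwaway index over only what survives the filter. The sketch below assumes a hypothetical `LogRecord` shape and uses token overlap (Jaccard similarity) as a stand-in for vector similarity, so it runs without an embedding model. None of these names come from a real library.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    ts: int        # epoch seconds
    service: str
    text: str


def tokens(text: str) -> frozenset:
    return frozenset(re.findall(r"\w+", text.lower()))


def metadata_filter(records, service: str, start: int, end: int) -> list:
    """Metadata-first step: narrow by service and time before any search."""
    return [r for r in records if r.service == service and start <= r.ts < end]


class EphemeralIndex:
    """Temporary in-memory index over a small slice of logs."""

    def __init__(self, records) -> None:
        self._entries = [(tokens(r.text), r) for r in records]

    def search(self, query: str, k: int = 3) -> list:
        q = tokens(query)
        scored = []
        for toks, rec in self._entries:
            union = q | toks
            score = len(q & toks) / len(union) if union else 0.0
            scored.append((score, rec))
        # Sort on score only; Python's sort is stable, so ties keep input order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [rec for score, rec in scored[:k] if score > 0]


def investigate(records, service: str, start: int, end: int,
                query: str, k: int = 3) -> list:
    """Index only the window of interest, search it, then let it go."""
    candidates = metadata_filter(records, service, start, end)
    index = EphemeralIndex(candidates)  # built on demand for this window
    return index.search(query, k)       # index is discarded after return
```

The index never outlives the investigation, so indexing cost tracks the number of signals worth investigating rather than total ingest volume.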

Governance is the next major difference between a business agentlake and a telemetry agentlake. Since you can’t realistically govern a few hundred terabytes of data per day, your governance capabilities must work on dynamic sampling. Priority-based routing allows agents to look at 1-2% of logs, ramping up to 100% if an anomaly is detected. A telemetry agentlake also requires cost governance to prevent agents from running SELECT * against petabyte-scale tables, which could cost thousands of dollars in a single API call.
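
Both controls can be sketched in a few lines. The example below assumes a 1% baseline sampling rate, deterministic hash-based sampling, and a scan-based pricing model expressed in dollars per terabyte. The rates, the price, and every name here are hypothetical and chosen only for illustration.

```python
import hashlib

BASELINE_RATE = 0.01  # hypothetical: inspect about 1% of events normally
ANOMALY_RATE = 1.0    # inspect everything while an anomaly is active


def sample_rate(anomaly_detected: bool) -> float:
    """Ramp sampling from the baseline to 100% when an anomaly is flagged."""
    return ANOMALY_RATE if anomaly_detected else BASELINE_RATE


def should_inspect(event_id: str, rate: float) -> bool:
    """Deterministic sampling: hash the event id into [0, 1) and compare.

    The same event always gets the same decision at a given rate.
    """
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


class BudgetExceeded(Exception):
    """Raised when a query's estimated cost would exceed the budget."""


class QueryBudget:
    """Tracks estimated spend for one agent and rejects over-budget queries."""

    def __init__(self, limit_usd: float, usd_per_tb_scanned: float) -> None:
        self.limit_usd = limit_usd
        self.usd_per_tb_scanned = usd_per_tb_scanned
        self.spent_usd = 0.0

    def authorize(self, estimated_bytes_scanned: int) -> float:
        """Return the estimated cost, or raise before the query runs."""
        cost = estimated_bytes_scanned / 1e12 * self.usd_per_tb_scanned
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(
                f"estimated ${cost:,.2f} exceeds remaining "
                f"${self.limit_usd - self.spent_usd:,.2f}"
            )
        self.spent_usd += cost
        return cost
```

With an assumed price of $5 per terabyte scanned, a one-terabyte query costs an estimated $5 and passes a $10 budget, while a full scan of a one-petabyte table is estimated at $5,000 and is rejected before it ever runs.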

Wrapping Up

Let’s face it: autonomous agents face real reliability challenges. As workflows lengthen, errors compound, reasoning drifts, and success rates fall. These limitations will constrain where full autonomy makes sense in the near term, and they should not be ignored.

But even highly reliable agents will fail without the right data infrastructure. Telemetry arrives continuously, at machine scale, and in volumes that overwhelm architectures designed for business data. When agents are allowed to query, retrieve, and reason over petabyte-scale telemetry without scoped memory, tiered access, and cost controls, failure shifts from a possibility to a certainty. Costs spike, signals are missed, and systems become less observable rather than more autonomous.

Telemetry agentlakes may address this inevitability. They do not solve agent reasoning or eliminate reliability risk. What they propose to do is make multi-agent autonomy viable by enforcing constraints: ephemeral memory instead of permanent indexing, signal-driven retrieval instead of exhaustive search, and budget-aware governance instead of blind querying.

Enterprises evaluating autonomous agents should treat reliability as a risk to be managed, and data infrastructure as a prerequisite to be met. Without a telemetry-aware agentlake, autonomy will fail quickly. With one, organizations gain the foundation required to decide where autonomy belongs and where it does not.