Multi-agent systems have cleared the proof-of-concept stage faster than most engineering organizations were ready for. What started as experimental frameworks in research notebooks and weekend demos is now running in production pipelines at companies whose infrastructure decisions matter at scale.
The failure pattern emerging from that transition is consistent enough to name directly. Multi-agent systems do not fail at the model level. They fail at the architecture layer surrounding the models, in the assumptions that teams carry from single-agent deployments into environments where 10, 50 or 5,000 agents need to coordinate without turning coordination itself into the bottleneck.
The assumptions that break first are the oldest ones. That internal communication is safe. That state can travel as text between one agent and the next without validation. That a well-prompted model will know when to stop, when to escalate and when to reject a result from a peer agent rather than incorporate it and keep going.
Shahid Ali Khan, principal engineering DevOps at TestMu AI (formerly LambdaTest), adds a fourth assumption to that list, one that surfaces later than the others and tends to be more expensive to fix: that the identity model built around human engineers can be adapted incrementally to manage agents that are instantiated, cloned and decommissioned at a rate no static credential system was designed to handle.
These are not exotic failure modes. They are the predictable consequences of applying single-agent intuitions to a fundamentally different class of system, and the engineering teams building past them share a common diagnosis. The problem is not the intelligence of the agents. It is the discipline of the architecture connecting them.
One Agent Cannot Do Everything Well
Itamar Friedman, CEO and co-founder of Qodo, frames the architectural gap in terms that practitioners building on today’s agent frameworks will recognize immediately. The instinct to use the same model for every task in a workflow, coding, reviewing, routing and testing, feels like efficiency but produces the opposite in practice. Coding is a generative process that rewards a model’s ability to produce novel solutions under ambiguous constraints. Code review is a standardization process that requires surfacing deviations from established norms, catching the security issues a generative pass would miss and resisting the creative impulse to suggest alternatives rather than evaluate what is already there.
“Doing the same thing, using the same model and the same agent to do both tasks, is counterproductive,” Friedman argues. “It is like giving the fox the job of guarding the hen house, not because the model is untrustworthy, but because the behavioral properties required for each task are in direct tension with each other.” The implication is architectural rather than a matter of prompt engineering. Building a multi-agent system that can reliably standardize quality alongside generating output requires decomposing the problem into designated agents with specific behavioral constraints for each role and using the orchestration layer to enforce which agent touches which part of the workflow. To explain what that orchestration layer should look like, Friedman draws a parallel to how companies and countries work. There are policies, rules about what comes before what, a division into departments that are each in charge of something specific and a judge who knows the rules and gives verdicts based on a defined set of practices. That, he argues, is the mental model for a mature multi-agent AI system, and most deployments today are not building to it.
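One way to picture the enforcement Friedman describes is an orchestrator that refuses to assign a stage to an agent whose role does not cover it. The sketch below is a minimal illustration under assumed names: `Stage`, `AgentSpec` and `Orchestrator` are hypothetical and do not come from Qodo or any particular framework.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    GENERATE = "generate"
    REVIEW = "review"


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical role definition: a name plus the stages it may handle."""
    name: str
    allowed_stages: frozenset


class Orchestrator:
    """Enforces which agent may touch which stage of the workflow."""

    def __init__(self, agents):
        self._agents = {agent.name: agent for agent in agents}

    def assign(self, agent_name: str, stage: Stage) -> str:
        agent = self._agents[agent_name]
        if stage not in agent.allowed_stages:
            # The rule lives in the orchestrator, not in the agent's prompt.
            raise PermissionError(f"{agent_name} may not perform {stage.value}")
        return f"{agent_name}:{stage.value}"


coder = AgentSpec("coder", frozenset({Stage.GENERATE}))
reviewer = AgentSpec("reviewer", frozenset({Stage.REVIEW}))
orchestrator = Orchestrator([coder, reviewer])

print(orchestrator.assign("coder", Stage.GENERATE))
print(orchestrator.assign("reviewer", Stage.REVIEW))
```

Asking this orchestrator to let the coder review its own output raises `PermissionError`, which is the point: the separation of duties is enforced structurally rather than requested in a prompt.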
The State Problem Nobody Budgets For
When Agent A completes a subtask and hands it to Agent B, what travels between them determines whether the downstream agent is working from reality or from a hallucinated reconstruction of it. Most early implementations default to raw string prompts because they are simple to implement and easy to inspect, but they transfer the burden of interpretation to the receiving agent and create conditions for hallucination to compound across handoffs. Jason Hillary, CTO and co-founder of Zerve AI, has built a three-level architecture that treats this problem as a first-class engineering concern rather than an implementation detail left to the model to resolve.
The first level is database-backed context persistence, where the agent runs store context as JSONB that survives the handoff intact and can be queried by the receiving agent rather than reconstructed from a summary. The second is tool call fetching, where receiving agents query for canvas state, block contents and execution results through dedicated tools rather than relying on what the sending agent chose to include in its message. The third is structured handoff contracts, where a formal handoff class carries context override data validated against the receiving agent’s expected schema before the task proceeds. “This eliminates hallucination during handoffs because agents cannot invent or misinterpret state,” Hillary explains. “They must fetch it from the database through tool calls or receive it via validated structured contracts.” When Agent A delegates a canvas modification to Agent B, the task specification includes the exact canvas ID, the operation type enum and the parameter schemas that Agent B’s tools expect. Agent B then fetches the current canvas state through database queries rather than relying on Agent A’s description of it.
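A structured handoff contract of the kind Hillary describes might look roughly like the following. This is a sketch under stated assumptions, not Zerve AI's implementation: the `Handoff` class, the `CanvasOperation` values, the parameter schemas and the in-memory `CANVAS_STORE` (standing in for the database) are all invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CanvasOperation(Enum):
    ADD_BLOCK = "add_block"
    UPDATE_BLOCK = "update_block"


# Hypothetical parameter schemas: parameter name -> required Python type.
PARAM_SCHEMAS = {
    CanvasOperation.ADD_BLOCK: {"block_type": str, "position": int},
    CanvasOperation.UPDATE_BLOCK: {"block_id": str, "content": str},
}

# Stand-in for the database; a real system would query persisted state.
CANVAS_STORE = {"canvas-42": {"blocks": ["b1", "b2"]}}


@dataclass(frozen=True)
class Handoff:
    """Typed task specification validated before the receiving agent starts."""
    canvas_id: str
    operation: CanvasOperation
    params: dict

    def __post_init__(self):
        if self.canvas_id not in CANVAS_STORE:
            raise ValueError(f"unknown canvas: {self.canvas_id!r}")
        if not isinstance(self.operation, CanvasOperation):
            raise TypeError("operation must be a CanvasOperation")
        schema = PARAM_SCHEMAS[self.operation]
        if set(self.params) != set(schema):
            raise ValueError(
                f"expected parameters {sorted(schema)}, got {sorted(self.params)}"
            )
        for name, expected_type in schema.items():
            if not isinstance(self.params[name], expected_type):
                raise TypeError(f"{name} must be {expected_type.__name__}")


def fetch_canvas_state(handoff: Handoff) -> dict:
    """Receiving agent reads current state from the store, not the sender's prose."""
    return CANVAS_STORE[handoff.canvas_id]


handoff = Handoff(
    "canvas-42", CanvasOperation.ADD_BLOCK, {"block_type": "chart", "position": 2}
)
state = fetch_canvas_state(handoff)
print(state)
```

A handoff with a missing parameter or an unknown canvas ID fails at construction time, before any model is invoked, so a malformed task never reaches the receiving agent.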
The cost of this discipline shows up as latency and is worth paying. The architecture adds routing and validation overhead at every handoff but eliminates the category of failure where a downstream agent operates confidently on incorrect state and produces output that looks valid until it reaches a human reviewer three steps later. Reliability comes from making the handoff explicit, typed and verifiable rather than implicit and trusting.
Orchestration Patterns and Their Real Trade-Offs
The orchestration debate in multi-agent systems tends to polarize around two positions. Directed acyclic graphs offer deterministic execution paths that are easy to test, audit and reason about, but they require the system designer to anticipate the full range of workflows in advance. Supervisor agents offer dynamic routing that handles unanticipated task combinations but add latency at every decision point and introduce non-determinism that makes testing harder. Hillary runs a hybrid that takes the productive features of both without accepting either position entirely.
The DispatchAgent at the routing layer uses tool-based handoffs to route requests to specialized agents dynamically, handling cases that a static DAG would need to be redesigned to accommodate. However, the execution layer underneath follows DAG patterns through an AgentRun and AgentTask system where child runs execute in parallel via ARQ workers with dependencies declared through database foreign keys. The supervisor decides where the work goes. The execution layer decides how it gets done, deterministically and in parallel. “Autonomous routing adds 200–400 milliseconds of LLM latency per routing decision,” Hillary notes, “but parallel DAG execution reduces total workflow time by 60–70% compared to sequential agent chains.” The tradeoff is explicit, measurable and worth making because the parallel execution underneath the routing layer is doing real work while the routing decision is being made.
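The execution half of that hybrid can be illustrated with a small dependency-driven scheduler. The sketch below uses Python's `asyncio` in place of ARQ workers and database foreign keys, and the task names and durations are hypothetical; it shows why a parallel DAG finishes in the time of its longest dependency chain rather than the sum of all tasks.

```python
import asyncio

# Hypothetical workflow: task -> (simulated duration in ms, dependencies).
TASKS = {
    "load": (20, []),
    "clean": (20, ["load"]),
    "stats": (20, ["load"]),
    "plot": (20, ["load"]),
    "report": (20, ["clean", "stats", "plot"]),
}


async def run_dag(tasks):
    """Start each task as soon as its dependencies finish; return completion order."""
    done = {name: asyncio.Event() for name in tasks}
    order = []

    async def run(name):
        duration_ms, deps = tasks[name]
        for dep in deps:
            await done[dep].wait()
        await asyncio.sleep(duration_ms / 1000)  # stand-in for real agent work
        order.append(name)
        done[name].set()

    await asyncio.gather(*(run(name) for name in tasks))
    return order


def critical_path_ms(tasks):
    """Longest dependency chain, which is the ideal parallel runtime."""
    memo = {}

    def finish(name):
        if name not in memo:
            duration_ms, deps = tasks[name]
            memo[name] = duration_ms + max((finish(d) for d in deps), default=0)
        return memo[name]

    return max(finish(name) for name in tasks)


order = asyncio.run(run_dag(TASKS))
sequential_ms = sum(duration for duration, _ in TASKS.values())
parallel_ms = critical_path_ms(TASKS)
print(order[0], order[-1], sequential_ms, parallel_ms)
```

In this toy workflow the sequential total is 100 ms and the critical path is 60 ms. Those numbers are properties of the made-up graph, not a reproduction of Hillary's figures; the savings in a real system depend on how wide the DAG is and how long each routing decision takes.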
MCP and the Transport Decision
The Model Context Protocol (MCP) has quickly become one of the most discussed architectural decisions in multi-agent deployments, and the discussion has been distorted by framing the stdio versus SSE transport choice as a best-practice debate when it is actually a deployment model question. Suyash Joshi, senior developer advocate at InfluxData, makes this distinction with the precision the conversation has been missing.
“MCP does not prescribe a single transport as a best practice,” Joshi explains. “It is designed to be transport-agnostic, with the core value coming from standardized tool discovery, description and invocation across runtimes. That structure remains consistent regardless of whether communication happens over stdio or a networked protocol such as SSE.” SSE is the right transport for centralized MCP services serving multiple remote or browser-based clients, and it carries the operational overhead that centralized services genuinely require, including networking, authentication, concurrency management and life cycle coordination. All of that overhead is warranted in the environments where SSE belongs.
However, in local and edge scenarios where the LLM, the MCP server and the database run on the same machine, that overhead works against the requirements rather than satisfying them. For the InfluxDB 3 MCP Server, Joshi’s team uses stdio specifically because the IoT and edge scenarios it serves have properties that SSE undermines. Offline operation, low latency, minimal configuration and data privacy constraints that make crossing a network boundary unacceptable are all properties that stdio preserves by avoiding a network boundary entirely. Beyond the performance argument, stdio enables tightly bounded deployments with clearer failure modes and simpler debugging, which matter more than centralized accessibility in environments where reliability and locality are the primary requirements. The decision about which transport to use should follow the decision about what the deployment context requires, not precede it.
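Joshi's point that the transport follows from the deployment context can be written down as a small decision function. This is only an illustrative sketch: `DeploymentContext` and `choose_transport` are hypothetical names, and the rule is a simplification of the trade-offs described above rather than anything defined by MCP itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentContext:
    """Hypothetical description of where the MCP server will run."""
    remote_clients: bool        # remote or browser-based clients must connect
    must_work_offline: bool     # deployment has no guaranteed network
    data_must_stay_local: bool  # data may not cross a network boundary


def choose_transport(ctx: DeploymentContext) -> str:
    """Derive the transport from the requirements instead of picking it first."""
    if ctx.remote_clients:
        if ctx.must_work_offline or ctx.data_must_stay_local:
            raise ValueError("remote clients conflict with local-only constraints")
        return "sse"   # centralized service: accept the networking overhead
    return "stdio"     # everything on one machine: avoid the network boundary


edge = DeploymentContext(
    remote_clients=False, must_work_offline=True, data_must_stay_local=True
)
central = DeploymentContext(
    remote_clients=True, must_work_offline=False, data_must_stay_local=False
)
print(choose_transport(edge), choose_transport(central))
```

The conflict branch is the useful part: a deployment that demands both remote access and strict locality has a requirements problem that no transport setting will resolve.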
The Identity Gap That Surfaces Late
The transport decision is one layer of the infrastructure conversation. Another layer has received almost no attention. Discussion of multi-agent infrastructure has focused heavily on compute and latency while largely overlooking the identity problem that emerges when agents begin making routing, execution and delegation decisions autonomously. Khan, speaking from his experience managing agentic infrastructure at scale, identifies this gap as the most consequential one that teams discover after deployment rather than before it.
“Traditional CI/CD pipelines were designed around the assumption that a human engineer made the last significant decision in every chain,” Khan observes, “and the IAM models built around those pipelines reflect that assumption.” When agents make routing and delegation decisions at a rate and volume that human oversight cannot follow, the identity management layer has to change fundamentally rather than adapt incrementally.
Agents can be instantiated, cloned and decommissioned at a rate that makes static credentials operationally dangerous and practically unmanageable. What the architecture requires instead are ephemeral, task-scoped credentials that are issued for a specific workflow step, live only as long as that step requires and cannot be inherited or reused by a subsequent agent in the chain. At TestMu AI, Khan’s team found that the biggest infrastructure gap in moving to agentic workflows was not compute or latency but enforcing that identity boundary at the orchestration layer rather than relying on agents to self-limit. The orchestration layer has to issue credentials, enforce their scope and expire them, not as a security add-on but as a core property of how agent execution is managed.
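The properties Khan lists (issued per step, short-lived, not transferable) can be sketched as a small issuer owned by the orchestration layer. The example below is a minimal in-memory illustration with hypothetical names such as `CredentialIssuer` and the `repo:read` scope; a production system would back this with a real secrets or token service.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Credential:
    token: str
    agent_id: str      # the only agent allowed to present this token
    step: str          # the workflow step the token was issued for
    scopes: frozenset  # the only actions the token permits
    expires_at: float


class CredentialIssuer:
    """Orchestrator-side issuer: agents never mint or extend their own access."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._active = {}

    def issue(self, agent_id, step, scopes, ttl_seconds):
        cred = Credential(
            token=secrets.token_urlsafe(16),
            agent_id=agent_id,
            step=step,
            scopes=frozenset(scopes),
            expires_at=self._clock() + ttl_seconds,
        )
        self._active[cred.token] = cred
        return cred

    def authorize(self, token, agent_id, scope):
        cred = self._active.get(token)
        if cred is None:
            return False
        if self._clock() >= cred.expires_at:
            del self._active[token]  # expired tokens are dropped, not renewed
            return False
        # A token passed to another agent, or used outside its scope, is rejected.
        return cred.agent_id == agent_id and scope in cred.scopes

    def revoke(self, token):
        self._active.pop(token, None)


issuer = CredentialIssuer()
cred = issuer.issue("agent-a", "run-tests", {"repo:read"}, ttl_seconds=30)
print(issuer.authorize(cred.token, "agent-a", "repo:read"))
print(issuer.authorize(cred.token, "agent-b", "repo:read"))
```

Because every check goes through `authorize`, a downstream agent that inherits the token string still cannot use it, and nothing depends on the agent choosing to limit itself.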
When the Architecture Fails
Albert Ziegler has spent years at the intersection of AI systems and enterprise-scale engineering, first as principal researcher at GitHub, where he built GitHub Copilot and GitHub Advanced Security, and now as head of AI at XBOW, where his team runs assessments that can involve 5,000 agents working in coordination. The failure mode he describes from that operational experience is specific and entirely preventable once you know what you are looking at.
Early agentic systems allowed any agent to write to shared memory and kick off other agents without validation, and that architecture produces cascade failures with a consistency that makes the pattern impossible to ignore. An agent that believes it has found a result will try to propagate it, and in an unconstrained system it will succeed regardless of whether that result is correct. The orchestration system’s job, as Ziegler frames it, is not to trust agent output. It is to validate every result before it enters any shared knowledge base, to control agent kickoff so that a malformed result cannot spawn a chain of downstream agents operating on false state and to enforce that every handoff is a sensible one before it proceeds.
“Whenever any agent tries to hand off to another agent, first there must be a check whether that is a sensible handoff,” Ziegler argues. “Agent kickoff needs to be tightly controlled, agent mission setting needs to be tightly controlled and as deterministically as possible and any ways that the agent can write to some shared memory need to be tightly controlled to keep results as factual as possible.”
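The controls Ziegler describes can be sketched as two gates that sit between agents: one on writes to shared memory and one on kickoff. The code below is an assumption-laden illustration, not XBOW's system; `SharedMemory`, `KickoffController`, the role names and the toy validator (which simply requires a `reproduced` flag in the evidence) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    agent_id: str
    claim: str
    evidence: dict


def has_reproduced_evidence(result: Result) -> bool:
    """Toy validator: accept a claim only if it was independently reproduced."""
    return result.evidence.get("reproduced") is True


class SharedMemory:
    """Agents never write directly; every write passes the validator first."""

    def __init__(self, validator):
        self._validator = validator
        self._facts = []
        self.rejected = []

    def submit(self, result: Result) -> bool:
        if self._validator(result):
            self._facts.append(result)
            return True
        self.rejected.append(result)
        return False

    def facts(self):
        return tuple(self._facts)


class KickoffController:
    """Starts a downstream agent only for allowed handoffs with validated results."""

    def __init__(self, memory: SharedMemory, allowed_handoffs):
        self._memory = memory
        self._allowed = set(allowed_handoffs)

    def request_kickoff(self, source_role, target_role, result: Result) -> bool:
        if (source_role, target_role) not in self._allowed:
            return False  # not a sensible handoff, so nothing is spawned
        return self._memory.submit(result)


memory = SharedMemory(has_reproduced_evidence)
controller = KickoffController(memory, {("scanner", "exploiter")})

confirmed = Result("scanner-1", "endpoint is injectable", {"reproduced": True})
unconfirmed = Result("scanner-2", "endpoint is injectable", {})

print(controller.request_kickoff("scanner", "exploiter", confirmed))
print(controller.request_kickoff("scanner", "exploiter", unconfirmed))
print(controller.request_kickoff("exploiter", "scanner", confirmed))
```

Only the first request succeeds. The unconfirmed claim is recorded as rejected and never enters shared memory, so it cannot seed a chain of downstream agents working from false state.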
Khan’s conclusion from the identity side mirrors Ziegler’s from the validation side. The governance question for multi-agent systems is not whether to impose these controls but how to build them into the architecture from the beginning rather than retrofitting them after a cascade failure makes their absence visible.
The teams building multi-agent systems that hold up in production have accepted a counterintuitive premise. More agents means more discipline, not less. Every new agent added to a workflow is another point where unvalidated state can enter the system, another decision boundary where the wrong routing choice can compound and another identity that needs to be bounded and audited. The architectural work of controlling those boundaries is unglamorous and it does not show up in benchmark results. But it is the difference between a multi-agent system that impresses in a demo and one that is still running reliably six months after deployment, and that difference is entirely architectural.

