Google Research Study: Scaling Multi-Agent Systems is a Strange Science

A study from Google suggests that adding more AI agents does not always produce a better overall agentic deployment or end-user experience. An 18-strong team from Google Research and Google DeepMind carried out the study to address the fact that, despite widespread adoption and growing acceptance, most analysis of agents today remains focused on capabilities, usability and creative suitability to enterprise use cases.

The question of multi-agent scalability remains less widely analyzed.

Because the principles that determine agentic performance in multi-agent environments remain underexplored, the suggestion here is that software engineers and other practitioners working with these technologies have to rely on heuristics (or mental rules of thumb), rather than principled design choices.

The researchers ran their evaluation across four diverse benchmarks: Finance-Agent (as it sounds, agentic finance functions), BrowseComp-Plus (a human-curated evaluation benchmark for AI “deep-research” or “search” agents, designed to provide a transparent and reproducible way to test those agents), PlanCraft (described as spatiotemporal planning under constraints and actually related to Minecraft, see below) and WorkBench (common business activities).

“Using five canonical architectures (Single, Independent, Centralized, Decentralized, Hybrid) instantiated across three LLM families, we performed a controlled evaluation spanning 180 configurations with standardized tools and token budgets,” detailed the Google Research team.

From WorkBench to Minecraft

BrowseComp-Plus contains 100 web browsing tasks requiring multi-website information synthesis. Tasks include comparative analysis, fact verification and comprehensive research across multiple web sources.

WorkBench evaluates business task automation through function calling sequences. The dataset covers five domains: analytics, calendar management, email operations, project management and customer relationship management. Success in this test requires executing correct tool sequences to accomplish realistic business workflows.

PlanCraft focuses on sequential planning in Minecraft environments. Agents must craft target items by determining optimal action sequences using available inventory and crafting recipes. Tasks require multi-step reasoning about dependencies, resource management and action ordering.
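To make the idea of dependency-aware action ordering concrete, here is a minimal Python sketch. The recipe table and the crafting_order helper are hypothetical simplifications written for this article. They are not taken from the PlanCraft dataset, and they ignore quantities and inventory limits.

```python
from graphlib import TopologicalSorter

# Hypothetical, simplified recipes: item -> ingredients it depends on.
# Illustrative only; quantities and inventory limits are ignored.
RECIPES = {
    "planks": {"log"},
    "stick": {"planks"},
    "wooden_pickaxe": {"planks", "stick"},
}


def crafting_order(target, recipes):
    """Return items ordered so every ingredient comes before its product."""
    needed = {}
    stack = [target]
    while stack:
        item = stack.pop()
        if item in needed:
            continue
        # Raw materials have no recipe, so they get an empty dependency set.
        needed[item] = set(recipes.get(item, set()))
        stack.extend(needed[item])
    return list(TopologicalSorter(needed).static_order())


order = crafting_order("wooden_pickaxe", RECIPES)
print(order)
```

In this toy case only one valid ordering exists: log, planks, stick, then wooden_pickaxe. Real tasks in the benchmark also involve resource counts, which this sketch leaves out.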

To determine when multi-agent coordination provides benefit, the researchers first established which task categories require agentic capabilities. A critical prerequisite is distinguishing between agentic and non-agentic evaluation paradigms. So what were they looking out for?

Overhead, Error Amplification & Redundancy

The researchers derived a predictive model using empirical coordination metrics, including efficiency, overhead, error amplification and redundancy.

“Under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead,” write the team. “Once single-agent baselines exceed [a certain percentage level], topology-dependent error amplification [occurs] and independent agents amplify errors 17.2x through unchecked propagation, while centralized coordination contains this to 4.4x.”
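To give a feel for the gap between those two amplification figures, the following Python sketch does some back-of-the-envelope arithmetic. It assumes, purely for illustration, that an amplification factor multiplies a single-agent error rate. The study's exact definition may differ, and the 2% base error rate is a hypothetical value, not a result from the research.

```python
# Amplification factors quoted by the research team.
INDEPENDENT_FACTOR = 17.2
CENTRALIZED_FACTOR = 4.4


def amplified_error_rate(base_error_rate: float, factor: float) -> float:
    """Scale a base error rate by an amplification factor, capped at 100%.

    Assumption: the factor multiplies a single-agent error rate.
    The study's own definition may differ.
    """
    return min(1.0, base_error_rate * factor)


# Hypothetical single-agent error rate, chosen only for illustration.
base = 0.02

independent = amplified_error_rate(base, INDEPENDENT_FACTOR)
centralized = amplified_error_rate(base, CENTRALIZED_FACTOR)

# How much more the independent topology amplifies than the centralized one.
relative_gap = INDEPENDENT_FACTOR / CENTRALIZED_FACTOR

print(f"independent: {independent:.1%}, centralized: {centralized:.1%}")
print(f"independent amplifies about {relative_gap:.1f}x more than centralized")
```

Whatever the base rate, the ratio between the two quoted factors is roughly 3.9, which is the part of this sketch that comes directly from the figures above.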

The study, which is written in deeply technical language throughout, found that centralized coordination improves performance by 80.9% on parallelizable tasks like financial reasoning, while decentralized coordination excels on dynamic web navigation. Yet for sequential reasoning tasks, all multi-agent variants degraded performance by 39-70%.

“Our findings suggest that multi-agent benefits depend critically on task structure rather than team size alone. Effective system design requires matching coordination topology to problem characteristics, rather than assuming uniform benefits from scaling agent count,” conclude the team.
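As a rough illustration of what matching topology to task structure could look like in practice, here is a small Python sketch that encodes the headline findings reported above as a lookup. The TaskStructure enum and the suggest_topology function are hypothetical names invented for this example. This is a rule of thumb drawn from the article's summary, not the predictive model the researchers derived.

```python
from enum import Enum


class TaskStructure(Enum):
    """Hypothetical task categories based on the findings summarized above."""

    PARALLELIZABLE = "parallelizable"          # e.g. financial reasoning
    DYNAMIC_NAVIGATION = "dynamic_navigation"  # e.g. dynamic web navigation
    SEQUENTIAL = "sequential"                  # e.g. step-by-step planning


def suggest_topology(task: TaskStructure) -> str:
    """Return a starting-point topology for a given task structure.

    This mirrors only the reported headline results; it is not the
    researchers' predictive model.
    """
    if task is TaskStructure.PARALLELIZABLE:
        return "centralized"    # reported gains on parallelizable tasks
    if task is TaskStructure.DYNAMIC_NAVIGATION:
        return "decentralized"  # reported to excel on web navigation
    # All multi-agent variants degraded sequential reasoning performance.
    return "single"


for task in TaskStructure:
    print(f"{task.value}: {suggest_topology(task)}")
```

A real system would need measured coordination metrics such as overhead and redundancy rather than a three-way label, but the lookup captures the core message that agent count alone is not the deciding factor.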
