As tech giants race to pitch artificial intelligence (AI) agents as the future of autonomous labor, a sobering new study from Microsoft Research suggests these digital assistants may need a much tighter leash.
According to researchers, even the most advanced frontier models tend to “corrupt” documents and lose critical data when tasked with long, multi-step workflows.
The study, titled “LLMs Corrupt Your Documents When You Delegate,” arrives at a pivotal moment. Companies like Anthropic and Microsoft Corp. have heavily marketed the ability of AI to handle complex research and autonomous task execution. However, Microsoft scientists Philippe Laban, Tobias Schnabel, and Jennifer Neville found that delegating work to AI often results in a “catastrophic” loss of information.
To test the reliability of these systems, the team developed DELEGATE-52, a benchmark simulating 52 professional domains ranging from crystallography to accounting. The results were startling: frontier models, including GPT-5.4, Gemini 3.1 Pro, and Claude 4.6 Opus, lost an average of 25% of document content over the course of 20 interactions. Across all models tested, the average degradation reached a staggering 50%.
The researchers set a “readiness” bar at 98% accuracy. Of the 52 domains, only Python programming met the standard. In 80% of the simulated conditions, the models severely corrupted documents.
“Our findings show that current LLMs introduce substantial errors when editing work documents,” the authors reported. The type of failure varied with the capability of the model. Weaker models tended to delete content entirely, while more advanced frontier models suffered from content corruption. What is more, these errors rarely happened gradually. Instead, they occurred in “catastrophic” bursts, where a single interaction could wipe out 10% to 30% of a document’s integrity.
One of the study’s most discouraging revelations involves “agentic harnesses,” which give AI access to file systems and code execution tools to help it work autonomously. Rather than improving accuracy, these tools made performance worse, leading to an additional 6% degradation on average.
“Microsoft’s DELEGATE-52 findings sharpen what observability practitioners have warned about: frontier models silently corrupt content across long, delegated workflows, and agentic tooling, retrieval, and planning layers do not close the gap,” said Mitch Ashley, vice president and practice lead of Software Lifecycle Engineering at The Futurum Group. “The ceiling on agent autonomy is set by what enterprises can see and reconstruct step by step.”
“Procurement teams should require per-step evidence trails as a baseline,” Ashley said. “Vendors shipping agentic workflows without workflow-level diffs, content provenance, and intervention points are asking buyers to trust outputs they cannot audit. The autonomy gap closes only when evidence catches up.”
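As a rough illustration of what a per-step evidence trail with workflow-level diffs could look like, the Python sketch below compares each document version an agent produces against the original and flags any step where retained content drops sharply. It is not the study's methodology or any vendor's tooling: the function names `retention` and `audit_steps` are hypothetical, the line-based retention measure is a simplification, and the 10% threshold is an assumption chosen to echo the lower end of the burst sizes reported above.

```python
import difflib


def retention(original: str, current: str) -> float:
    """Fraction of the original's lines that still appear, in order, in current.

    Hypothetical helper: a crude line-level proxy for document integrity.
    """
    orig_lines = original.splitlines()
    if not orig_lines:
        return 1.0
    matcher = difflib.SequenceMatcher(
        a=orig_lines, b=current.splitlines(), autojunk=False
    )
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return kept / len(orig_lines)


def audit_steps(original: str, versions: list[str], max_drop: float = 0.10) -> list[dict]:
    """Build one audit record per agent step.

    A step is flagged when retention falls by more than max_drop relative to
    the previous step (assumed threshold; tune for the workflow at hand).
    """
    trail = []
    previous = 1.0
    for step, text in enumerate(versions, start=1):
        score = retention(original, text)
        trail.append(
            {
                "step": step,
                "retention": score,
                "flagged": (previous - score) > max_drop,
            }
        )
        previous = score
    return trail


if __name__ == "__main__":
    doc = "\n".join(f"line {i}" for i in range(10))
    after_step_1 = doc  # agent left the document intact
    after_step_2 = "\n".join(f"line {i}" for i in range(7))  # last three lines lost
    for record in audit_steps(doc, [after_step_1, after_step_2]):
        print(record)
```

A flagged record marks a natural intervention point, where a workflow might pause for human review rather than let the agent continue. A production check would also need to catch corrupted content, not just missing lines, which this line-matching proxy only partly captures.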
The implications for the corporate world are significant. While Deloitte reports that organizations are now spending an average of 36% of their digital budgets on AI automation, Microsoft’s researchers warn that the technology isn’t yet capable of “set-it-and-forget-it” delegation. An intern who destroyed a quarter of a company’s data during a project would likely be fired, yet businesses are currently betting billions on software that does exactly that.
There is, however, a silver lining. The researchers noted that models are improving rapidly. OpenAI’s GPT family saw benchmark performance jump from 14.7% to 71.5% over just 16 months.
For now, the message from Redmond, Wash., is clear: if you’re delegating complex workflows to an AI agent, you had better stay in the loop. Until these models can prove they won’t shred the digital paperwork, human oversight remains the only safeguard against autonomous errors.

