Chroma's technical report tested 18 frontier models, including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3, and found that every one degrades unevenly as input grows, even at modest token counts and even on trivially simple tasks. Counterintuitively, performance was higher on shuffled inputs than on logically connected documents. The report named the phenomenon context rot.
The economic shadow shows up in MIT Project NANDA's July 2025 report, “The GenAI Divide.” Despite $30–40 billion in enterprise GenAI spend, 95% of organizations report no measurable P&L impact. The diagnosis is direct: the systems forget context, don't learn, and can't evolve.
Why do they fail? The Multi-Agent System Failure Taxonomy (MAST, October 2025) is the most rigorous answer to date. Built on 1,642 annotated execution traces from seven popular multi-agent frameworks (inter-annotator agreement κ = 0.88), MAST identifies 14 failure modes across three categories: specification and system design (44.2%), inter-agent misalignment (32.3%), and task verification (23.5%). The conclusion: failures are problems of coordination and context, not raw intelligence.
Cognition's contrarian counterpoint, “Don't Build Multi-Agents” (12 June 2025), reaches the same diagnosis from the opposite direction: the root of multi-agent fragility is fragmented context, not weak models. Agents must share full traces, not isolated messages.
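To make the difference between sharing full traces and passing isolated messages concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Cognition's implementation: the names `TraceEvent`, `SharedTrace`, `isolated_handoff`, and `full_trace_handoff` are hypothetical, and the "context" is just a string that a downstream agent would receive.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceEvent:
    """One step in an agent run (hypothetical structure for illustration)."""
    agent: str
    kind: str      # e.g. "decision", "action", "result"
    content: str


@dataclass
class SharedTrace:
    """Append-only log that every agent reads from and writes to."""
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, agent: str, kind: str, content: str) -> None:
        self.events.append(TraceEvent(agent, kind, content))

    def render(self) -> str:
        # Serialize every event in order so nothing upstream is dropped.
        return "\n".join(
            f"[{e.agent}/{e.kind}] {e.content}" for e in self.events
        )


def isolated_handoff(trace: SharedTrace) -> str:
    """Fragmented context: the next agent sees only the last message."""
    return trace.events[-1].content if trace.events else ""


def full_trace_handoff(trace: SharedTrace) -> str:
    """Shared context: the next agent sees the whole trace."""
    return trace.render()


# Example run: a planner makes a decision that the final message omits.
trace = SharedTrace()
trace.record("planner", "decision", "Use metric units throughout")
trace.record("planner", "action", "Delegate report drafting")
trace.record("planner", "result", "Draft the summary section")

isolated_context = isolated_handoff(trace)
full_context = full_trace_handoff(trace)
```

In this sketch the downstream agent given `isolated_context` never learns about the units decision, while the one given `full_context` does. A real system would also need to bound or compress the trace as it grows, which this example deliberately leaves out.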