Science · Research foundation

The literature behind context orchestration.

In the last eighteen months, a discipline has named itself. Context engineering — the curation of what a model sees before every step — is now where the operational reliability of agents is decided. This page is a curated reading list across four themes, plus the gap our own work addresses.

01 / The discipline forms

Prompt engineering became context engineering.

Between June and September 2025, the field that practitioners had been quietly running converged on a name. Shopify CEO Tobi Lütke argued on 18 June 2025 that “context engineering” is the more accurate description of the actual skill; Andrej Karpathy's “Context Engineering” post a week later (25 June 2025) introduced the term to a general audience; and Anthropic's Applied AI team formalized it on 29 September 2025 as “the set of strategies for curating and maintaining the optimal set of tokens during inference.”

LangChain's June 2025 article, “Context Engineering for Agents,” crystallized the operating taxonomy practitioners now share: write, select, compress, isolate. Anthropic separately codified four canonical patterns (compaction, structured notes, sub-agent isolation, just-in-time retrieval) and explicitly identified context rot as an architectural limit that persists even at million-token windows.
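
The four operations in that taxonomy can be sketched in a few lines. This is a minimal illustration under stated assumptions, not LangChain's implementation: `ContextStore`, `compress`, and `isolate` are hypothetical names, selection is naive word overlap rather than embedding retrieval, and compression is plain truncation rather than summarization.

```python
from dataclasses import dataclass, field


@dataclass
class ContextStore:
    """Hypothetical scratchpad kept outside the model's context window."""
    notes: list[str] = field(default_factory=list)

    def write(self, note: str) -> None:
        # write: persist information outside the window for later steps
        self.notes.append(note)

    def select(self, query: str, k: int = 2) -> list[str]:
        # select: return up to k notes that share at least one word with the query
        terms = set(query.lower().split())
        scored = [(len(terms & set(n.lower().split())), n) for n in self.notes]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [note for score, note in ranked[:k] if score > 0]


def compress(items: list[str], max_words: int) -> list[str]:
    # compress: keep only the first max_words words of each item
    return [" ".join(item.split()[:max_words]) for item in items]


def isolate(tasks: dict[str, list[str]]) -> dict[str, list[str]]:
    # isolate: give each sub-agent its own copy of only its own context
    return {agent: list(ctx) for agent, ctx in tasks.items()}


store = ContextStore()
store.write("user prefers pytest for tests")
store.write("deploy target is staging cluster")
store.write("tests live in the tests folder")

selected = store.select("run the tests", k=2)
window = compress(selected, max_words=3)
per_agent = isolate({"tester": window, "deployer": ["deploy target is staging cluster"]})
```

In this toy run the deployment note never reaches the tester's window, which is the point of the taxonomy: each step decides what the model sees, not only what it is told.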

The result is shared vocabulary. The decisions that determine whether an agent works in production are no longer scattered across prompt engineering, RAG plumbing, and memory hacks — they live inside one named discipline.

02 / Where it breaks at scale

The wall is in the context, not the model.

Chroma's technical report (14 July 2025) tested 18 frontier models, including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3, and found that every one degrades unevenly as input grows, even at modest token counts and even on trivially simple tasks. Performance was higher on shuffled inputs than on logically connected documents. The report named the phenomenon context rot.

The economic shadow shows up in MIT Project NANDA's July 2025 report, The GenAI Divide. Despite $30–40 billion in enterprise GenAI spend, 95% of organizations report no measurable P&L impact. The diagnosis is direct: the systems forget context, don't learn, and can't evolve.

Why do they fail? The Multi-Agent System Failure Taxonomy (MAST, October 2025) is the most rigorous answer to date. Built on 1,642 annotated execution traces from seven popular multi-agent frameworks (annotator agreement κ = 0.88), MAST identifies 14 failure modes across three categories: specification/system design (44.2%), inter-agent misalignment (32.3%), task verification (23.5%). The conclusion: failures are problems of coordination and context, not raw intelligence.
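
For readers unfamiliar with the agreement statistic: assuming the κ reported here is Cohen's kappa for two annotators (an assumption, since this page does not say which variant was used), it compares observed agreement with the agreement expected by chance. A minimal sketch with made-up labels, not MAST data:

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only; the category names are shorthand, not MAST's codes.
annotator_1 = ["spec", "spec", "align", "verify"]
annotator_2 = ["spec", "align", "align", "verify"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

A value of 0 means agreement no better than chance and 1 means perfect agreement, so 0.88 indicates the annotators applied the taxonomy very consistently.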

Cognition's contrarian counterpoint, “Don't Build Multi-Agents” (12 June 2025), reaches the same diagnosis from the opposite direction: the root of multi-agent fragility is fragmented context, not weak models. Agents must share full traces, not isolated messages.

03 / The agent is mostly infrastructure

Agent = Model + Harness.

Mitchell Hashimoto's formula — Agent = Model + Harness — has become the operational frame for 2026. The empirical case comes from a reverse-engineering of Claude Code (v2.1.88), which found that 98.4% of an industrial agent's codebase is operational infrastructure: security, context management, memory, tools. Only a sliver is the decision logic the model itself runs.

The same study surfaced an architectural side-effect that matters here: an oversight paradox. Bounded context windows and compression losses lead the model to take locally optimal actions without full understanding of the codebase. Users approve roughly 93% of permission requests — the approval-fatigue pattern — while external research found AI-assisted development raises code complexity by 40.7% and erodes the developer's own model of the project.

The capability ramp is real. Crashing Waves vs. Rising Tides (April 2026) reports that frontier systems already achieve 50% success on tasks that take humans 3–4 hours. But the slope of the success curve is shallow, so practical impact depends less on further reasoning gains than on the systems engineering needed to land them in real corporate workflows. The bottleneck is the last-mile harness, not the model.

04 / Small models, on device

The model that runs the orchestration shouldn't be the largest one.

NVIDIA's June 2025 position paper — “Small Language Models are the Future of Agentic AI” — argues that small language models (working definition: under 10B parameters, executable on user devices) are sufficient, more economical, and more architecturally appropriate for the majority of agent calls. The paper includes an LLM-to-SLM agent-conversion algorithm and case studies on popular open agents.

Empirical reinforcement: the Memory Intelligence Agent (MIA) work showed that strategic abstraction can erase quality gaps between model sizes — a 7B-parameter executor outperformed a 32B variant by 18%. MIA compresses raw interaction traces into high-level workflow summaries and dynamically updates a planner agent via alternating reinforcement learning. The lesson: learning the process of solving a task is more computationally efficient than enlarging the context window or scaling parameters.

Two more references close out the small-model toolkit. LLMLingua-2 (March 2024) is the canonical reference on task-agnostic prompt compression via data distillation. TabDistill (April 2025) shows how to distill feature interactions out of foundation models into compact, interpretable GAMs without losing transparency — directly relevant when the on-device model needs to remain auditable.

Synstate's design choice — a 3B-parameter orchestration model running entirely on-device — sits in this lineage. The orchestration layer is not the place to spend frontier-scale compute.

05 / The missing input

Context engineering still doesn't read the situation.

Across the literature above, the candidate context an agent assembles is treated as a function of what is available: retrieval, memory, tools, history. Neither LangChain's write / select / compress / isolate taxonomy nor Anthropic's four patterns has an input for who the user is, where they are in the task, or how much load they are carrying.

This is the next layer to be named. The same request, “help me fix this bug,” deserves opposite responses depending on whether it comes from a user in deep flow or from one who has been stuck for 45 minutes, and today's agents cannot distinguish the two. The signal exists and is measurable. It is just not yet in the context window, and the harness around the model never sees it either.
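
To make the idea concrete, here is a minimal sketch of situation as an explicit input to context assembly. Everything in it is hypothetical: the `Situation` fields, the 30-minute threshold, and the `ContextPlan` values are invented for illustration and do not describe how Qualia-1 works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Situation:
    """Hypothetical interaction-derived signals (assumed, not a real schema)."""
    minutes_on_task: float          # time since the user started this task
    minutes_since_progress: float   # time since the last visible progress


@dataclass(frozen=True)
class ContextPlan:
    request: str          # the user's request, passed through unchanged
    max_snippets: int     # how many retrieved snippets to include
    explain_steps: bool   # whether to ask for a step-by-step walk-through


def plan_context(request: str, situation: Situation) -> ContextPlan:
    # Illustrative threshold only: 30+ minutes without progress counts as stuck.
    stuck = situation.minutes_since_progress >= 30
    if stuck:
        # A stuck user gets broader context and an explained answer.
        return ContextPlan(request, max_snippets=8, explain_steps=True)
    # A user making steady progress gets a narrow, terse answer.
    return ContextPlan(request, max_snippets=2, explain_steps=False)


request = "help me fix this bug"
in_flow = plan_context(request, Situation(minutes_on_task=90, minutes_since_progress=2))
stuck = plan_context(request, Situation(minutes_on_task=50, minutes_since_progress=45))
```

The request string is identical in both calls; only the situation differs, and that alone changes what would be assembled for the model.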

That gap, situation as a missing input to context engineering, is what Synstate Labs works on. Qualia-1 is a context orchestration model that reads the live situation around a request and uses it to reconcile both halves of the agent: the context, and the harness around it. The model is small. It runs on-device. The signals are behavioral and interaction-derived, not biometric, which is also what keeps it on the right side of the EU AI Act's workplace provisions.

06 / Bibliography

The reading list.

A curated selection from our internal tracking. Grouped by theme, in reading order. Reach out if you want extensions on a specific track — economics, predictions, or otherwise.

Context management

Context management · April 2026

Memory Intelligence Agent (MIA)

An empirical demonstration that smart memory management and strategic abstraction can erase quality gaps between small and large models — a 7B-parameter executor beat a 32B variant by 18%. MIA compresses raw interaction traces into high-level workflow summaries, then updates a planner agent via alternating RL. Key takeaway: learning the process of solving a task is more compute-efficient than scaling context or parameters.

Agents fundamentals · 14 July 2025

Context Rot: How Increasing Input Tokens Impacts LLM Performance

Chroma's technical report. Tested 18 frontier models (including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3) and found uneven degradation as input grows, even at modest token counts and on trivially simple tasks. Models performed better on shuffled inputs than on logically connected documents.

Agents fundamentals · 29 September 2025

Effective Context Engineering for AI Agents — Anthropic Applied AI

Anthropic formally redefines prompt engineering as a subset of context engineering — strategies for curating and maintaining the optimal token set at inference. Codifies four canonical patterns: compaction, structured notes, sub-agent isolation, just-in-time retrieval. Names context rot as an architectural limit even at million-token windows.

Agents fundamentals · 25 June 2025

Context Engineering for Agents — LangChain

Crystallizes the operating taxonomy practitioners use — write, select, compress, isolate — with worked examples from Anthropic's multi-agent researcher and the Windsurf code-retrieval system. LangChain's framing is the de facto shared vocabulary across the industry.

Agents fundamentals · 19 March 2024

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

The canonical reference on prompt compression. Data distillation produces an efficient, faithful, task-agnostic compressor of input context.

Foundational principles of agent building

Agents fundamentals · 2026

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems

Reverse-engineering of Anthropic's Claude Code (v2.1.88). Headline finding: 98.4% of the codebase is operational infrastructure (security, context, memory), not decision logic. As model capabilities converge, the deterministic harness becomes the competitive surface. The paper documents an oversight paradox (locally-optimal actions from compressed context), approval fatigue (≈93% approval rate on permission requests), and external evidence of a 40.7% rise in code complexity under AI-assisted development.

Agents fundamentals · 26 October 2025

Why Do Multi-Agent LLM Systems Fail? (MAST)

The most rigorous empirical map of agent-system failure modes. 1,642 annotated execution traces from seven popular multi-agent frameworks (annotator agreement κ = 0.88). 14 failure modes across three categories: specification/system design (44.2%), inter-agent misalignment (32.3%), task verification (23.5%). Failures are coordination and context problems, not raw-intelligence problems.

Agents fundamentals · 12 June 2025

Don't Build Multi-Agents — Cognition (Walden Yan)

The contrarian principle paper. Argues agents must share full context — including full agent traces, not just isolated messages — and that the root of multi-agent fragility is fragmented context, not weak models. Published one day before Anthropic's competing multi-agent post, framing the public debate that ran through 2025–2026.

Agents fundamentals · 25 June 2025

Andrej Karpathy on “Context Engineering” — X/Twitter

The post that put the term into general circulation. Karpathy describes context engineering as the delicate art and science of filling the context window with exactly the information the next step requires. Paired with Shopify CEO Tobi Lütke's 18 June 2025 post arguing the same name change.

Agents fundamentals · 1 April 2026

Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Worker Evaluations of Labor Market Tasks

Frontier systems already achieve 50% success on tasks that take humans 3–4 hours. But the slope of the success curve flattens, so the path to reliable economic impact is longer than capability alone implies. The barrier is systems engineering — the last mile of putting a model into a real workflow.

Agents fundamentals · July 2025

The GenAI Divide: State of AI in Business 2025 — MIT Project NANDA

The report behind the viral “$30–40B / 95%” statistic. Despite $30–40 billion in enterprise GenAI spend, 95% of organizations report no measurable P&L impact. Diagnosis: the systems forget context, don't learn, and can't evolve.

Compact on-device models

Local models · 23 April 2025

Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models

TabDistill. Transfer of knowledge from large foundation models into compact, interpretable GAMs by distilling feature interactions out of the larger model — without losing the small model's transparency.

Agents fundamentals · 2 June 2025

Small Language Models are the Future of Agentic AI — NVIDIA

Position paper arguing that small language models — under 10B parameters, executable on user devices — are sufficient and more architecturally appropriate for the majority of agent calls. Includes an LLM-to-SLM agent-conversion algorithm and case studies on popular open agents.