The Context Window: Why Your AI Forgets, Drifts, and Contradicts Itself

By Michael Martin

AI · Context Engineering · LLMs · Productivity

A practical playbook for ChatGPT, Claude, Gemini, and Grok.

TL;DR

  • Frontier AI models in May 2026 advertise 1M+ token context windows. Peer-reviewed research finds they only use about 10 to 20% of those windows effectively.
  • The context window is not the model's memory. It's a long text document the vendor concatenates from seven different sources, only some of which you can see or control.
  • Five failure modes account for most of the "the AI is broken" complaints I hear: stale fact contradiction, instruction drift, lost-in-the-middle, tool variance, and cross-session reset. All five have specific fixes.
  • The discipline of managing this is called context engineering, named in June 2025 by Tobi Lütke and amplified by Andrej Karpathy. You are not making up a framework. You are practicing one the industry already named.
  • Most of your leverage is architectural: files on disk, Projects, context packs, handoff templates, subagents, and specialized search tools. Vendor features matter less than the playbook you bring to them.

If you want one thing to take home, scroll to What to do Monday morning at the bottom.


The wall

Three sentences I've heard this month, from three different people, in three different conversations:

"The AI forgot what we were doing."

"Yesterday's answer contradicts today's."

"It was great for ten turns then went sideways."

If you've used ChatGPT, Claude, Gemini, or Grok for real work, you've hit some version of this. The culprit is almost always the context window, not the model. The frontier models in May 2026 are not stupid. They are working with a constrained, opaque, vendor-managed text buffer that you can only partially see and rarely control.

This article is the long-form version of a talk I give on the topic. No theory: the mechanism, the failure modes, the fixes, and the citations so you can verify everything. It's also a reference. Pin it, share it with your team, and come back to specific sections when you need them.

Section 1: What a context window actually is

Let's define our terms.

A token is the unit a model's tokenizer breaks your text into before the model sees it. Not a word. Not a character. A learned sub-word chunk, chosen by an algorithm during the model's training.

A context window is the maximum number of tokens a model can consider in a single inference call. It's the model's working memory for one turn. Everything the model "sees" right now: system prompt, files, prior turns, tool output, your latest message. If it is not in the window, it does not exist to the model.

Tokens are not words

The sentence "Claude is built by Anthropic." is seven tokens in OpenAI's current tokenizer (o200k_base, used across the GPT-4o, GPT-4.1, and GPT-5 families):

Claude / is / built / by / Anth / ropic / .

The model never sees the word "Anthropic." It sees two chunks. The tokenizer split it because "Anthropic" was rare enough in the training corpus relative to its sub-words.

This matters because:

  • Common English words usually cost one token.
  • Rare words, proper nouns, code, non-English text, and emoji often cost several.
  • Each vendor uses a different tokenizer, which means the same sentence becomes a different number of tokens depending on whose ruler you measure with.

Six strings, four tokenizers:

| Sample text | OpenAI (o200k) | Claude (approx*) | Gemini (1.5+) | Grok-1 |
|---|---|---|---|---|
| "The quick brown fox jumps over the lazy dog." | 10 | ~10 | 10 | 10 |
| "Claude is built by Anthropic." | 7 | ~9 | 7 | 7 |
| 50-word business sentence | 42 | ~42 | 45 | 45 |
| "El veloz zorro marrón salta sobre el perro perezoso." | 16 | ~19 | 11 | 17 |
| for i in range(10): print(i) | 10 | ~11 | 12 | 12 |
| 👋🏼🚀✨ (3 emoji) | 6 | ~9 | 4 | 13 |

* The Claude column uses community-reverse-engineered Claude 3-era estimates (Xenova/claude-tokenizer). Anthropic shipped a new tokenizer with Opus 4.7 in November 2025 that reports roughly 1.0 to 1.35x more tokens for English and 0.65 to 0.80x fewer tokens for non-Latin scripts. For exact counts on production work, hit POST /v1/messages/count_tokens against the model you actually use.

Three takeaways:

  1. Plain English barely differs. All four agree on the quick brown fox.
  2. Spanish, code, and emoji vary dramatically. Spanish: 11 (Gemini) to 19 (Claude approx). Emoji: 4 (Gemini) to 13 (Grok-1).
  3. A "200K window" is not the same across vendors. Different rulers measure the same text differently.

How to verify per model

If you want to know what your token bill is actually going to look like:

OpenAI is the only one that lets you inspect every token. Anthropic gives you a count. Google gives you both. xAI gives you a count and a playground if you're logged in.
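
If you want to script that check, here is a minimal sketch: tiktoken for OpenAI's local inspection, plus the count_tokens endpoint mentioned above for Anthropic. The API key and model name are placeholders; swap in whatever you actually run.

import tiktoken
import requests

# Inspect every token locally (OpenAI's o200k_base, per the table above)
enc = tiktoken.get_encoding("o200k_base")
tokens = enc.encode("Claude is built by Anthropic.")
print(len(tokens))                        # 7
print([enc.decode([t]) for t in tokens])  # chunks like 'Anth' + 'ropic'

# Get a count from Anthropic (no local tokenizer; the API counts for you)
resp = requests.post(
    "https://api.anthropic.com/v1/messages/count_tokens",
    headers={
        "x-api-key": "YOUR_ANTHROPIC_KEY",   # placeholder
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
    },
    json={
        "model": "claude-sonnet-4-5",        # placeholder: the model you actually use
        "messages": [{"role": "user", "content": "Claude is built by Anthropic."}],
    },
)
print(resp.json())  # {"input_tokens": ...}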

Section 2: The context window IS a long text file

Here's the mental model that unlocks everything else.

When you send a message to an AI, the vendor concatenates a stack of inputs into one long text document and sends it to the model. The model reads top to bottom. There is no "memory" separate from text. There is no "files" separate from text. Everything the model sees is part of the same big document.

Seven things get concatenated:

| Layer | What it actually is | Where it comes from | Can you see it? |
|---|---|---|---|
| System prompt | Vendor-written rules, persona, format guards | Vendor (or you in API/Projects) | No in chat. Yes in API. |
| Persistent memory | Running log of facts about you | Auto-generated by the AI | Partial. ChatGPT lets you view; Claude shows what's saved. |
| Project / workspace | Scoped folder + file library | You set it up | Yes. |
| Attached files | Documents you uploaded | You | Yes, but not always how much. |
| Conversation history | Every prior turn in this chat | You and the model | Mostly yes. Compression is invisible. |
| Tool definitions + results | Schemas and outputs from tools | Platform + tools you enabled | No. UI hides raw schemas. |
| Your latest message | What you just typed | You | Yes. The one part you definitely control. |

Most users assume they're "talking to the AI." They're actually appending one line to a document the vendor wrote most of.

The tool overhead trap

Common misconception: "I've added a dozen tools to my AI agent. It'll just use the ones it needs."

Reality:

  • Every tool definition is text. It loads into the context window on every turn.
  • The model is stateless. It re-reads the full tool list every time.
  • A simple tool definition is ~100 tokens. A complex one with multiple parameters is ~500.
  • Ten tools at 250-token average is ~2,500 tokens of overhead before you type your message.
  • Tool results also stay in context. A web search returning five pages can drop 3,000 to 5,000 tokens into the window that persist for the rest of the chat.

Concrete example: a Claude agent with 15 MCP tools enabled (~3,000 tokens of definitions), a 50-turn conversation (~10,000 tokens), and a 5-page PDF in the Project (~5,000 tokens) starts every turn with roughly 18,000 tokens of overhead before you've typed anything. About 9% of a 200K window, gone.
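
You can put a number on this yourself. A rough sketch, using OpenAI's tokenizer as the ruler (vendor counts differ, per Section 1); the tool schemas are hypothetical:

import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# Hypothetical tool schemas, shaped like typical function definitions
tools = [
    {"name": "web_search",
     "description": "Search the web and return the top results.",
     "parameters": {"query": {"type": "string"}}},
    {"name": "read_file",
     "description": "Read a file from the workspace by path.",
     "parameters": {"path": {"type": "string"}}},
]

# Every definition is text, and it loads on every single turn
overhead = sum(len(enc.encode(json.dumps(t))) for t in tools)
print(f"~{overhead} tokens of tool overhead before you type anything")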

The vendors have responded with prompt caching. The 2026 picture:

  • OpenAI: 90% discount on cache hits, automatic, TTL up to 24 hours.
  • Anthropic: up to 90% cost reduction, developer-controlled via cache_control breakpoints.
  • Gemini: 90% on Gemini 2.5+ and 3.x (75% was the 2.0 number).

All three converged on ~90% discounts. The real differentiator is the control model:

  • Automatic (OpenAI): you don't have to think about it.
  • Explicit (Anthropic): you mark cache breakpoints by hand and get exactly what you ask for.
  • Implicit (Gemini): the system caches as it sees fit.

Pick the one whose tradeoff matches your team's discipline. Don't add tools without thinking about the overhead.
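
To make the Anthropic option concrete, here is a minimal sketch of an explicit cache breakpoint, assuming the anthropic Python SDK. The model name and system prompt are placeholders:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

standing_rules = "Your long, stable system prompt and tool docs go here."  # placeholder

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder: use your production model
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": standing_rules,
        "cache_control": {"type": "ephemeral"},  # cache everything up to this breakpoint
    }],
    messages=[{"role": "user", "content": "First question of the session."}],
)

# The usage object tells you whether the prefix was written to or read from cache
print(response.usage.cache_creation_input_tokens,
      response.usage.cache_read_input_tokens)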

The Chroma callout

A 2025 study from Chroma tested 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5. They found something strange.

"Across all 18 models and needle-haystack configurations, we observe a consistent pattern that models perform better on shuffled haystacks than on logically structured ones."

Read that twice. When the text in the model's context is organized and coherent, the model does worse than when the same text is randomly scrambled. Researchers do not fully understand why. What they DO know: how the text in the window is structured matters as much as what is in it.

Source: Kelly Hong, Anton Troynikov, Jeff Huber. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report, July 14, 2025. research.trychroma.com/context-rot

Section 3: Compaction (how the window manages itself)

When your conversation outgrows the context window, the AI tool summarizes or trims older turns to make room. This is called compaction.

Compaction is lossy. The summarizer keeps recent, frequent, and relevant turns. It tends to drop project conventions, early decisions, and the "why" behind choices. Your job is twofold: trigger it on your terms, and tell it what must survive.

The cardinal rule: manual compaction at breakpoints beats auto-compaction at the cliff. Auto-compact summarizes against an already-degraded context, so the summary itself is worse than if you had compacted earlier.

When to compact

| Situation | Action |
|---|---|
| Finished a discrete phase (auth module done, refactor complete) | /compact with focus instructions |
| Starting an unrelated task | /clear (not compact) |
| Around 60 to 70% context used | /compact proactively, before quality drops |
| Mid-task, context filling fast | /compact with tight preservation instructions |
| Quick lookup you don't want in history | /btw <question> (Claude Code overlay) |
| Heavy research coming up | Delegate to a subagent |
| You only need to roll back | Esc + Esc or /rewind, then "Summarize from here" |

The three layers of compaction control

This is the playbook I use in Claude Code, derived from Anthropic's docs plus what I've learned shipping AI features.

Layer 1: One-shot instructions on the compaction command.

Don't just type /compact. Tell it what to keep. Name files, function names, error messages, decisions ruled out:

/compact Focus on the API contract changes, the three modified files
(auth/middleware.ts, auth/session.ts, api/login.ts), the failing test
name, and the rejected approach using JWT refresh tokens. Drop
debugging back-and-forth.

"Preserve the file paths" beats "remember what we did."

Layer 2: Persistent rules in CLAUDE.md plus a PostToolUse compact hook.

CLAUDE.md is a file in your repo. It loads at session start, so anything in it is your baseline context in every session. To make it survive compaction mid-session, add a PostToolUse hook with the compact matcher that re-injects CLAUDE.md (or an @CLAUDE.md mention restating it) right after /compact runs.

This is the single most underused lever in Claude Code. The combination of "loaded fresh at session start" plus "re-injected after every compact via hook" means your standing rules don't silently disappear mid-session.

Drop-in block for your repo's CLAUDE.md:

## Compact Instructions

When compacting this session, always preserve:
- Files modified with their paths
- Tests run and pass/fail status
- Architectural decisions and approaches explicitly rejected
- Current TODO with done vs in-progress
- Open questions blocked on humans

Drop:
- Step-by-step debugging output once a bug is resolved
- File contents already on disk (they will be re-read)
- Discussion of approaches we chose to use (the code reflects them)

Format: bullets, not prose.

Drop-in hook for .claude/settings.json (or ~/.claude/settings.json):

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "compact",
        "hooks": [
          {
            "type": "command",
            "command": "echo 'Reminder: re-read CLAUDE.md before proceeding. Restate any standing rules and the Compact Instructions block.'"
          }
        ]
      }
    ]
  }
}

In a real repo, point the hook at a script that injects the relevant lines from CLAUDE.md directly into the next turn.
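
A minimal sketch of that script, assuming a CLAUDE.md at the repo root with the Compact Instructions heading from the block above:

#!/usr/bin/env python3
# Print the Compact Instructions section of CLAUDE.md so the hook
# re-injects it into the next turn after /compact runs.
from pathlib import Path

text = Path("CLAUDE.md").read_text()
marker = "## Compact Instructions"
if marker in text:
    section = text.split(marker, 1)[1].split("\n## ", 1)[0]  # stop at next heading
    print(marker + section)
else:
    print("Reminder: re-read CLAUDE.md before proceeding.")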

Two nuances:

  1. Project-wide rules (conventions, stack, banned patterns) belong in the body of CLAUDE.md, not the Compact Instructions section. Both load at session start; rules belong with rules.
  2. Keep CLAUDE.md tight. It loads every session and eats your context budget before you start. Run /context to audit. Aim for under ~200 lines.

Layer 3: Architecture. The universal play.

Layers 1 and 2 are Claude Code specific. Layer 3 works in every tool:

  • Files on disk beat conversation history. Claude (and other tools) can re-read files when asked. Write durable state to DECISIONS.md or SESSION_NOTES.md rather than depending on chat to remember.
  • Subagents isolate research. They report a summary; raw exploration never hits your main context.
  • Disable MCP servers you're not using. /mcp or @server-name disable can free 10 to 20% before you even compact. Check with /context.
  • Front-load complex work. Quality is best in the first 60 to 70% of a session.

If you're on claude.ai, ChatGPT, Gemini, or any other tool, your only direct control is Layer 3, and it is still very powerful.

Common compaction failures

| Failure | Cause | Fix |
|---|---|---|
| "It forgot the project rules" | Rules were in chat, not CLAUDE.md | Put always/never rules in CLAUDE.md body |
| Summary is vague ("worked on authentication") | Ran /compact with no instructions | Always pass focus instructions |
| Lost "why we rejected approach X" | Default summarizer drops rejected paths | List rejections explicitly in Compact Instructions |
| Quality dropped after compact | Triggered too late (95% full) | Compact at 60 to 70%, not 95% |
| CLAUDE.md burns context before work starts | Drift over time | Audit with /context, keep under ~200 lines |
| Subtle inconsistencies after compact | Conversation-only state got squashed | Move durable state to a file on disk |

Verification prompt

Use this right after /compact:

Before we continue, confirm in 5-10 bullets:
1) Files currently in flight and their state
2) Tests passing / failing
3) Current TODO with done vs pending
4) Any decisions or constraints I gave you that you'd lose without me restating
5) What you think the next step is

If anything in (4) is thin, tell me and I'll restate it.

Catches summary failures while the conversation is still recoverable.

Platform reality (May 2026)

The asymmetry that matters:

| Platform | Manual compaction? | Persistence layer |
|---|---|---|
| Claude Code | Full set: /compact, /clear, /rewind, /btw, Esc+Esc | CLAUDE.md + PostToolUse compact hook |
| Cursor (agent) | /summarize (Composer is trained for self-summarization) | Workspace + rules files |
| ChatGPT | Branch Conversations only (GA 2026, manual fork) | Projects + Custom Instructions + Memory + Memory Sources |
| Claude (claude.ai consumer) | No. Auto-summarizes silently. | Projects + Memory (all plans since March 2026) |
| Gemini (consumer) | No. Opaque truncation. | Gems (persistent across sessions) |
| Grok | No. Silent auto-compaction. | Memory + Grok Studio + Custom Agents |
| GitHub Copilot | No. Aggressive truncation, very short context. | Session-scoped |
| Microsoft 365 Copilot | No. Silent server-side. | Grounded in M365 Graph data |

Claude Code is currently the only consumer-accessible tool with all three layers exposed. Cursor offers some. ChatGPT added a manual primitive in 2026 (Branch Conversations) but it's a fork-and-restart, not compaction control. Everything else gives you Layer 3 only.

That's not a flaw in the tools. It's a design choice. But it means your control over context is mostly architectural: files on disk, Projects, handoff prompts, subagents.

Section 4: The U-shape and the 50% rule

Stanford, UC Berkeley, and Samaya AI published a paper in 2023 (TACL Vol. 12, 2024) that quietly became foundational:

"We observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models."

Source: Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, Liang. Lost in the Middle: How Language Models Use Long Contexts. arxiv.org/abs/2307.03172

Models attend to the start and the end. The middle gets less love.

The pattern still shows up in 2025 and 2026 frontier models, though the shape varies by model. Calibration follow-up: Hsieh et al., Found in the Middle, arXiv:2406.16008, established the U-shape as an intrinsic attention bias.

The quote trick

A practical mitigation, verbatim from Anthropic's long-context guide:

When you want the model to find specific information in a long document, don't ask it the question directly. Tell it to quote the source first:

Quote the exact sentence in the document that mentions [thing],
then answer.

I tested this with a 30,000-word document and a single fact buried near the middle (a fake project codename, "OCTOPUS-7"). Without the quote prefix, the model often hedged or hallucinated. With it, the model nailed the answer.

This costs one line of prompt and produces a measurable accuracy gain. Use it on every long-document question.

Source: Anthropic, Long context prompting tips. docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/long-context-tips

The 50% rule

A judgment call, not a law. Tier by stakes:

| Work type | Practical ceiling |
|---|---|
| Routine, comfortable, model-familiar territory | Up to 50% of stated capacity |
| Complex reasoning, code you'll ship, analysis you can't easily verify | Under 25% |
| Critical algorithms, regulated, legal, anything where a wrong answer costs you | Under 15 to 20% |

This is my heuristic, not a result from a paper. The closest empirical anchor is BABILong's finding that "popular LLMs effectively utilize only 10-20% of the context." The 50/25/15-20 tiering is my judgment call based on shipping work.

Use what works for your stakes. If you're brainstorming a marketing email, 50% is fine. If you're shipping a payments algorithm, stay under 15 to 20%.

Tools to observe usage

| Platform | How to check |
|---|---|
| Claude Code | /context shows percentage, per-MCP cost, skills cost |
| Claude API | usage object in response (input_tokens, cache_read_input_tokens) |
| OpenAI API | usage object (prompt_tokens, completion_tokens, total_tokens) |
| Gemini API | usage_metadata field in response |
| ChatGPT / Claude.ai / Gemini consumer | Not visible. Estimate via tokenizer playground. |
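
If you'd rather enforce the tiers than eyeball them, here is a tiny sketch that turns those usage numbers into a warning. The window size is an assumption you set per model:

WINDOW = 200_000        # assumed window for your model; set per vendor
ROUTINE_CEILING = 0.50  # brainstorming, familiar territory
SHIP_CEILING = 0.25     # code you'll ship, complex reasoning

def check_budget(input_tokens: int) -> str:
    used = input_tokens / WINDOW
    if used > ROUTINE_CEILING:
        return f"{used:.0%} used: compact or start fresh"
    if used > SHIP_CEILING:
        return f"{used:.0%} used: fine for routine work, too full for shipping code"
    return f"{used:.0%} used: healthy"

# Feed it input_tokens (Anthropic), prompt_tokens (OpenAI),
# or usage_metadata.prompt_token_count (Gemini).
print(check_budget(72_000))  # "36% used: fine for routine work, ..."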

Section 5: Four numbers that should bother you

Not vendor marketing. Peer-reviewed and pre-print research from 2024 and 2025.

10 to 20% of the context window is what popular LLMs actually use.

Source: Kuratov et al. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. NeurIPS 2024 Datasets & Benchmarks Track. arxiv.org/abs/2406.10149

"popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity."

Caveat: tested on mid-2024 frontier models. Not independently re-run on 2025 or 2026 frontier models.

55.8% is GPT-4o's accuracy on book-comprehension questions. Random is 50%.

Source: Karpinska et al. One Thousand and One Pairs: A "novel" challenge for long-context language models. EMNLP 2024. novelchallenge.github.io

"no open-weight model performs above random chance, while GPT-4o achieves the highest accuracy at 55.8%."

The paper's setup is claim-verification on minimally-different true/false pairs about recently-published English fiction, which is why random is 50%. The public leaderboard has since posted higher numbers from newer models (Claude 3.5 Sonnet, o1) and some 2025 open-weight models have cleared chance.

99.3% to 69.7% is GPT-4o's drop when the question doesn't share keywords with the source.

Source: Modarressi et al. NoLiMa: Long-Context Evaluation Beyond Literal Matching. Adobe Research & LMU Munich. ICML 2025. arxiv.org/abs/2502.05167

"At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%."

18 of 18 frontier models did BETTER on randomly shuffled context than on coherent documents.

Source: Hong, Troynikov, Huber. Context Rot. Chroma, July 14, 2025. research.trychroma.com/context-rot

The 2024 NVIDIA RULER benchmark (Hsieh et al.) added context: "only half of them can maintain satisfactory performance at the length of 32K." That headline was measured on April 2024 models. By May 2026, frontier models (Gemini 2.5/3.x Pro, Claude Opus 4.x, GPT-5 family, Llama 3.1 405B) clear RULER at 32K easily and most clear it at 128K. The interesting frontier is now 256K to 1M.

Successor benchmarks worth knowing: HELMET (Princeton, 2024 to 2025) and LongBench v2.

A March 2026 follow-up worth noting: Chroma shipped Context-1, a 20B agentic search model designed to address context rot by editing its own context. trychroma.com

Section 6: How the four tools differ (May 2026)

| Tool | Max window | Persistence | Biggest gotcha |
|---|---|---|---|
| ChatGPT | GPT-5.5 ~1M+ API, 400K Codex | Memory + Memory Sources, Projects, Branch Conversations, Custom Instructions | Memory writes are automatic. Memory Sources lets you see which memories fired. Audit them. |
| Claude | Opus 4.7 1M (April 2026), Sonnet 4.x also 1M | Projects with file library, Memory across plans since March 2026, Managed Agents memory in enterprise public beta | Lost-in-the-middle on huge dumps. Anthropic is conspicuously quiet on long-context marketing. |
| Gemini | 3.1 Pro 1M, 1.5 Pro 2M for legacy | Gems (persistent across sessions), Workspace grounding (Drive, Docs, Gmail) | 99.7% NIAH claim is from the 2024 Gemini 1.5 Pro paper. Current docs walked back to ~99%. |
| Grok | Grok 4.20 and 4 Fast 2M, Grok 4.3 1M, Grok 4 base 256K | Memory (April 2025), Grok Studio, Custom Agents (capped at 4, 4,000-char instructions, March 2026) | Less guardrail tuning. X data adds noise. |

Marketing versus reality

The cleanest case study in vendor positioning: Google's NIAH claim.

The February 2024 Gemini 1.5 Pro technical report said ">99.7% recall on NIAH up to 1M tokens." Google's current long-context docs page (last updated January 2026) softened to "~99% on a single query" and "up to 99% accuracy in many cases." Same vendor, two years apart, walked back the precision themselves.

OpenAI's 2026 counter: GPT-5.2 Thinking is marketed as "first to near-100% on MRCR 4-needle at 256K." GPT-5.5 is marketed as the first model where "the whole context window is genuinely usable." Note the shift in benchmark. NIAH-1 has become a marketing floor. Multi-needle (MRCR, NIAH-8) is the new frontier.

Anthropic stays conspicuously quiet on long-context marketing. They publish 1M-window models but don't push NIAH or MRCR scores in promotional copy. That silence is itself a signal.

The honest 2026 picture:

  • NIAH-1 (single needle in haystack) is roughly solved.
  • Multi-needle retrieval is the real test.
  • On NIAH-8 between 200K and 1M, every frontier model except Gemini 3 loses 30 to 60 percentage points.
  • The "1M window" on paper is not the same as the "1M window where the model actually performs."

The pattern across all four

  • Same plumbing, different names: system prompt, files, history, memory, retrieval.
  • Learn the plumbing once. It carries over from tool to tool.
  • The "winning tool" is rarely the model. It is the one where you control the context best for your task.

Section 7: Five failure modes (and the fixes)

These are the five I demo live. Each one is reproducible at your desk in five minutes.

Failure 1: Stale fact contradiction

Setup: ChatGPT or Claude, single chat.

1. I'm planning a trip to Denver. My budget is $1,500.
2. Give me a 3-night itinerary in that budget.
3. Actually, budget changed to $4,000 and I want to add Aspen.
4. What's the weather there in October?
5. Remind me what my budget is and where I'm going.

The answer to step 5 usually contradicts something. Stale numbers leak. Aspen sometimes drops. Denver sometimes gets mentioned alone.

The fix: restate current state at the top of every turn. Don't trust the chat to track moving facts. Either reset the assumptions explicitly, or restart with a clean handoff prompt.

Failure 2: Instruction drift

Setup: a 15-turn chat with an explicit system rule at the top, like "Always bullet points. Never the word 'leverage.' End every answer with a one-line summary."

By turn 14, ask any fresh question. The model will usually violate the rule. It'll write a paragraph. It'll use "leverage" or close to it. It'll skip the one-line summary.

The fix: anchor standing rules in the system prompt or Project, not the chat body. Chat-body rules erode. System-prompt rules persist. If your tool exposes Projects (Claude, ChatGPT), Gems (Gemini), or Custom Instructions, put your standing rules there.

Failure 3: Lost in the middle

Setup: a long document with a single fact buried near the middle. Round 1, ask the model to find the fact. Round 2, ask it to quote the sentence containing the fact, then answer.

Round 1 often hedges or hallucinates. Round 2 usually nails it. The quote trick is from Anthropic's long-context guide and costs one line of prompt.

The fix: when finding information in a long document, demand a quote first. Then have the model answer based on the quoted sentence.

Failure 4: Same prompt, different tools

Setup: paste an identical prompt into ChatGPT, Claude, Gemini, and Grok in side-by-side tabs. Read the first five lines of each answer.

You'll get four different answers. Tone varies. Some make assumptions; some ask clarifying questions. Some invent numbers; some don't.

The fix: the tool is part of the prompt. If you've standardized on one tool and stop checking the others, you've also calcified one tool's bias into your work. Periodically rerun your important prompts in a different tool and compare.

Failure 5: Cross-session reset

Setup: a Project pre-built in Claude or ChatGPT with a one-page company brief loaded.

Open a brand new chat OUTSIDE the Project. Ask: "Continue our work on Acme Robotics." The model will flail. It might ask clarifying questions, hallucinate a fake Acme Robotics, or say it can't help.

Now open a fresh chat INSIDE the Project. Same prompt. The model picks up as if the conversation never paused.

The fix: engineer a context pack that loads every time. Use Projects, Gems, or Custom Agents. Keep a one-page brief per workstream. Reference it by name in your prompts.

Section 8: The playbook

Five moves. The index card that goes in someone's Slack tomorrow.

  1. Structure every prompt.
  2. Anchor with files and instructions.
  3. Know when to start fresh.
  4. Standardize patterns across the team.
  5. Measure and iterate.

Here's each move in practice.

Context engineering: this has a name now

Before the moves, a word on the discipline.

In June 2025, three posts crystallized a term the industry had been groping toward.

Tobi Lütke, Shopify CEO, June 19, 2025 (the popularizer):

"I really like the term 'context engineering' over prompt engineering. It describes the core skill better: the art of providing all the context for the task to be plausibly solvable by the LLM."

Andrej Karpathy, June 25, 2025 (the amplifier):

"+1 for 'context engineering' over 'prompt engineering.' Context engineering is the delicate art and science of filling the context window with just the right information for the next step."

Source: x.com/karpathy/status/1937902205765607626

Simon Willison, June 27, 2025 (the chronicler and defender of the rename). Source: simonwillison.net/2025/jun/27/context-engineering

Anthropic published the practitioner reference in September 2025: Effective Context Engineering for AI Agents. The question they ask is the right one: "what configuration of context is most likely to generate our model's desired behavior?" anthropic.com/engineering/effective-context-engineering-for-ai-agents

You are not making up a framework. You are practicing a discipline the industry named in June 2025.

Move 1: Structure every prompt

Five-part template. Use this for any prompt that matters:

ROLE: who the model is for this task.
CONTEXT: background facts, constraints, prior work.
TASK: what you want done, specifically.
CONSTRAINTS: what it must or must not do.
FORMAT: what the output should look like.

Put the most important instruction at the bottom. Recency bias is real. The model weighs the last instruction most heavily. This is the U-shape from Section 4 applied to prompts: the start and the end get attention; the middle gets less.

Save this template as a snippet in your text expander. Run a bad prompt and the same prompt rewritten with this template side by side. Show the diff to your team.

Move 2: There's power in naming (and anchoring with files)

This is the technique I use most often. In witchcraft, they say there is power in naming something. The same is true in AI.

The right word, the named methodology, the canonical pattern, the term of art, carries the full weight of expert knowledge the model has read thousands of pages about.

Naming has two effects.

Compression. Saying "SOLID principles" gets you everything the model has read about single responsibility, open/closed, Liskov, interface segregation, and dependency inversion. In two words.

Behavior switch. Saying "RTFM" doesn't just tell the model what the acronym means. It tells the model how experts behave when they're doing RTFM. The name is an instruction.

Two examples I use constantly:

  • RTFM (Read The Manual). When I start a prompt with RTFM, the model knows I want it to check documentation, look up specifics, and lean on cited sources instead of training memory. I get researched answers and fewer confident hallucinations.
  • TDD (Test-Driven Development). When I tell the model "use TDD," it knows the discipline: write the failing test first, then the minimum code to pass, then refactor. I don't have to spell it out.

Examples by category:

| Category | What naming invokes | Examples |
|---|---|---|
| Methodology / workflow | A discipline or process | RTFM, TDD, BDD, GTD, OKRs |
| Pattern (structural) | A code or system shape | Singleton, Observer, MVC, Factory, Strategy |
| Principle (constraint) | A rule the model should respect | SOLID, DRY, YAGNI, KISS, separation of concerns |
| Framework (checklist) | A structured output format | AIDA, SWOT, SOAP notes, 4Ps, RACI |
| Domain term | A specific body of knowledge | Black-Scholes, hexagonal architecture, idempotent operations |

Token math: "Use TDD" is 3 tokens. The full TDD discipline described in prose is ~300 tokens. Compression ratio: 100 to 1. That ratio is real space in your context window AND real money on your token bill AND a behavior change you couldn't get reliably with the prose version.

Anchoring with files. Naming works inside the prompt. For longer-running context (project conventions, customer profile, tone of voice, vocabulary), anchor with files.

  • Use Projects (Claude, ChatGPT) or Gems (Gemini) for stable context.
  • Keep a living "AI brief" doc per workstream, one page max: who, what, vocabulary, do/don't, success criteria.
  • Reference files by name. Don't re-paste content the model already has.

How to test a term before relying on it. Ask the model to define the term first. If it nails the definition AND describes the implied workflow, the term works. If it hedges or gets vague, write it out instead.

When NOT to use a name:

  • Term is post-training (less than ~12 months old at the model's cutoff).
  • Term has multiple meanings across domains (always specify which one).
  • The reader of your prompt won't know the term either.

Move 3: Know when to start fresh

The two-corrections rule: if you correct the same mistake twice, restart.

Once the model has the mistake locked in, every subsequent turn weighs the wrong thing as much as the right thing. You're fighting your own context.

Carry forward a five-line handoff summary, not the whole thread:

Project: [name]
Where we are: [current state in 1-2 sentences]
Decisions made: [bullets, with rationale]
Approaches ruled out: [bullets, with rationale]
What's next: [specific next step]

Long chats are a smell, not a feature. The "longest chat ever" competition has no winners.

Move 4: Don't drag, delegate (the subagent pattern)

When you need to work through a lot of information, don't drag all of it into your main chat. Delegate to a subagent.

A subagent is a separate AI call you make from inside your own AI work. It has its own fresh context window. You give it a task. It does the heavy lifting. It returns only the answer.

Example: "Find the top 3 customer complaints in this 200-page support log."

  • Drag-everything approach (bad): paste 200 pages into the main chat (~150,000 tokens). Ask the model to read, analyze, and rank. Main context consumed: ~151,000 tokens. The model strains in one turn.
  • Subagent approach (good): main chat dispatches a subagent. Subagent processes the log inside its own fresh context window. Returns ~500 tokens of clean output. Main context consumed: ~500 tokens. The 200-page log never enters it.

The model in your main chat is the executive. The subagent is the analyst. Don't make your executive read the whole report. Make them read the analyst's memo.
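
At the API level, the pattern is nothing more than a second call with its own fresh window. A minimal sketch assuming the anthropic Python SDK; the model name, file name, and token estimates are placeholders:

import anthropic

client = anthropic.Anthropic()

def subagent_summarize(log_text: str) -> str:
    """The analyst: burns its own fresh context window on the raw log."""
    result = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model
        max_tokens=800,
        messages=[{
            "role": "user",
            "content": "List the top 3 customer complaints in this support "
                       "log, one line of evidence each:\n\n" + log_text,
        }],
    )
    return result.content[0].text  # the ~500-token memo

# The executive: the main conversation only ever sees the memo
memo = subagent_summarize(open("support_log.txt").read())
main_turn = "Given this analyst memo, draft a remediation plan:\n\n" + memo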

Tooling (developer-facing, May 2026):

  • Claude: Claude Code Agent tool (renamed from Task in v2.1.63), Agent SDK subagents, plus Agent Teams, Background Agents, Managed Agents. Always-shipped built-in is general-purpose. Common patterns include code-reviewer and test-runner. User-defined agents are Markdown files with YAML frontmatter in .claude/agents/. No officially documented concurrency cap; community reports show 20+ parallel subagents in practice.
  • OpenAI: Codex Subagents (TOML-configured with name, description, model, sandbox_mode, mcp_servers), Agents SDK (Handoffs primitive), Deep Research (orchestrator). Deep Research caps (May 2026): Free has no access; Plus 10/mo; Pro $100 500/mo; Pro $200 unlimited.
  • Google: Agent Development Kit (ADK) and Gemini CLI subagents. First-class multi-agent graphs in ADK. Consumer Gemini app cannot directly spawn subagents (developer-only).
  • xAI: Grok 4.20 Multi-Agent. A leader agent plus unnamed sub-agents. Agent count tied to reasoning effort: ~4 at low/medium, ~16 at high/xhigh. Not user-spawnable as developer subagents.

Non-vendor players worth knowing: LangGraph, Microsoft AutoGen 0.4+, CrewAI, Cognition Devin, Cursor Agents.

Reference reading: Anthropic, Building Effective Agents. anthropic.com/research/building-effective-agents

Move 5: External tools that pre-filter your context

Instead of having your AI read raw data and figure out what matters, hand the search-classify-summarize step to a specialized service. Your AI only sees the answer.

In 2026, every major vendor ships its own grounded search:

  • Anthropic: Claude web search (API + claude.ai).
  • OpenAI: web_search tool (Responses API + ChatGPT).
  • Google: Gemini Grounding (Search + URL grounding in the Gemini API).

You don't have to bolt on a third-party service for grounded answers. The third-party services below still matter when you want specialization, scale, or vendor neutrality:

| Service | What it does |
|---|---|
| Brave Search API | LLM Context endpoint + AI Grounding (94.1% F1 on SimpleQA). Free credits (~$5/mo). brave.com/search/api/ |
| Exa AI | Neural-embedding search. Tiers: Fast / Auto / Deep. Websets for agentic batch. Free 1K/mo + paid. exa.ai |
| Perplexity Sonar | Sonar, Sonar Pro (single-shot), Sonar Pro Search (agentic), Sonar Reasoning, Sonar Deep Research. docs.perplexity.ai |
| Tavily | Search / Extract / Crawl / Research. Acquired by Nebius for $275M in February 2026. tavily.com |

Worth naming if you build production AI: Linkup, Jina AI Reader and Search, Firecrawl.

MCP (Model Context Protocol) is the open standard that makes this composable. Originally from Anthropic in November 2024. Donated to the Linux Foundation's Agentic AI Foundation (AAIF) on December 9, 2025. Native support across Anthropic, OpenAI, Google, and Microsoft by end of 2025. Often described as "USB-C for AI." Per AAIF: 97M+ monthly SDK downloads, ~10K active servers. modelcontextprotocol.io

If you build AI systems for a living, this is the unlock. Five years ago you would have written 500 lines of Python to wire up a search service. With MCP, it's a config file.
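
For a sense of what that config file looks like: the shape below is the claude_desktop_config.json format Claude Desktop uses, wiring up the reference Brave Search MCP server. The key is a placeholder:

{
  "mcpServers": {
    "brave-search": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"],
      "env": { "BRAVE_API_KEY": "your-key-here" }
    }
  }
}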

Section 9: Standardize and measure (for teams)

Individual practice scales when the team adopts shared patterns.

  • Shared prompt library, versioned like code. Treat prompts as a code asset. PR them. Review them. Tag releases.
  • Shared context packs per role. Voice, vocabulary, do/don't. One per function (sales, engineering, support, leadership).
  • Three canonical tasks per team, run weekly. Track three things: time to first usable output, edits required, drift events. A minimal logging sketch follows this list.
  • That's how "AI productivity" becomes a number, not a vibe.
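
That log can be as simple as a CSV. A minimal sketch, one row per canonical-task run; the file name and field names are assumptions:

import csv
import datetime

def log_run(task: str, minutes_to_first_usable: float,
            edits_required: int, drift_events: int) -> None:
    """Append one row per canonical-task run to a shared metrics file."""
    with open("ai_metrics.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.date.today().isoformat(), task,
            minutes_to_first_usable, edits_required, drift_events,
        ])

log_run("weekly-report-draft", 12.5, edits_required=3, drift_events=1)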

Practitioners worth reading on this: Simon Willison, Andrej Karpathy, and Anthropic's engineering blog, all linked in the Citations section below.

What to do Monday morning

  1. Pick one workflow you do weekly.
  2. Write a one-page AI brief for it. Save it as a Project, Gem, or Custom Agent.
  3. Run it three times this week. Log what breaks: stale facts, instruction drift, lost-in-middle, tool variance, cross-session reset.
  4. Bring the log to your team.

That's the loop. Workflow, brief, run, log, discuss. Do it three weeks in a row and you'll have a measurable practice. Do it three months in a row and you'll have a team standard.

Most of what separates people who get consistent results from people who don't is exactly this: a repeatable, observable, measurable loop. The tools change. The discipline doesn't.

Citations and further reading

Every quote in this article is verified. If you want to challenge or extend any point, here are the originals.

Research papers

  • Liu et al. Lost in the Middle: How Language Models Use Long Contexts. Stanford + UC Berkeley + Samaya AI. TACL Vol. 12, 2024 (arXiv preprint 2023). arxiv.org/abs/2307.03172
  • Kuratov et al. BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack. NeurIPS 2024 Datasets & Benchmarks Track. arxiv.org/abs/2406.10149
  • Karpinska et al. One Thousand and One Pairs: A "novel" challenge for long-context language models (NoCha). EMNLP 2024. novelchallenge.github.io
  • Modarressi et al. NoLiMa: Long-Context Evaluation Beyond Literal Matching. Adobe Research & LMU Munich (MCML). ICML 2025. arxiv.org/abs/2502.05167
  • Hong, Troynikov, Huber. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma, July 14, 2025. research.trychroma.com/context-rot
  • Hsieh et al. RULER: What's the Real Context Size of Your Long-Context Language Models? NVIDIA, 2024. arxiv.org/abs/2404.06654
  • Hsieh et al. Found in the Middle: Calibrating Positional Attention Bias. arXiv:2406.16008.

Industry pieces

  • Tobi Lütke on "context engineering," X, June 19, 2025.
  • Andrej Karpathy on context engineering, X, June 25, 2025. x.com/karpathy/status/1937902205765607626
  • Simon Willison. Context engineering. June 27, 2025. simonwillison.net/2025/jun/27/context-engineering
  • Anthropic. Effective Context Engineering for AI Agents. September 2025. anthropic.com/engineering/effective-context-engineering-for-ai-agents
  • Anthropic. Building Effective Agents. anthropic.com/research/building-effective-agents

Tools

  • Anthropic. Long context prompting tips. docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/long-context-tips
  • Model Context Protocol. modelcontextprotocol.io

Search services

  • Brave Search API. brave.com/search/api/
  • Exa AI. exa.ai
  • Perplexity Sonar. docs.perplexity.ai
  • Tavily. tavily.com


Michael Martin is the Founder and Chief Product Officer of Digital2DNA in Bellbrook, Ohio. He's been applying AI to production problems since 2020 and was named 2019 Cincinnati CIO of the Year by the Cincinnati Business Courier. Reach him at michael.martin@digital2dna.com.