Prompt Caching — Design & Optimization
This file covers how to design prompt-building code for effective caching. For language-specific syntax, see the ## Prompt Caching section in each language's README or single-file doc.
The one invariant everything follows from
Prompt caching is a prefix match. Any change anywhere in the prefix invalidates everything after it.
The cache key is derived from the exact bytes of the rendered prompt up to each cache_control breakpoint. A single byte difference at position N — a timestamp, a reordered JSON key, a different tool in the list — invalidates the cache for all breakpoints at positions ≥ N.
Render order is: tools → system → messages. A breakpoint on the last system block caches both tools and system together.
Design the prompt-building path around this constraint. Get the ordering right and most caching works for free. Get it wrong and no amount of cache_control markers will help.
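The invariant can be made concrete with a toy model. The sketch below is not how the API computes cache keys internally; it only assumes, as stated above, that the key at each breakpoint depends on every rendered byte before it. The name prefix_keys and the choice of SHA-256 are illustrative.

```python
import hashlib


def prefix_keys(blocks, breakpoints):
    """Return one hash per breakpoint, covering all blocks up to and including it.

    Toy model only: `blocks` is the rendered prompt as a list of strings
    (tools, then system, then messages) and `breakpoints` is a list of block
    indices that carry a cache_control marker.
    """
    keys = {}
    hasher = hashlib.sha256()
    wanted = set(breakpoints)
    for index, block in enumerate(blocks):
        hasher.update(block.encode("utf-8"))
        hasher.update(b"\x00")  # separator so block boundaries affect the hash
        if index in wanted:
            keys[index] = hasher.copy().hexdigest()
    return keys


stable = ["<tools>", "<system prompt>", "turn 1", "turn 2"]
edited = ["<tools>", "<system prompt>", "turn 1 (edited)", "turn 2"]

before = prefix_keys(stable, breakpoints=[1, 3])
after = prefix_keys(edited, breakpoints=[1, 3])

# The breakpoint before the edit still matches; the one after it does not.
system_key_survives = before[1] == after[1]
message_key_survives = before[3] == after[3]
```

In this model the edit at block 2 leaves the key at block 1 untouched and changes the key at block 3, which is the behavior the rest of this file designs around.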
Workflow for optimizing existing code
When asked to add or optimize caching:
- Trace the prompt assembly path. Find where system, tools, and messages are constructed. Identify every input that flows into them.
- Classify each input by stability:
  - Never changes → belongs early in the prompt, before any breakpoint
  - Changes per-session → belongs after the global prefix, cache per-session
  - Changes per-turn → belongs at the end, after the last breakpoint
  - Changes per-request (timestamps, UUIDs, random IDs) → eliminate or move to the very end
- Check rendered order matches stability order. Stable content must physically precede volatile content. If a timestamp is interpolated into the system prompt header, everything after it is uncacheable regardless of markers.
- Place breakpoints at stability boundaries. See placement patterns below.
- Audit for silent invalidators. See anti-patterns table.
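As a rough illustration of the workflow above, the sketch below assembles a request so that rendered order follows stability order. The helper build_request and its four parameter names are hypothetical; a real code base will have its own inputs to classify.

```python
def build_request(static_system, session_context, history, turn_input):
    """Assemble request fields so stable content precedes volatile content.

    Hypothetical helper; all parameter names are illustrative.
    - static_system: text that never changes (global prefix)
    - session_context: text fixed for one session (for example retrieved docs)
    - history: list of prior message dicts
    - turn_input: the new user text for this turn
    """
    system = [
        # Breakpoint 1: end of the never-changing prefix, shared by all sessions.
        {"type": "text", "text": static_system,
         "cache_control": {"type": "ephemeral"}},
        # Breakpoint 2: end of the per-session prefix.
        {"type": "text", "text": session_context,
         "cache_control": {"type": "ephemeral"}},
    ]
    # Per-turn content goes last, after every breakpoint, with no marker.
    messages = list(history) + [
        {"role": "user", "content": [{"type": "text", "text": turn_input}]}
    ]
    return {"system": system, "messages": messages}


request = build_request("<global instructions>", "<session docs>", [], "What changed?")
```

Per-request values such as timestamps have no parameter here on purpose: if one is truly needed, it would go into turn_input, never into the system blocks.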
Placement patterns
Large system prompt shared across many requests
Put a breakpoint on the last system text block. If there are tools, they render before system — the marker on the last system block caches tools + system together.
"system": [
{"type": "text", "text": "<large shared prompt>", "cache_control": {"type": "ephemeral"}}
]
Multi-turn conversations
Put a breakpoint on the last content block of the most-recently-appended turn. Each subsequent request reuses the entire prior conversation prefix. Earlier breakpoints remain valid read points, so hits accrue incrementally as the conversation grows.
// Last content block of the last user turn
messages[-1].content[-1].cache_control = {"type": "ephemeral"}
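A minimal Python sketch of the same idea follows. The helper mark_last_block is hypothetical. It assumes that string content and a single text block are equivalent for caching purposes, and it drops markers from earlier message blocks so the request stays within the per-request breakpoint limit.

```python
import copy


def mark_last_block(messages):
    """Return a copy of `messages` with one breakpoint on the final content block.

    Markers on earlier message blocks are removed from the copy; the input
    list is left untouched.
    """
    marked = copy.deepcopy(messages)
    for message in marked:
        # Normalize plain-string content into a list of text blocks.
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
        for block in message["content"]:
            block.pop("cache_control", None)
    if marked and marked[-1]["content"]:
        marked[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "first question"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "first answer",
         "cache_control": {"type": "ephemeral"}},
    ]},
    {"role": "user", "content": "follow-up"},
]
to_send = mark_last_block(history)
```

Calling this once per turn, just before sending, keeps the marker on the newest turn without touching system or tools.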
Shared prefix, varying suffix
Many requests share a large fixed preamble (few-shot examples, retrieved docs, instructions) but differ in the final question. Put the breakpoint at the end of the shared portion, not at the end of the whole prompt — otherwise every request writes a distinct cache entry and nothing is ever read.
"messages": [{"role": "user", "content": [
{"type": "text", "text": "<shared context>", "cache_control": {"type": "ephemeral"}},
{"type": "text", "text": "<varying question>"} // no marker — differs every time
]}]
Prompts that change from the beginning every time
Don't cache. If the first 1K tokens differ per request, there is no reusable prefix. Adding cache_control only pays the cache-write premium with zero reads. Leave it off.
Architectural guidance
These are the decisions that matter more than marker placement. Fix these first.
Keep the system prompt frozen. Don't interpolate "current date: X", "mode: Y", "user name: Z" into the system prompt — those sit at the front of the prefix and invalidate everything downstream. Inject dynamic context as a user or assistant message later in messages. A message at turn 5 invalidates nothing before turn 5.
Don't change tools or model mid-conversation. Tools render at position 0; adding, removing, or reordering a tool invalidates the entire cache. Same for switching models (caches are model-scoped). If you need "modes", don't swap the tool set — give Claude a tool that records the mode transition, or pass the mode as message content. Serialize tools deterministically (sort by name).
Fork operations must reuse the parent's exact prefix. Side computations (summarization, compaction, sub-agents) often spin up a separate API call. If the fork rebuilds system / tools / model with any difference, it misses the parent's cache entirely. Copy the parent's system, tools, and model verbatim, then append fork-specific content at the end.
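One way to make the fork rule hard to violate is to derive every fork from the parent's request arguments instead of rebuilding them. The sketch below assumes the parent request is kept as a plain dict of messages.create() keyword arguments; fork_request is a hypothetical name.

```python
import copy


def fork_request(parent_request, fork_messages):
    """Build a side-call request that shares the parent's cached prefix.

    `parent_request` is a dict of request keyword arguments; `fork_messages`
    are the fork-specific messages, appended after the parent's messages.
    """
    fork = copy.deepcopy(parent_request)  # model, system, tools copied verbatim
    fork["messages"] = fork["messages"] + copy.deepcopy(fork_messages)
    return fork


parent = {
    "model": "<model id>",
    "system": [{"type": "text", "text": "<system prompt>",
                "cache_control": {"type": "ephemeral"}}],
    "tools": [{"name": "search", "description": "...", "input_schema": {"type": "object"}}],
    "messages": [{"role": "user", "content": "Investigate the outage."}],
}
summary_call = fork_request(
    parent, [{"role": "user", "content": "Summarize the conversation so far."}]
)
```

The fork-specific instruction arrives as a trailing message, so everything before it is byte-identical to the parent's prefix.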
Silent invalidators
When reviewing code, grep for these inside anything that feeds the prompt prefix:
| Pattern | Why it breaks caching |
|---|---|
| datetime.now() / Date.now() / time.time() in system prompt | Prefix changes every request |
| uuid4() / crypto.randomUUID() / request IDs early in content | Same: every request is unique |
| json.dumps(d) without sort_keys=True / iterating a set | Non-deterministic serialization → prefix bytes differ |
| f-string interpolating session/user ID into system prompt | Per-user prefix; no cross-user sharing |
| Conditional system sections (if flag: system += ...) | Every flag combination is a distinct prefix |
| tools=build_tools(user) where set varies per user | Tools render at position 0; nothing caches across users |
Fix by moving the dynamic piece after the last breakpoint, making it deterministic, or deleting it if it's not load-bearing.
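For the "make it deterministic" option, a small sketch using only the Python standard library is shown below. The helper names stable_json and stable_tools are illustrative; the point is that equal data should always render to equal bytes.

```python
import json


def stable_json(value):
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def stable_tools(tools):
    """Return tool definitions in a fixed order (sorted by name)."""
    return sorted(tools, key=lambda tool: tool["name"])


a = {"region": "eu", "plan": "pro"}
b = {"plan": "pro", "region": "eu"}  # same data, different insertion order

tags = {"beta", "internal", "admin"}
rendered_tags = ", ".join(sorted(tags))  # sort before rendering a set

tools = stable_tools([{"name": "search"}, {"name": "calculator"}])
```

Plain json.dumps(a) and json.dumps(b) differ because dicts serialize in insertion order, while stable_json gives the same string for both.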
API reference
"cache_control": {"type": "ephemeral"} // 5-minute TTL (default)
"cache_control": {"type": "ephemeral", "ttl": "1h"} // 1-hour TTL
- Max 4 cache_control breakpoints per request.
- Goes on any content block: system text blocks, tool definitions, message content blocks (text, image, tool_use, tool_result, document).
- Top-level cache_control on messages.create() auto-places on the last cacheable block. This is the simplest option when you don't need fine-grained placement.
- Minimum cacheable prefix is model-dependent (see the table below). Shorter prefixes silently won't cache even with a marker: no error, just cache_creation_input_tokens: 0.
| Model | Minimum |
|---|---|
| Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 | 4096 tokens |
| Sonnet 4.6, Haiku 3.5, Haiku 3 | 2048 tokens |
| Sonnet 4.5, Sonnet 4.1, Sonnet 4, Sonnet 3.7 | 1024 tokens |
A 3K-token prompt caches on Sonnet 4.5 but silently won't on Opus 4.7.
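Because a too-short prefix fails silently, a pre-flight check can be useful. The sketch below copies the minimums from the table above, keyed by the model names as written there; mapping real model IDs to these names is left to the caller, and meets_cache_minimum is a hypothetical helper.

```python
# Minimums copied from the table above, keyed by the model names used there.
MIN_CACHEABLE_TOKENS = {
    "Opus 4.7": 4096, "Opus 4.6": 4096, "Opus 4.5": 4096, "Haiku 4.5": 4096,
    "Sonnet 4.6": 2048, "Haiku 3.5": 2048, "Haiku 3": 2048,
    "Sonnet 4.5": 1024, "Sonnet 4.1": 1024, "Sonnet 4": 1024, "Sonnet 3.7": 1024,
}


def meets_cache_minimum(prefix_tokens, model_name):
    """True if a prefix of `prefix_tokens` tokens reaches the model's minimum."""
    return prefix_tokens >= MIN_CACHEABLE_TOKENS[model_name]


caches_on_sonnet = meets_cache_minimum(3000, "Sonnet 4.5")
caches_on_opus = meets_cache_minimum(3000, "Opus 4.7")
```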
Economics: Cache reads cost ~0.1× base input price. Cache writes cost 1.25× for 5-minute TTL, 2× for 1-hour TTL. The payoff point depends on TTL: with 5-minute TTL, caching is already cheaper on the second request (1.25× + 0.1× = 1.35× vs 2× uncached); with 1-hour TTL, two requests still lose slightly (2× + 0.1× = 2.1× vs 2× uncached) and it takes a third to come out ahead (2× + 0.1× + 0.1× = 2.2× vs 3× uncached). The 1-hour TTL keeps entries alive across gaps in bursty traffic, but the doubled write cost means it needs more reads to pay off.
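The arithmetic can be checked with a few lines. The sketch below uses the multipliers quoted above and assumes the simplest case: one write followed by reads that all land within the TTL. Costs are in units of "one uncached pass over the prefix".

```python
READ_MULTIPLIER = 0.1
WRITE_MULTIPLIER = {"5m": 1.25, "1h": 2.0}


def cached_cost(n_requests, ttl="5m"):
    """Relative prefix cost of n requests: one cache write, then n-1 reads."""
    return WRITE_MULTIPLIER[ttl] + READ_MULTIPLIER * (n_requests - 1)


def requests_to_pay_off(ttl="5m"):
    """Smallest n for which caching is cheaper than n uncached requests."""
    n = 1
    while cached_cost(n, ttl) >= n:
        n += 1
    return n


pay_off_5m = requests_to_pay_off("5m")
pay_off_1h = requests_to_pay_off("1h")
```

If the entry expires between requests, each expiry adds another write, so the real payoff point for sparse traffic is later than this model suggests.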
Verifying cache hits
The response usage object reports cache activity:
| Field | Meaning |
|---|---|
| cache_creation_input_tokens | Tokens written to cache this request (you paid the ~1.25× write premium) |
| cache_read_input_tokens | Tokens served from cache this request (you paid ~0.1×) |
| input_tokens | Tokens processed at full price (not cached) |
If cache_read_input_tokens is zero across repeated requests with identical prefixes, a silent invalidator is at work — diff the rendered prompt bytes between two requests to find it.
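A byte-level diff does not need special tooling. The sketch below assumes your code can dump the rendered prefix of two requests as strings (for example a canonical JSON dump of tools, system, and messages); first_difference is a hypothetical helper that reports where they first diverge.

```python
def first_difference(rendered_a, rendered_b, context=20):
    """Return (offset, snippet_a, snippet_b) at the first differing byte, or None."""
    a = rendered_a.encode("utf-8")
    b = rendered_b.encode("utf-8")
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            break
    else:
        if len(a) == len(b):
            return None  # byte-identical
        offset = min(len(a), len(b))  # one input is a strict prefix of the other
    start = max(0, offset - context)
    end = offset + context
    return offset, a[start:end], b[start:end]


diff = first_difference("system: 10:01 rules...", "system: 10:02 rules...")
```

The snippets around the offset usually identify the culprit at a glance, such as a clock value or a reordered key.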
input_tokens is the uncached remainder only. Total prompt size = input_tokens + cache_creation_input_tokens + cache_read_input_tokens. If your agent ran for hours but input_tokens shows 4K, the rest was served from cache — check the sum, not the single field.
Language-specific access: response.usage.cache_read_input_tokens (Python/TS/Ruby), $message->usage->cacheReadInputTokens (PHP), resp.Usage.CacheReadInputTokens (Go/C#), .usage().cacheReadInputTokens() (Java).
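To make "check the sum" routine, the three fields can be folded into a single summary. The sketch below works on a plain dict holding the fields from the table above; summarize_usage is a hypothetical helper, and converting an SDK usage object into a dict is left to the caller.

```python
def summarize_usage(usage):
    """Compute total prompt size and cache-read share from a usage mapping.

    Missing or None values count as zero.
    """
    uncached = usage.get("input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    total = uncached + written + read
    return {
        "total_prompt_tokens": total,
        "cache_read_share": read / total if total else 0.0,
    }


summary = summarize_usage({
    "input_tokens": 4000,
    "cache_creation_input_tokens": 1000,
    "cache_read_input_tokens": 95000,
})
```

Logging the read share per request makes a regression visible as soon as a silent invalidator is introduced.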
Invalidation hierarchy
Not every parameter change invalidates everything. The API has three cache tiers, and changes only invalidate their own tier and below:
| Change | Tools cache | System cache | Messages cache |
|---|---|---|---|
| Tool definitions (add/remove/reorder) | ❌ | ❌ | ❌ |
| Model switch | ❌ | ❌ | ❌ |
| speed, web-search, citations toggle | ✅ | ❌ | ❌ |
| System prompt content | ✅ | ❌ | ❌ |
| tool_choice, images, thinking enable/disable | ✅ | ✅ | ❌ |
| Message content | ✅ | ✅ | ❌ |
Implication: you can change tool_choice per-request or toggle thinking without losing the tools+system cache. Don't over-worry about these — only tool-definition and model changes force a full rebuild.
20-block lookback window
Each breakpoint walks backward at most 20 content blocks to find a prior cache entry. If a single turn adds more than 20 blocks (common in agentic loops with many tool_use/tool_result pairs), the next request's breakpoint won't find the previous cache and silently misses.
Fix: place an intermediate breakpoint every ~15 blocks in long turns, or put the marker on a block that's within 20 of the previous turn's last cached block.
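A rough sketch of the spacing rule follows. It assumes content blocks are counted with a flat index across the whole request and that the index of the last previously cached block is known; intermediate_breakpoints is a hypothetical helper, and the stride of 15 comes from the guidance above.

```python
def intermediate_breakpoints(first_new_block, last_block, stride=15):
    """Block indices to mark so no gap between cached points exceeds `stride`.

    first_new_block: index of the first block added since the last cached block
    last_block: index of the final block in the request (always marked)
    """
    # The previous cached block sits at first_new_block - 1, so the first
    # intermediate mark lands `stride` blocks after it.
    marks = list(range(first_new_block + stride - 1, last_block, stride))
    marks.append(last_block)
    return marks


long_turn = intermediate_breakpoints(first_new_block=10, last_block=49)
short_turn = intermediate_breakpoints(first_new_block=10, last_block=15)
```

For very long turns the result can exceed the 4-breakpoint limit, in which case the caller has to pick a larger stride (still under 20) or give up breakpoints elsewhere.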
Concurrent-request timing
A cache entry becomes readable only after the first response begins streaming. N parallel requests with identical prefixes all pay full price — none can read what the others are still writing.
For fan-out patterns: send 1 request, await the first streamed token (not the full response), then fire the remaining N−1. They'll read the cache the first one just wrote.
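The fan-out ordering can be sketched with asyncio. The send coroutine here is a hypothetical stand-in supplied by the caller: it is assumed to stream one request, call first_token.set() when the first token arrives, and return the finished response. Nothing below is an SDK call.

```python
import asyncio


async def fan_out(send, prompts):
    """Run one request until its first token, then launch the rest in parallel."""
    if not prompts:
        return []
    first_token = asyncio.Event()
    leader = asyncio.create_task(send(prompts[0], first_token))
    # Wait for the first token, or for the leader to finish or fail without one.
    waiter = asyncio.create_task(first_token.wait())
    await asyncio.wait({leader, waiter}, return_when=asyncio.FIRST_COMPLETED)
    waiter.cancel()
    followers = [
        asyncio.create_task(send(prompt, asyncio.Event()))
        for prompt in prompts[1:]
    ]
    # Results come back in the same order as `prompts`.
    return await asyncio.gather(leader, *followers)
```

The leader keeps streaming while the followers run, so total latency grows only by the leader's time to first token.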
Pre-warming the cache
To eliminate the cache-miss latency on the first real request, send a max_tokens: 0 request at startup (or on an interval). The API runs prefill — writing the cache at your cache_control breakpoint — and returns immediately with content: [], stop_reason: "max_tokens", and a populated usage block (zero output tokens billed; normal cache-write charge on cache_creation_input_tokens).
When to pre-warm — pre-warming trades a cache-write charge now for lower TTFT on the next real request. It's worth it when all three hold: (a) first-request latency is user-visible (chat/voice/interactive — not background jobs), (b) the shared prefix is large enough that a cold write is noticeably slow, and (c) there's a moment before traffic to fire it — app startup, worker boot, post-deploy, start of a scheduled window.
| Skip pre-warming when… | Because |
|---|---|
| Traffic is continuous (requests ≤ TTL apart) | The first real request warms the cache and every subsequent one hits it; a separate warm call is a pure extra write |
| The prefix is small or below the cacheable minimum | The cold-write penalty is negligible |
| The prefix varies per request/user | Nothing shared to pre-warm |
| You'd pre-warm many distinct prefixes speculatively | Each is a ~1.25× write; cost can exceed the latency you save |
Scheduled re-warms: only needed when traffic has gaps longer than the TTL. If real requests arrive more often than every 5 minutes, they keep the cache warm on their own — don't add an interval re-warm. For bursty traffic with long idle gaps, either re-warm just under the TTL or switch to ttl: "1h" and re-warm less often.
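import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
SYSTEM_PROMPT = "<large shared prompt>"  # placeholder: must match the real request byte for byte

# Warm-up call: prefill writes the cache at the system breakpoint; no output is generated.
client.messages.create(
    model="{{OPUS_ID}}",
    max_tokens=0,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "warmup"}],
)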
Breakpoint placement: put cache_control on the last block shared with the real request (the system prompt or tool definitions) — not on the placeholder user message, and not via top-level automatic caching (which would key the cache to the placeholder). The placeholder can be any non-whitespace string; it's read during prefill but never answered.
Rejected combinations: max_tokens: 0 is an invalid_request_error with stream: true, thinking.type: "enabled", output_config.format, tool_choice of {"type":"tool"} or {"type":"any"}, or inside a Message Batches request.
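These combinations can be caught before the request is sent. The sketch below checks a plain dict of request arguments against the list above; prewarm_conflicts is a hypothetical helper, and it cannot detect the Message Batches case because that is not visible in the request body.

```python
def prewarm_conflicts(request):
    """List the fields in `request` that are rejected together with max_tokens: 0."""
    if request.get("max_tokens") != 0:
        return []
    conflicts = []
    if request.get("stream") is True:
        conflicts.append("stream")
    if (request.get("thinking") or {}).get("type") == "enabled":
        conflicts.append("thinking")
    if (request.get("output_config") or {}).get("format") is not None:
        conflicts.append("output_config.format")
    if (request.get("tool_choice") or {}).get("type") in ("tool", "any"):
        conflicts.append("tool_choice")
    return conflicts


problems = prewarm_conflicts(
    {"max_tokens": 0, "stream": True, "tool_choice": {"type": "any"}}
)
```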
TTL still applies — re-warm at least every 5 minutes for the default cache, or use the 1-hour TTL. This replaces the older max_tokens: 1 workaround (no single-token reply to discard, no output tokens billed, intent is unambiguous).