# Prompt Caching — Design & Optimization This file covers how to design prompt-building code for effective caching. For language-specific syntax, see the `## Prompt Caching` section in each language's README or single-file doc. ## The one invariant everything follows from **Prompt caching is a prefix match. Any change anywhere in the prefix invalidates everything after it.** The cache key is derived from the exact bytes of the rendered prompt up to each `cache_control` breakpoint. A single byte difference at position N — a timestamp, a reordered JSON key, a different tool in the list — invalidates the cache for all breakpoints at positions ≥ N. Render order is: `tools` → `system` → `messages`. A breakpoint on the last system block caches both tools and system together. Design the prompt-building path around this constraint. Get the ordering right and most caching works for free. Get it wrong and no amount of `cache_control` markers will help. --- ## Workflow for optimizing existing code When asked to add or optimize caching: 1. **Trace the prompt assembly path.** Find where `system`, `tools`, and `messages` are constructed. Identify every input that flows into them. 2. **Classify each input by stability:** - Never changes → belongs early in the prompt, before any breakpoint - Changes per-session → belongs after the global prefix, cache per-session - Changes per-turn → belongs at the end, after the last breakpoint - Changes per-request (timestamps, UUIDs, random IDs) → **eliminate or move to the very end** 3. **Check rendered order matches stability order.** Stable content must physically precede volatile content. If a timestamp is interpolated into the system prompt header, everything after it is uncacheable regardless of markers. 4. **Place breakpoints at stability boundaries.** See placement patterns below. 5. **Audit for silent invalidators.** See anti-patterns table. --- ## Placement patterns ### Large system prompt shared across many requests Put a breakpoint on the last system text block. If there are tools, they render before system — the marker on the last system block caches tools + system together. ```json "system": [ {"type": "text", "text": "", "cache_control": {"type": "ephemeral"}} ] ``` ### Multi-turn conversations Put a breakpoint on the last content block of the most-recently-appended turn. Each subsequent request reuses the entire prior conversation prefix. Earlier breakpoints remain valid read points, so hits accrue incrementally as the conversation grows. ```json // Last content block of the last user turn messages[-1].content[-1].cache_control = {"type": "ephemeral"} ``` ### Shared prefix, varying suffix Many requests share a large fixed preamble (few-shot examples, retrieved docs, instructions) but differ in the final question. Put the breakpoint at the end of the **shared** portion, not at the end of the whole prompt — otherwise every request writes a distinct cache entry and nothing is ever read. ```json "messages": [{"role": "user", "content": [ {"type": "text", "text": "", "cache_control": {"type": "ephemeral"}}, {"type": "text", "text": ""} // no marker — differs every time ]}] ``` ### Mid-conversation system messages **Beta, model-gated.** When an operator instruction arrives mid-conversation — a mode switch, updated context, dynamically injected state — send it as `{"role": "system", "content": "..."}` appended to `messages[]`, rather than editing top-level `system`. Editing top-level `system` changes the prefix ahead of the entire conversation history, so every cached turn is re-processed uncached; a `role: "system"` message sits after the history and leaves the cached prefix intact. ```json // Top-level system stays byte-identical; new instruction goes after the cached history "system": [{"type": "text", "text": "", "cache_control": {"type": "ephemeral"}}], "messages": [ ...history, {"role": "user", "content": "..."}, {"role": "system", "content": "Terse mode enabled — keep responses under 40 words."} ] ``` This is also the prompt-injection-safe replacement for embedding operator instructions as text inside a user turn (the `` pattern): both have the same caching profile, but `role: "system"` is the non-spoofable operator channel, whereas text inside user/tool content can be forged by anything that writes to user-visible input. Requires `anthropic-beta: mid-conversation-system-2026-04-07`. Must follow a `role: "user"` message (or an assistant message ending in a server tool result); cannot be `messages[0]` — use top-level `system` for the initial prompt. Content is text-only. Model-gated — unsupported models return a 400 (`BadRequestError`: `role 'system' is not supported on this model`); catch that error and fall back to putting the instruction in a user-turn `` block. ### Prompts that change from the beginning every time Don't cache. If the first 1K tokens differ per request, there is no reusable prefix. Adding `cache_control` only pays the cache-write premium with zero reads. Leave it off. --- ## Architectural guidance These are the decisions that matter more than marker placement. Fix these first. **Keep the system prompt frozen.** Don't interpolate "current date: X", "mode: Y", "user name: Z" into the system prompt — those sit at the front of the prefix and invalidate everything downstream. Inject dynamic context later in `messages` instead — as a `{"role": "system", ...}` message where supported (see § Mid-conversation system messages above), or as text in a user message otherwise. A message at turn 5 invalidates nothing before turn 5. **Don't change tools or model mid-conversation.** Tools render at position 0; adding, removing, or reordering a tool invalidates the entire cache. Same for switching models (caches are model-scoped). If you need "modes", don't swap the tool set — give Claude a tool that records the mode transition, or pass the mode as message content. Serialize tools deterministically (sort by name). **Fork operations must reuse the parent's exact prefix.** Side computations (summarization, compaction, sub-agents) often spin up a separate API call. If the fork rebuilds `system` / `tools` / `model` with any difference, it misses the parent's cache entirely. Copy the parent's `system`, `tools`, and `model` verbatim, then append fork-specific content at the end. --- ## Silent invalidators When reviewing code, grep for these inside anything that feeds the prompt prefix: | Pattern | Why it breaks caching | |---|---| | `datetime.now()` / `Date.now()` / `time.time()` in system prompt | Prefix changes every request | | `uuid4()` / `crypto.randomUUID()` / request IDs early in content | Same — every request is unique | | `json.dumps(d)` without `sort_keys=True` / iterating a `set` | Non-deterministic serialization → prefix bytes differ | | f-string interpolating session/user ID into system prompt | Per-user prefix; no cross-user sharing | | Conditional system sections (`if flag: system += ...`) | Every flag combination is a distinct prefix | | `tools=build_tools(user)` where set varies per user | Tools render at position 0; nothing caches across users | Fix by moving the dynamic piece after the last breakpoint, making it deterministic, or deleting it if it's not load-bearing. --- ## API reference ```json "cache_control": {"type": "ephemeral"} // 5-minute TTL (default) "cache_control": {"type": "ephemeral", "ttl": "1h"} // 1-hour TTL ``` - Max **4** `cache_control` breakpoints per request. - Goes on any content block: system text blocks, tool definitions, message content blocks (`text`, `image`, `tool_use`, `tool_result`, `document`). - Top-level `cache_control` on `messages.create()` auto-places on the last cacheable block — simplest option when you don't need fine-grained placement. - Minimum cacheable prefix is model-dependent. Shorter prefixes silently won't cache even with a marker — no error, just `cache_creation_input_tokens: 0`: | Model | Minimum | |---|---:| | Opus 4.8, Opus 4.7, Opus 4.6, Opus 4.5, Haiku 4.5 | 4096 tokens | | Sonnet 4.6, Haiku 3.5, Haiku 3 | 2048 tokens | | Sonnet 4.5, Sonnet 4.1, Sonnet 4, Sonnet 3.7 | 1024 tokens | A 3K-token prompt caches on Sonnet 4.5 but silently won't on Opus 4.8. **Economics:** Cache reads cost ~0.1× base input price. Cache writes cost **1.25× for 5-minute TTL, 2× for 1-hour TTL**. Break-even depends on TTL: with 5-minute TTL, two requests break even (1.25× + 0.1× = 1.35× vs 2× uncached); with 1-hour TTL, you need at least three requests (2× + 0.2× = 2.2× vs 3× uncached). The 1-hour TTL keeps entries alive across gaps in bursty traffic, but the doubled write cost means it needs more reads to pay off. --- ## Verifying cache hits The response `usage` object reports cache activity: | Field | Meaning | |---|---| | `cache_creation_input_tokens` | Tokens written to cache this request (you paid the ~1.25× write premium) | | `cache_read_input_tokens` | Tokens served from cache this request (you paid ~0.1×) | | `input_tokens` | Tokens processed at full price (not cached) | If `cache_read_input_tokens` is zero across repeated requests with identical prefixes, a silent invalidator is at work — diff the rendered prompt bytes between two requests to find it. **`input_tokens` is the uncached remainder only.** Total prompt size = `input_tokens + cache_creation_input_tokens + cache_read_input_tokens`. If your agent ran for hours but `input_tokens` shows 4K, the rest was served from cache — check the sum, not the single field. Language-specific access: `response.usage.cache_read_input_tokens` (Python/TS/Ruby), `$message->usage->cacheReadInputTokens` (PHP), `resp.Usage.CacheReadInputTokens` (Go/C#), `.usage().cacheReadInputTokens()` (Java). --- ## Invalidation hierarchy Not every parameter change invalidates everything. The API has three cache tiers, and changes only invalidate their own tier and below: | Change | Tools cache | System cache | Messages cache | |---|:---:|:---:|:---:| | Tool definitions (add/remove/reorder) | ❌ | ❌ | ❌ | | Model switch | ❌ | ❌ | ❌ | | `speed`, web-search, citations toggle | ✅ | ❌ | ❌ | | System prompt content | ✅ | ❌ | ❌ | | `tool_choice`, images, `thinking` enable/disable | ✅ | ✅ | ❌ | | Message content | ✅ | ✅ | ❌ | Implication: you can change `tool_choice` per-request or toggle `thinking` without losing the tools+system cache. Don't over-worry about these — only tool-definition and model changes force a full rebuild. --- ## 20-block lookback window Each breakpoint walks backward **at most 20 content blocks** to find a prior cache entry. If a single turn adds more than 20 blocks (common in agentic loops with many tool_use/tool_result pairs), the next request's breakpoint won't find the previous cache and silently misses. Fix: place an intermediate breakpoint every ~15 blocks in long turns, or put the marker on a block that's within 20 of the previous turn's last cached block. --- ## Concurrent-request timing A cache entry becomes readable only after the first response **begins streaming**. N parallel requests with identical prefixes all pay full price — none can read what the others are still writing. For fan-out patterns: send 1 request, await the first streamed token (not the full response), then fire the remaining N−1. They'll read the cache the first one just wrote. ## Pre-warming the cache To eliminate the cache-miss latency on the *first* real request, send a **`max_tokens: 0`** request at startup (or on an interval). The API runs prefill — writing the cache at your `cache_control` breakpoint — and returns immediately with `content: []`, `stop_reason: "max_tokens"`, and a populated `usage` block (zero output tokens billed; normal cache-write charge on `cache_creation_input_tokens`). **When to pre-warm** — pre-warming trades a cache-write charge *now* for lower TTFT on the *next* real request. It's worth it when all three hold: (a) first-request latency is user-visible (chat/voice/interactive — not background jobs), (b) the shared prefix is large enough that a cold write is noticeably slow, and (c) there's a moment *before* traffic to fire it — app startup, worker boot, post-deploy, start of a scheduled window. | Skip pre-warming when… | Because | |---|---| | Traffic is continuous (requests ≤ TTL apart) | The first real request warms the cache and every subsequent one hits it; a separate warm call is a pure extra write | | The prefix is small or below the cacheable minimum | The cold-write penalty is negligible | | The prefix varies per request/user | Nothing shared to pre-warm | | You'd pre-warm many distinct prefixes speculatively | Each is a ~1.25× write; cost can exceed the latency you save | **Scheduled re-warms:** only needed when traffic has gaps longer than the TTL. If real requests arrive more often than every 5 minutes, they keep the cache warm on their own — don't add an interval re-warm. For bursty traffic with long idle gaps, either re-warm just under the TTL or switch to `ttl: "1h"` and re-warm less often. ```python client.messages.create( model="{{OPUS_ID}}", max_tokens=0, system=[{ "type": "text", "text": SYSTEM_PROMPT, "cache_control": {"type": "ephemeral"}, }], messages=[{"role": "user", "content": "warmup"}], ) ``` **Breakpoint placement:** put `cache_control` on the **last block shared with the real request** (the system prompt or tool definitions) — **not** on the placeholder user message, and **not** via top-level automatic caching (which would key the cache to the placeholder). The placeholder can be any non-whitespace string; it's read during prefill but never answered. **Rejected combinations:** `max_tokens: 0` is an `invalid_request_error` with `stream: true`, `thinking.type: "enabled"`, `output_config.format`, `tool_choice` of `{"type":"tool"}` or `{"type":"any"}`, or inside a Message Batches request. **TTL still applies** — re-warm at least every 5 minutes for the default cache, or use the 1-hour TTL. This replaces the older `max_tokens: 1` workaround (no single-token reply to discard, no output tokens billed, intent is unambiguous).