claude-code-system-prompts/system-prompts/data-prompt-caching-design-optimization.md
2026-03-25 13:29:27 -06:00

Prompt Caching — Design & Optimization

This file covers how to design prompt-building code for effective caching. For language-specific syntax, see the ## Prompt Caching section in each language's README or single-file doc.

The one invariant everything follows from

Prompt caching is a prefix match. Any change anywhere in the prefix invalidates everything after it.

The cache key is derived from the exact bytes of the rendered prompt up to each cache_control breakpoint. A single byte difference at position N — a timestamp, a reordered JSON key, a different tool in the list — invalidates the cache for all breakpoints at positions ≥ N.

Render order is: tools → system → messages. A breakpoint on the last system block caches both tools and system together.

Design the prompt-building path around this constraint. Get the ordering right and most caching works for free. Get it wrong and no amount of cache_control markers will help.
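To make the prefix rule concrete, here is a minimal sketch that locates the first byte at which two rendered prompts diverge. render_prefix is a hypothetical stand-in for the real server-side rendering (its exact byte layout is an assumption); only the tools, system, messages ordering is taken from the text above.

```python
import json

def render_prefix(tools, system, messages):
    # Hypothetical stand-in for the real rendering: tools, then system, then messages.
    parts = (tools, system, messages)
    return "".join(json.dumps(p, sort_keys=True) for p in parts).encode("utf-8")

def first_divergence(a: bytes, b: bytes) -> int:
    """Index of the first differing byte, or -1 if the inputs are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))

tools = [{"name": "search", "description": "Search the docs"}]
messages = [{"role": "user", "content": "hi"}]
# Same system prompt except for an interpolated date.
system_a = [{"type": "text", "text": "Be concise. Today is 2026-03-25."}]
system_b = [{"type": "text", "text": "Be concise. Today is 2026-03-26."}]

a = render_prefix(tools, system_a, messages)
b = render_prefix(tools, system_b, messages)
tools_end = len(json.dumps(tools, sort_keys=True).encode("utf-8"))
pos = first_divergence(a, b)
print(f"tools end at byte {tools_end}, first difference at byte {pos}")
```

In this sketch the tools bytes are identical, so a breakpoint placed before the difference would still hit; any breakpoint at or after the differing byte would miss.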


Workflow for optimizing existing code

When asked to add or optimize caching:

  1. Trace the prompt assembly path. Find where system, tools, and messages are constructed. Identify every input that flows into them.
  2. Classify each input by stability:
    • Never changes → belongs early in the prompt, before any breakpoint
    • Changes per-session → belongs after the global prefix, cache per-session
    • Changes per-turn → belongs at the end, after the last breakpoint
    • Changes per-request (timestamps, UUIDs, random IDs) → eliminate or move to the very end
  3. Check rendered order matches stability order. Stable content must physically precede volatile content. If a timestamp is interpolated into the system prompt header, everything after it is uncacheable regardless of markers.
  4. Place breakpoints at stability boundaries. See placement patterns below.
  5. Audit for silent invalidators. See anti-patterns table.
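Steps 2 and 3 can be sketched as a small check over the inputs in their rendered order. The Stability enum, PromptInput, and both helpers are hypothetical names introduced for illustration; they are not part of any SDK.

```python
from dataclasses import dataclass
from enum import IntEnum

class Stability(IntEnum):
    NEVER = 0        # never changes
    PER_SESSION = 1  # changes per session
    PER_TURN = 2     # changes per turn
    PER_REQUEST = 3  # timestamps, UUIDs, random IDs

@dataclass
class PromptInput:
    name: str
    stability: Stability

def out_of_order(rendered):
    """Names of inputs that render after a more volatile input."""
    offenders = []
    worst = Stability.NEVER
    for item in rendered:
        if item.stability < worst:
            offenders.append(item.name)
        worst = max(worst, item.stability)
    return offenders

def order_by_stability(inputs):
    """Stable sort: most stable first, most volatile last."""
    return sorted(inputs, key=lambda i: i.stability)

# A timestamp interpolated at the top makes everything after it uncacheable.
rendered = [
    PromptInput("timestamp", Stability.PER_REQUEST),
    PromptInput("instructions", Stability.NEVER),
    PromptInput("user_profile", Stability.PER_SESSION),
    PromptInput("question", Stability.PER_TURN),
]
print(out_of_order(rendered))
print([i.name for i in order_by_stability(rendered)])
```

Every name the check reports sits behind something more volatile than itself, which is exactly the content that no marker can rescue.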

Placement patterns

Large system prompt shared across many requests

Put a breakpoint on the last system text block. If there are tools, they render before system — the marker on the last system block caches tools + system together.

"system": [
  {"type": "text", "text": "<large shared prompt>", "cache_control": {"type": "ephemeral"}}
]
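When the system prompt is assembled from several blocks, a helper can guarantee that exactly one marker ends up on the last block. This is a sketch that builds plain request dictionaries only; mark_last_system_block and build_request are hypothetical helpers, and "<model-id>" is a placeholder.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}

def mark_last_system_block(system_blocks):
    """Return a copy with exactly one marker, on the last system block."""
    blocks = copy.deepcopy(system_blocks)
    for block in blocks:
        block.pop("cache_control", None)
    if blocks:
        blocks[-1]["cache_control"] = dict(EPHEMERAL)
    return blocks

def build_request(system_blocks, tools, user_text, model="<model-id>"):
    # "<model-id>" is a placeholder; substitute the model you actually use.
    return {
        "model": model,
        "max_tokens": 1024,
        "tools": tools,
        "system": mark_last_system_block(system_blocks),
        "messages": [{"role": "user", "content": user_text}],
    }

system_blocks = [
    {"type": "text", "text": "<policies>"},
    {"type": "text", "text": "<large shared prompt>"},
]
first = build_request(system_blocks, [], "What is the refund policy?")
second = build_request(system_blocks, [], "How do I reset my password?")
```

Both requests carry identical tools and system content, so only the user message differs between them.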

Multi-turn conversations

Put a breakpoint on the last content block of the most-recently-appended turn. Each subsequent request reuses the entire prior conversation prefix. Earlier breakpoints remain valid read points, so hits accrue incrementally as the conversation grows.

# Last content block of the last user turn (content must be a list of blocks, not a bare string)
messages[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
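A rolling breakpoint can be applied just before each request. The sketch below is one possible approach under two assumptions: string content is normalized to a list of text blocks so a marker has somewhere to go, and older markers inside messages are dropped to stay under the per-request breakpoint limit. with_rolling_breakpoint is a hypothetical helper name.

```python
import copy

def with_rolling_breakpoint(messages):
    """Return a copy of messages with one marker on the last block of the last turn."""
    msgs = copy.deepcopy(messages)
    for m in msgs:
        if isinstance(m["content"], str):
            # Normalize bare strings so every turn is a list of blocks.
            m["content"] = [{"type": "text", "text": m["content"]}]
        for block in m["content"]:
            block.pop("cache_control", None)  # drop markers from earlier requests
    if msgs and msgs[-1]["content"]:
        msgs[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return msgs

history = [
    {"role": "user", "content": "Summarize chapter 1."},
    {"role": "assistant", "content": [{"type": "text", "text": "Chapter 1 covers..."}]},
    {"role": "user", "content": "Now chapter 2."},
]
request_messages = with_rolling_breakpoint(history)
```

Call it on the full history each turn; the stored history itself stays unmarked, so the marker never accumulates.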

Shared prefix, varying suffix

Many requests share a large fixed preamble (few-shot examples, retrieved docs, instructions) but differ in the final question. Put the breakpoint at the end of the shared portion, not at the end of the whole prompt — otherwise every request writes a distinct cache entry and nothing is ever read.

"messages": [{"role": "user", "content": [
  {"type": "text", "text": "<shared context>", "cache_control": {"type": "ephemeral"}},
  {"type": "text", "text": "<varying question>"}  // no marker — differs every time
]}]

Prompts that change from the beginning every time

Don't cache. If the first 1K tokens differ per request, there is no reusable prefix. Adding cache_control only pays the cache-write premium with zero reads. Leave it off.


Architectural guidance

These are the decisions that matter more than marker placement. Fix these first.

Keep the system prompt frozen. Don't interpolate "current date: X", "mode: Y", "user name: Z" into the system prompt — those sit at the front of the prefix and invalidate everything downstream. Inject dynamic context as a user or assistant message later in messages. A message at turn 5 invalidates nothing before turn 5.
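As an illustration of the frozen-system rule, the sketch below keeps the system blocks constant and carries the date and mode in the newest user turn instead. FROZEN_SYSTEM, user_turn, and build_request are hypothetical names, and the context wording is an assumption.

```python
import datetime

# Built once; never interpolated per request.
FROZEN_SYSTEM = [
    {"type": "text", "text": "You are a support assistant.",
     "cache_control": {"type": "ephemeral"}},
]

def user_turn(question, *, today, mode):
    """Carry per-request context inside the message, not the system prompt."""
    context = f"Current date: {today.isoformat()}\nMode: {mode}"
    return {"role": "user", "content": [
        {"type": "text", "text": context},
        {"type": "text", "text": question},
    ]}

def build_request(history, question, *, today, mode):
    return {
        "system": FROZEN_SYSTEM,
        "messages": history + [user_turn(question, today=today, mode=mode)],
    }

r1 = build_request([], "Where is my order?", today=datetime.date(2026, 3, 25), mode="draft")
r2 = build_request([], "Where is my order?", today=datetime.date(2026, 3, 26), mode="draft")
```

The two requests differ only in the final user turn, so the system prefix renders to the same bytes on both days.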

Don't change tools or model mid-conversation. Tools render at position 0; adding, removing, or reordering a tool invalidates the entire cache. Same for switching models (caches are model-scoped). If you need "modes", don't swap the tool set — give Claude a tool that records the mode transition, or pass the mode as message content. Serialize tools deterministically (sort by name).
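Deterministic tool serialization can be sketched as follows. canonical_tools and tools_fingerprint are hypothetical helpers; the fingerprint is only a local way to compare two tool lists and says nothing about how the server derives its cache key.

```python
import json

def canonical_tools(tools):
    """Sort by name so list order never depends on registration order."""
    return sorted(tools, key=lambda t: t["name"])

def tools_fingerprint(tools):
    # sort_keys fixes key order inside each definition; sorting fixes list order.
    return json.dumps(canonical_tools(tools), sort_keys=True).encode("utf-8")

search = {"name": "search", "description": "Search the docs",
          "input_schema": {"type": "object", "properties": {}}}
# Same tool as a differently ordered dict, as a plugin loader might produce.
fetch = {"input_schema": {"properties": {}, "type": "object"},
         "description": "Fetch a URL", "name": "fetch"}

one_order = tools_fingerprint([search, fetch])
other_order = tools_fingerprint([fetch, search])
```

Pass canonical_tools(...) to the request itself, not just to the fingerprint, so the list that is sent is the one that was compared.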

Fork operations must reuse the parent's exact prefix. Side computations (summarization, compaction, sub-agents) often spin up a separate API call. If the fork rebuilds system / tools / model with any difference, it misses the parent's cache entirely. Copy the parent's system, tools, and model verbatim, then append fork-specific content at the end.
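A fork that preserves the parent's prefix might look like the sketch below. fork_request is a hypothetical helper, and "<model-id>" is a placeholder; the point is that model, tools, system, and prior messages are copied rather than rebuilt.

```python
import copy

def fork_request(parent, fork_messages):
    """Build a side-call request that shares the parent's exact prefix."""
    fork = copy.deepcopy(parent)  # model, tools, system, messages copied verbatim
    fork["messages"].extend(copy.deepcopy(fork_messages))  # fork-specific content goes last
    return fork

parent = {
    "model": "<model-id>",
    "max_tokens": 1024,
    "tools": [{"name": "search", "description": "Search the docs",
               "input_schema": {"type": "object", "properties": {}}}],
    "system": [{"type": "text", "text": "<large shared prompt>",
                "cache_control": {"type": "ephemeral"}}],
    "messages": [
        {"role": "user", "content": "Investigate the failing build."},
        {"role": "assistant", "content": "The linker step fails on missing symbols."},
    ],
}
summary_call = fork_request(parent, [
    {"role": "user", "content": "Summarize the conversation so far."},
])
```

Rebuilding the system prompt for the summarizer from a separate template, even a nearly identical one, would defeat this.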


Silent invalidators

When reviewing code, grep for these inside anything that feeds the prompt prefix:

| Pattern | Why it breaks caching |
|---|---|
| datetime.now() / Date.now() / time.time() in system prompt | Prefix changes every request |
| uuid4() / crypto.randomUUID() / request IDs early in content | Same: every request is unique |
| json.dumps(d) without sort_keys=True / iterating a set | Non-deterministic serialization → prefix bytes differ |
| f-string interpolating session/user ID into system prompt | Per-user prefix; no cross-user sharing |
| Conditional system sections (if flag: system += ...) | Every flag combination is a distinct prefix |
| tools=build_tools(user) where set varies per user | Tools render at position 0; nothing caches across users |

Fix by moving the dynamic piece after the last breakpoint, making it deterministic, or deleting it if it's not load-bearing.
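The grep can be approximated with a small scanner. This is a rough sketch: the patterns are hypothetical heuristics covering only some of the patterns listed in this section, they match text rather than data flow, and every hit still needs a human to decide whether it feeds the prompt prefix.

```python
import re

SUSPECT_PATTERNS = {
    "clock": re.compile(r"datetime\.now\(|Date\.now\(|time\.time\("),
    "random id": re.compile(r"uuid4\(|randomUUID\("),
    # json.dumps( not followed by sort_keys=True before the closing parenthesis
    "unsorted json": re.compile(r"json\.dumps\((?![^)]*sort_keys\s*=\s*True)"),
}

def scan(source):
    """Return (line number, label) pairs for lines matching a suspect pattern."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, label))
    return hits

sample = '''system = f"Now: {datetime.now()}"
payload = json.dumps(config)
stable = json.dumps(config, sort_keys=True)
request_id = uuid4()
'''
for lineno, label in scan(sample):
    print(f"line {lineno}: {label}")
```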


API reference

"cache_control": {"type": "ephemeral"}              // 5-minute TTL (default)
"cache_control": {"type": "ephemeral", "ttl": "1h"} // 1-hour TTL
  • Max 4 cache_control breakpoints per request.
  • Goes on any content block: system text blocks, tool definitions, message content blocks (text, image, tool_use, tool_result, document).
  • Top-level cache_control on messages.create() auto-places on the last cacheable block — simplest option when you don't need fine-grained placement.
  • Minimum cacheable prefix is model-dependent (typically 1024–2048 tokens). Shorter prefixes silently won't cache even with a marker.

Economics: Cache writes cost ~1.25× base input price; reads cost ~0.1×. A prefix must be used in at least two requests within the TTL to pay off: the first request writes the cache and later ones read it, so two requests cost about 1.35× the prefix's base price instead of 2×. For bursty traffic, the 1-hour TTL keeps entries alive across gaps.
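The arithmetic can be worked through with the approximate multipliers quoted above. This sketch assumes every request lands within the TTL, so there is one write followed only by reads, and it expresses cost in units of the prefix's base input price; the function names are hypothetical.

```python
WRITE_MULT = 1.25  # approximate cache-write multiplier from the text above
READ_MULT = 0.10   # approximate cache-read multiplier from the text above

def cached_cost(n_requests):
    """Prefix cost over n requests: one write, then reads."""
    if n_requests <= 0:
        return 0.0
    return WRITE_MULT + READ_MULT * (n_requests - 1)

def uncached_cost(n_requests):
    """Prefix cost over n requests with no caching."""
    return float(max(n_requests, 0))

for n in (1, 2, 10):
    print(n, round(cached_cost(n), 2), uncached_cost(n))
```

A single request pays the write premium for nothing; from the second request on, caching is cheaper and the gap widens with every read.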


Verifying cache hits

The response usage object reports cache activity:

| Field | Meaning |
|---|---|
| cache_creation_input_tokens | Tokens written to cache this request (you paid the ~1.25× write premium) |
| cache_read_input_tokens | Tokens served from cache this request (you paid ~0.1×) |
| input_tokens | Tokens processed at full price (not cached) |

If cache_read_input_tokens is zero across repeated requests with identical prefixes, either the prefix is below the model's minimum cacheable length or a silent invalidator is at work. Diff the rendered prompt bytes between two requests to find the invalidator.
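A quick way to watch these fields over a run of requests is sketched below. It takes usage as plain dicts keyed by the field names in the table; how you convert your SDK's usage object to a dict is left out, and the token counts are made-up sample values. cache_hit_ratio and never_hits are hypothetical helper names.

```python
def cache_hit_ratio(usage):
    """Fraction of input tokens served from cache for one response."""
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0

def never_hits(usages):
    """True if no request after the first read anything from cache."""
    return len(usages) > 1 and all(
        (u.get("cache_read_input_tokens") or 0) == 0 for u in usages[1:]
    )

# Made-up sample values: first request writes, second reads.
healthy = [
    {"input_tokens": 12, "cache_creation_input_tokens": 3000, "cache_read_input_tokens": 0},
    {"input_tokens": 15, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 3000},
]
# Made-up sample values: every request rewrites the cache.
broken = [
    {"input_tokens": 12, "cache_creation_input_tokens": 3000, "cache_read_input_tokens": 0},
    {"input_tokens": 15, "cache_creation_input_tokens": 3000, "cache_read_input_tokens": 0},
]
print(never_hits(healthy), never_hits(broken))
```

A run that writes on every request and never reads, like the second sample, is the signature to look for.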

Language-specific access: response.usage.cache_read_input_tokens (Python/TS/Ruby), $message->usage->cacheReadInputTokens (PHP), resp.Usage.CacheReadInputTokens (Go/C#), .usage().cacheReadInputTokens() (Java).