mirror of https://github.com/Piebald-AI/claude-code-system-prompts.git synced 2026-05-30 21:54:18 +08:00

Mike a9eee87708 v2.1.83 (+5,960 tokens)

2026-03-25 13:29:27 -06:00

7.4 KiB

Raw Blame History

Prompt Caching — Design & Optimization

This file covers how to design prompt-building code for effective caching. For language-specific syntax, see the ## Prompt Caching section in each language's README or single-file doc.

The one invariant everything follows from

Prompt caching is a prefix match. Any change anywhere in the prefix invalidates everything after it.

The cache key is derived from the exact bytes of the rendered prompt up to each cache_control breakpoint. A single byte difference at position N — a timestamp, a reordered JSON key, a different tool in the list — invalidates the cache for all breakpoints at positions ≥ N.

Render order is: tools → system → messages. A breakpoint on the last system block caches both tools and system together.

Design the prompt-building path around this constraint. Get the ordering right and most caching works for free. Get it wrong and no amount of cache_control markers will help.

Workflow for optimizing existing code

When asked to add or optimize caching:

Trace the prompt assembly path. Find where system, tools, and messages are constructed. Identify every input that flows into them.
Classify each input by stability:
- Never changes → belongs early in the prompt, before any breakpoint
- Changes per-session → belongs after the global prefix, cache per-session
- Changes per-turn → belongs at the end, after the last breakpoint
- Changes per-request (timestamps, UUIDs, random IDs) → eliminate or move to the very end
Check rendered order matches stability order. Stable content must physically precede volatile content. If a timestamp is interpolated into the system prompt header, everything after it is uncacheable regardless of markers.
Place breakpoints at stability boundaries. See placement patterns below.
Audit for silent invalidators. See anti-patterns table.

Placement patterns

Large system prompt shared across many requests

Put a breakpoint on the last system text block. If there are tools, they render before system — the marker on the last system block caches tools + system together.

"system": [
  {"type": "text", "text": "<large shared prompt>", "cache_control": {"type": "ephemeral"}}
]

Multi-turn conversations

Put a breakpoint on the last content block of the most-recently-appended turn. Each subsequent request reuses the entire prior conversation prefix. Earlier breakpoints remain valid read points, so hits accrue incrementally as the conversation grows.

// Last content block of the last user turn
messages[-1].content[-1].cache_control = {"type": "ephemeral"}

Shared prefix, varying suffix

Many requests share a large fixed preamble (few-shot examples, retrieved docs, instructions) but differ in the final question. Put the breakpoint at the end of the shared portion, not at the end of the whole prompt — otherwise every request writes a distinct cache entry and nothing is ever read.

"messages": [{"role": "user", "content": [
  {"type": "text", "text": "<shared context>", "cache_control": {"type": "ephemeral"}},
  {"type": "text", "text": "<varying question>"}  // no marker — differs every time
]}]

Prompts that change from the beginning every time

Don't cache. If the first 1K tokens differ per request, there is no reusable prefix. Adding cache_control only pays the cache-write premium with zero reads. Leave it off.

Architectural guidance

These are the decisions that matter more than marker placement. Fix these first.

Keep the system prompt frozen. Don't interpolate "current date: X", "mode: Y", "user name: Z" into the system prompt — those sit at the front of the prefix and invalidate everything downstream. Inject dynamic context as a user or assistant message later in messages. A message at turn 5 invalidates nothing before turn 5.

Don't change tools or model mid-conversation. Tools render at position 0; adding, removing, or reordering a tool invalidates the entire cache. Same for switching models (caches are model-scoped). If you need "modes", don't swap the tool set — give Claude a tool that records the mode transition, or pass the mode as message content. Serialize tools deterministically (sort by name).

Fork operations must reuse the parent's exact prefix. Side computations (summarization, compaction, sub-agents) often spin up a separate API call. If the fork rebuilds system / tools / model with any difference, it misses the parent's cache entirely. Copy the parent's system, tools, and model verbatim, then append fork-specific content at the end.

Silent invalidators

When reviewing code, grep for these inside anything that feeds the prompt prefix:

Pattern	Why it breaks caching
`datetime.now()` / `Date.now()` / `time.time()` in system prompt	Prefix changes every request
`uuid4()` / `crypto.randomUUID()` / request IDs early in content	Same — every request is unique
`json.dumps(d)` without `sort_keys=True` / iterating a `set`	Non-deterministic serialization → prefix bytes differ
f-string interpolating session/user ID into system prompt	Per-user prefix; no cross-user sharing
Conditional system sections (`if flag: system += ...`)	Every flag combination is a distinct prefix
`tools=build_tools(user)` where set varies per user	Tools render at position 0; nothing caches across users

Fix by moving the dynamic piece after the last breakpoint, making it deterministic, or deleting it if it's not load-bearing.

API reference

"cache_control": {"type": "ephemeral"}              // 5-minute TTL (default)
"cache_control": {"type": "ephemeral", "ttl": "1h"} // 1-hour TTL

Max 4 cache_control breakpoints per request.
Goes on any content block: system text blocks, tool definitions, message content blocks (text, image, tool_use, tool_result, document).
Top-level cache_control on messages.create() auto-places on the last cacheable block — simplest option when you don't need fine-grained placement.
Minimum cacheable prefix is model-dependent (typically 1024–2048 tokens). Shorter prefixes silently won't cache even with a marker.

Economics: Cache writes cost ~1.25× base input price; reads cost ~0.1×. A prefix must be used in at least two requests within TTL to break even (one writes the cache, subsequent ones read it). For bursty traffic, the 1-hour TTL keeps entries alive across gaps.

Verifying cache hits

The response usage object reports cache activity:

Field	Meaning
`cache_creation_input_tokens`	Tokens written to cache this request (you paid the ~1.25× write premium)
`cache_read_input_tokens`	Tokens served from cache this request (you paid ~0.1×)
`input_tokens`	Tokens processed at full price (not cached)

If cache_read_input_tokens is zero across repeated requests with identical prefixes, a silent invalidator is at work — diff the rendered prompt bytes between two requests to find it.

Language-specific access: response.usage.cache_read_input_tokens (Python/TS/Ruby), $message->usage->cacheReadInputTokens (PHP), resp.Usage.CacheReadInputTokens (Go/C#), .usage().cacheReadInputTokens() (Java).

7.4 KiB Raw Blame History Unescape Escape