Prefix Cache

How prompt shape, tool schemas and routing decide whether cache works in production.

Problem

Cache can be enabled while cached_tokens stays low. The failure is often not the provider, but the request shape.

Symptoms

Timestamps, request IDs or tenant fields appear near the start.
Tool lists change order or content between turns.
Agent loops use shorter dynamic prompts that cost more after cache misses.
Round-robin routing spreads warm prefixes across replicas.

Mental model

Prefix cache needs three things: the beginning must match, the request must hit a place where the prefix is warm, and the cache must live long enough to be reused.

Stabilize system prompt, tool schemas and response formats. Move dynamic fields late or into metadata. Use canonical serialization. Preserve append-only conversation growth when possible. Route repeated prefixes with affinity or prefix-aware routing.

Metrics

Track cached tokens, cache read input tokens, prefix hash, tools count, schema version, TTFT, route, replica, prompt version, warmup count and cache eviction indicators.

Trade-offs

A large stable tool list can be cheaper than a small dynamic list in an agent loop. But large tool lists can also reduce model tool-choice quality, especially on smaller models.

Agentic Tool Surface

In an agent, do not optimize one request in isolation from the whole trajectory. Dynamic tool selection can look cheaper on one step, but if it changes the early prefix, the agent pays for cache misses on every following step. Stabilize prefix first, then reduce prompt size.

Pattern	Risk	Better
Dynamic tools on every step	new prefix_hash	stable tools + masking
Different tool order	cache miss without semantic change	canonical order
request_id in schema	unique prefix	trace metadata outside prompt
route-specific system prompt	fragmentation	prompt families
tool filtering without evals	cheaper but lower quality	router eval + cache metrics

Self-hosted Nuance

In a self-hosted boundary, cache hit rate depends not only on prompt shape, but also scheduler behavior, sticky routing, replica count, prefill/decode split and warmup policy. After scale-out, plain round-robin can hurt cacheability more than a prompt change.

Anti-patterns

Current time in the first system tokens.
Floating tools order from an unordered registry.
Dynamic requestId inside JSON schema.
Rewriting the start of history instead of growing append-only.
Parallel fan-out before the shared prefix is warm.
Plain round-robin for long shared prefixes.

Checklist

✓Stable instructions and schemas are first.
✓Dynamic fields are late or outside the cacheable prefix.
✓Tools and response formats serialize deterministically.
✓Agent route logs include prefix_hash and tools_count.
✓Cache regressions alert on cached token drop and TTFT rise.

Example

In a multi-turn agent, the team reduced the tool subset on every step. Raw prompt length went down, but cached_tokens fell almost to zero: tool order and content kept changing the request prefix. A stable tool declaration with route-level allowed tools restored reuse without exposing every tool to the model.

Decision template

For a cache change, record before/after prompt shape, prefix hash stability, cached token distribution, TTFT, cost per task and quality impact.