Prefix Cache
How prompt shape, tool schemas and routing decide whether cache works in production.
Problem
Cache can be enabled while cached_tokens stays low. The failure is often not the provider, but the request shape.
Symptoms
- Timestamps, request IDs or tenant fields appear near the start.
- Tool lists change order or content between turns.
- Agent loops use shorter dynamic prompts that cost more after cache misses.
- Round-robin routing spreads warm prefixes across replicas.
Mental model
Prefix cache needs three things: the beginning must match, the request must hit a place where the prefix is warm, and the cache must live long enough to be reused.
Architecture
Stabilize system prompt, tool schemas and response formats. Move dynamic fields late or into metadata. Use canonical serialization. Preserve append-only conversation growth when possible. Route repeated prefixes with affinity or prefix-aware routing.
Metrics
Track cached tokens, cache read input tokens, prefix hash, tools count, schema version, TTFT, route, replica, prompt version, warmup count and cache eviction indicators.
Trade-offs
A large stable tool list can be cheaper than a small dynamic list in an agent loop. But large tool lists can also reduce model tool-choice quality, especially on smaller models.
Agentic Tool Surface
In an agent, do not optimize one request in isolation from the whole trajectory. Dynamic tool selection can look cheaper on one step, but if it changes the early prefix, the agent pays for cache misses on every following step. Stabilize prefix first, then reduce prompt size.
| Pattern | Risk | Better |
|---|---|---|
| Dynamic tools on every step | new prefix_hash | stable tools + masking |
| Different tool order | cache miss without semantic change | canonical order |
| request_id in schema | unique prefix | trace metadata outside prompt |
| route-specific system prompt | fragmentation | prompt families |
| tool filtering without evals | cheaper but lower quality | router eval + cache metrics |
Self-hosted Nuance
In a self-hosted boundary, cache hit rate depends not only on prompt shape, but also scheduler behavior, sticky routing, replica count, prefill/decode split and warmup policy. After scale-out, plain round-robin can hurt cacheability more than a prompt change.
Anti-patterns
- Current time in the first system tokens.
- Floating
toolsorder from an unordered registry. - Dynamic
requestIdinside JSON schema. - Rewriting the start of history instead of growing append-only.
- Parallel fan-out before the shared prefix is warm.
- Plain round-robin for long shared prefixes.
Checklist
- ✓Stable instructions and schemas are first.
- ✓Dynamic fields are late or outside the cacheable prefix.
- ✓Tools and response formats serialize deterministically.
- ✓Agent route logs include prefix_hash and tools_count.
- ✓Cache regressions alert on cached token drop and TTFT rise.
Example
In a multi-turn agent, the team reduced the tool subset on every step. Raw prompt length went down, but cached_tokens fell almost to zero: tool order and content kept changing the request prefix. A stable tool declaration with route-level allowed tools restored reuse without exposing every tool to the model.
Decision template
For a cache change, record before/after prompt shape, prefix hash stability, cached token distribution, TTFT, cost per task and quality impact.