Sergei Notevskii
Caching

Prefix Cache

How prompt shape, tool schemas and routing decide whether cache works in production.

Deep dive
v0.1
Updated May 23, 2026
AI Platform Leads
Staff Engineers
ML Platform Engineers
prefix-cache
prompt-cache
kv-cache
tools

Problem

Cache can be enabled while cached_tokens stays low. The cause is often the request shape, not the provider.

Symptoms

  • Timestamps, request IDs or tenant fields appear near the start.
  • Tool lists change order or content between turns.
  • Agent loops send shorter, dynamically assembled prompts that end up costing more once cache misses are counted.
  • Round-robin routing spreads warm prefixes across replicas.

Mental model

Prefix cache needs three things: the beginning must match, the request must hit a place where the prefix is warm, and the cache must live long enough to be reused.
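The first condition, a matching beginning, can be illustrated with a minimal sketch. It assumes the cache compares requests from the start and stops at the first difference. The segment names are hypothetical stand-ins for token blocks, and real providers match on tokens or fixed-size blocks rather than whole segments.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Return how many leading items are identical in both sequences."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


# Hypothetical prompt segments standing in for token blocks.
STABLE = ["system_prompt", "tool_schemas", "history"]

# Dynamic field first: the two requests diverge at position 0.
early_a = ["ts=10:00:00"] + STABLE
early_b = ["ts=10:00:05"] + STABLE

# Dynamic field last: the whole stable part is shared.
late_a = STABLE + ["ts=10:00:00"]
late_b = STABLE + ["ts=10:00:05"]

print(shared_prefix_len(early_a, early_b))  # 0
print(shared_prefix_len(late_a, late_b))    # 3
```

The sketch says nothing about the other two conditions: whether the request lands where the prefix is warm, and whether the entry survives until reuse.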

Architecture

Stabilize system prompt, tool schemas and response formats. Move dynamic fields late or into metadata. Use canonical serialization. Preserve append-only conversation growth when possible. Route repeated prefixes with affinity or prefix-aware routing.
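One way canonical serialization might look, as a sketch rather than a prescribed implementation. Sorting tools by name is an assumption; any fixed order works. The function names and the 16-character hash length are illustrative choices, not part of any provider API.

```python
import hashlib
import json


def canonical_tools(tools: list[dict]) -> str:
    """Serialize tool schemas with a fixed tool order and a fixed key order."""
    ordered = sorted(tools, key=lambda tool: tool["name"])
    return json.dumps(ordered, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def build_prefix(system_prompt: str, tools: list[dict]) -> str:
    """Stable part of the request: no timestamps, request IDs or tenant fields."""
    return system_prompt + "\n" + canonical_tools(tools)


def prefix_hash(prefix: str) -> str:
    """Short fingerprint for logs; identical prefixes always give identical hashes."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:16]


# Same tools, different list order and different key order.
tools_a = [
    {"name": "search", "parameters": {"type": "object"}},
    {"name": "fetch", "parameters": {"type": "object"}},
]
tools_b = [
    {"parameters": {"type": "object"}, "name": "fetch"},
    {"parameters": {"type": "object"}, "name": "search"},
]

hash_a = prefix_hash(build_prefix("You are a helper.", tools_a))
hash_b = prefix_hash(build_prefix("You are a helper.", tools_b))
print(hash_a == hash_b)  # True
```

Dynamic fields such as the user message or trace IDs would be appended after this prefix or carried in request metadata, so they never change the hashed part.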

Metrics

Track cached tokens, cache read input tokens, prefix hash, tools count, schema version, TTFT, route, replica, prompt version, warmup count and cache eviction indicators.
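A per-request log record covering part of this list could be sketched as below. The field names are hypothetical and should be mapped to whatever the provider or serving stack actually reports. The cached ratio is a derived value, not a provider field.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CacheLogRecord:
    """One row per model request; field names are illustrative."""
    prefix_hash: str
    prompt_version: str
    schema_version: str
    tools_count: int
    route: str
    replica: str
    input_tokens: int
    cached_tokens: int
    ttft_ms: float

    @property
    def cached_ratio(self) -> float:
        """Share of input tokens served from cache (0.0 when there is no input)."""
        if self.input_tokens == 0:
            return 0.0
        return self.cached_tokens / self.input_tokens


record = CacheLogRecord(
    prefix_hash="3f9a1c0de4b27a51",
    prompt_version="v12",
    schema_version="tools-7",
    tools_count=18,
    route="support",
    replica="replica-b",
    input_tokens=12000,
    cached_tokens=9000,
    ttft_ms=310.0,
)

print(record.cached_ratio)  # 0.75
print(asdict(record)["replica"])  # replica-b
```

Grouping such records by prefix_hash and replica shows whether a low cached ratio comes from a changing prefix or from a warm prefix spread across replicas.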

Trade-offs

A large stable tool list can be cheaper than a small dynamic list in an agent loop. But large tool lists can also reduce model tool-choice quality, especially on smaller models.
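The cost side of this trade-off can be checked with rough arithmetic. Every number below is an assumption chosen for illustration: a 10-step loop, cached input priced at 10% of uncached input, 500 new history tokens per step, and a tool list rendered early enough that changing it invalidates the whole prefix. Cache-write surcharges and output tokens are ignored.

```python
def trajectory_cost(steps: list[tuple[int, int]], cached_price_ratio: float) -> float:
    """Input cost of a trajectory, measured in uncached-token units.

    Each step is (input_tokens, cached_tokens).
    """
    total = 0.0
    for input_tokens, cached_tokens in steps:
        uncached = input_tokens - cached_tokens
        total += uncached + cached_tokens * cached_price_ratio
    return total


def stable_steps(n: int, fixed_tokens: int, growth: int) -> list[tuple[int, int]]:
    """Append-only growth: each step reuses the entire previous request."""
    steps = []
    previous_input = 0
    for i in range(n):
        input_tokens = fixed_tokens + growth * i
        steps.append((input_tokens, previous_input))
        previous_input = input_tokens
    return steps


def dynamic_steps(n: int, fixed_tokens: int, growth: int) -> list[tuple[int, int]]:
    """Tool list changes every step: nothing is reused."""
    return [(fixed_tokens + growth * i, 0) for i in range(n)]


# Large stable prompt (9000 fixed tokens) vs small dynamic prompt (3000 fixed tokens).
stable = trajectory_cost(stable_steps(10, 9000, 500), cached_price_ratio=0.1)
dynamic = trajectory_cost(dynamic_steps(10, 3000, 500), cached_price_ratio=0.1)
print(stable < dynamic)  # True
```

Under these assumptions the stable variant costs about 23,400 units against 52,500 for the dynamic one, even though each of its requests is three times larger at step one. The sketch does not model the quality side of the trade-off.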

Agentic Tool Surface

In an agent, do not optimize a single request in isolation from the whole trajectory. Dynamic tool selection can look cheaper on one step, but if it changes the early prefix, the agent pays for cache misses on every following step. Stabilize the prefix first, then reduce prompt size.

Pattern | Risk | Better
Dynamic tools on every step | new prefix_hash | stable tools + masking
Different tool order | cache miss without semantic change | canonical order
request_id in schema | unique prefix | trace metadata outside prompt
route-specific system prompt | fragmentation | prompt families
tool filtering without evals | cheaper but lower quality | router eval + cache metrics
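The "stable tools + masking" row can be sketched as a fixed declaration plus a per-route allow-list. This is one possible reading of masking, enforced when a tool call is dispatched. How the restriction is communicated to the model is provider-specific and not shown. Tool names, routes and function names are hypothetical.

```python
# Hypothetical tool names and routes, for illustration only.
ALL_TOOLS = ("get_order", "refund_order", "search_docs")

ROUTE_ALLOWED = {
    "support": frozenset({"get_order", "search_docs"}),
    "billing": frozenset({"get_order", "refund_order"}),
}


def declared_tools(route: str) -> tuple[str, ...]:
    """Every route declares the same tools in the same order, so the prefix is shared."""
    return ALL_TOOLS


def dispatch(route: str, tool_name: str, handlers: dict, **arguments):
    """Run a tool call only if the route allows it; otherwise return an error result."""
    if tool_name not in ROUTE_ALLOWED.get(route, frozenset()):
        return {"error": f"tool {tool_name!r} is not allowed on route {route!r}"}
    return handlers[tool_name](**arguments)


handlers = {"refund_order": lambda order_id: {"refunded": order_id}}

print(dispatch("billing", "refund_order", handlers, order_id="42"))
print("error" in dispatch("support", "refund_order", handlers, order_id="42"))  # True
```

The allow-list changes per route without touching the declared tools, so prefix_hash stays the same across routes and steps.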

Self-hosted Nuance

In a self-hosted boundary, cache hit rate depends not only on prompt shape, but also on scheduler behavior, sticky routing, replica count, prefill/decode split and warmup policy. After scale-out, plain round-robin can hurt cacheability more than a prompt change.
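A minimal sketch of prefix affinity, using rendezvous (highest-random-weight) hashing as one possible technique. The source does not prescribe a specific algorithm. The replica names are hypothetical, and the sketch ignores load, so a real router would likely combine it with capacity limits for hot prefixes.

```python
import hashlib


def pick_replica(prefix_hash: str, replicas: list[str]) -> str:
    """Send every request with the same prefix hash to the same replica."""
    if not replicas:
        raise ValueError("no replicas available")

    def score(replica: str) -> int:
        digest = hashlib.sha256(f"{prefix_hash}|{replica}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The replica with the highest score for this prefix wins.
    return max(replicas, key=score)


replicas = ["replica-a", "replica-b", "replica-c"]
first = pick_replica("3f9a1c0de4b27a51", replicas)
second = pick_replica("3f9a1c0de4b27a51", replicas)
print(first == second)  # True
```

With this scheme, adding a replica only moves the prefixes that now score highest on the new replica. The others keep their warm placement, which plain round-robin does not guarantee.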

Anti-patterns

  • Current time in the first system tokens.
  • Unstable tool order from an unordered registry.
  • Dynamic requestId inside JSON schema.
  • Rewriting the start of history instead of growing append-only.
  • Parallel fan-out before the shared prefix is warm.
  • Plain round-robin for long shared prefixes.
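The fan-out anti-pattern has a simple counterpart: send one request first, wait for it, then launch the rest. The sketch below assumes the cache entry is usable once the first response returns, which depends on the provider or serving stack. call_model is a hypothetical async client passed in by the caller, and fake_call_model is a stand-in used only to show the ordering.

```python
import asyncio


async def warm_then_fan_out(call_model, shared_prefix: str, suffixes: list[str]) -> list:
    """Warm the shared prefix with one request, then run the remaining ones in parallel."""
    if not suffixes:
        return []
    first = await call_model(shared_prefix + suffixes[0])
    rest = await asyncio.gather(
        *(call_model(shared_prefix + suffix) for suffix in suffixes[1:])
    )
    return [first, *rest]


# Stand-in for a real client: reports whether the prefix was already warm.
PREFIX = "system prompt and tool schemas\n"
warm_prefixes: set[str] = set()


async def fake_call_model(prompt: str) -> bool:
    was_warm = PREFIX in warm_prefixes
    await asyncio.sleep(0)  # simulated network wait
    warm_prefixes.add(PREFIX)
    return was_warm


hits = asyncio.run(warm_then_fan_out(fake_call_model, PREFIX, ["q1", "q2", "q3"]))
print(hits)  # [False, True, True]
```

The price is one request of added latency before the fan-out starts, so the pattern pays off mainly when the shared prefix is long.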

Checklist

  • Stable instructions and schemas are first.
  • Dynamic fields are late or outside the cacheable prefix.
  • Tools and response formats serialize deterministically.
  • Agent route logs include prefix_hash and tools_count.
  • Cache regressions alert on cached token drop and TTFT rise.
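The last checklist item could be implemented along these lines. The thresholds, the use of medians and the sample format are all assumptions chosen for the sketch. A production alert would also need minimum sample sizes and per-route baselines.

```python
from statistics import median


def cache_regression(
    baseline: list[tuple[float, float]],
    current: list[tuple[float, float]],
    max_ratio_drop: float = 0.2,
    max_ttft_rise: float = 1.3,
) -> list[str]:
    """Compare two windows of (cached_ratio, ttft_ms) samples and list the regressions."""
    base_ratio = median(ratio for ratio, _ in baseline)
    base_ttft = median(ttft for _, ttft in baseline)
    cur_ratio = median(ratio for ratio, _ in current)
    cur_ttft = median(ttft for _, ttft in current)

    reasons = []
    if base_ratio - cur_ratio > max_ratio_drop:
        reasons.append("cached token ratio dropped")
    if cur_ttft > base_ttft * max_ttft_rise:
        reasons.append("TTFT rose")
    return reasons


baseline = [(0.80, 300.0), (0.90, 280.0), (0.85, 310.0)]
current = [(0.10, 900.0), (0.05, 950.0), (0.20, 870.0)]

print(cache_regression(baseline, current))
# ['cached token ratio dropped', 'TTFT rose']
```

Alerting on both signals together helps separate a real cache regression from a latency spike that has another cause.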

Example

In a multi-turn agent, the team reduced the tool subset on every step. Raw prompt length went down, but cached_tokens fell almost to zero: tool order and content kept changing the request prefix. A stable tool declaration with route-level allowed tools restored reuse without exposing every tool to the model.

Decision template

For a cache change, record before/after prompt shape, prefix hash stability, cached token distribution, TTFT, cost per task and quality impact.
