Inference Runtime
How to reason about production serving for LLMs, STT, embeddings and rerankers.
Problem
Serving a model is not the same as operating inference as a platform capability.
Symptoms
- Throughput looks good in a benchmark but TTFT is bad in product.
- GPU utilization is high but cost per successful task is worse than MaaS.
- New models are launched by copying flags from chat history.
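The second symptom comes down to simple arithmetic. The sketch below compares cost per accepted outcome for a self-hosted pool and a MaaS endpoint over the same window. Every number in it (GPU count, hourly price, per-task price, acceptance rates) is a made-up assumption for illustration, not a measurement.

```python
def cost_per_accepted_outcome(total_cost: float, tasks: int, acceptance_rate: float) -> float:
    """Total spend divided by the number of tasks that passed the quality gate."""
    accepted = tasks * acceptance_rate
    if accepted <= 0:
        raise ValueError("no accepted outcomes in this window")
    return total_cost / accepted


# Hypothetical one-hour window, all figures assumed:
# 8 GPUs at 2.50 per GPU-hour, 4,000 tasks, 80% pass the quality gate.
self_hosted = cost_per_accepted_outcome(total_cost=8 * 2.50, tasks=4_000, acceptance_rate=0.80)

# Same 4,000 tasks on a MaaS endpoint at 0.004 per task, 95% pass the gate.
maas = cost_per_accepted_outcome(total_cost=4_000 * 0.004, tasks=4_000, acceptance_rate=0.95)

print(f"self-hosted: {self_hosted:.5f} per accepted task")
print(f"MaaS:        {maas:.5f} per accepted task")
```

With these assumed inputs the self-hosted pool is fully busy and still more expensive per accepted task, because the denominator is accepted outcomes rather than tokens or GPU-hours.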
Mental model
An inference runtime is the product of model, hardware, scheduler, batching, memory, cache, routing and workload shape. Changing any one of them changes the result.
Architecture
The runtime layer includes OpenAI-compatible serving, model workers, batching, KV-cache, prefix cache, quantization, speculative decoding, multi-GPU serving, autoscaling, capacity planning and non-LLM models such as STT, embeddings and rerankers.
Metrics
Track time to first token (TTFT), time per output token (TPOT), total latency, tokens per second, queue time, batch size, cache hit rate, GPU memory, utilization, error rate, cold starts and cost per accepted outcome.
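As a minimal sketch, the per-request latency metrics can be derived from four timestamps. The `RequestTrace` type and its field names are hypothetical. The definitions used here (TTFT measured from arrival at the gateway, TPOT averaged over the tokens after the first) are one common convention, and teams may measure from a different point.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request timestamps, in seconds."""
    arrived: float       # request received by the gateway
    scheduled: float     # request left the queue and entered a batch
    first_token: float   # first output token emitted
    finished: float      # last output token emitted
    output_tokens: int


def latency_metrics(t: RequestTrace) -> dict[str, float]:
    # TPOT covers only the decode phase, so the first token is excluded.
    decode_tokens = max(t.output_tokens - 1, 1)
    return {
        "queue_time": t.scheduled - t.arrived,
        "ttft": t.first_token - t.arrived,
        "tpot": (t.finished - t.first_token) / decode_tokens,
        "total_latency": t.finished - t.arrived,
    }


trace = RequestTrace(arrived=0.0, scheduled=0.25, first_token=0.75,
                     finished=4.75, output_tokens=201)
metrics = latency_metrics(trace)
print(metrics)
```

Keeping queue time separate from TTFT shows whether a slow first token comes from waiting for a slot or from prefill itself.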
Capacity Is Not The Sum Of GPUs
A cluster can look sufficient by throughput and still fail p99 latency if different request profiles share one pool.
- Fast chat: stable TTFT and short context.
- Long context: separate prefill, KV-cache memory and quality checks.
- Batch processing: queue, backlog and cost matter more than interactivity.
- TTFT-critical: dedicated pool, queue limits and its own SLO.
A shared pool often optimizes average utilization while breaking p99 and TTFT for flows where first-token latency is critical.
A dedicated pool costs more in infrastructure, but can be cheaper than trying to fix p99 with global settings that conflict with other workload profiles.
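One way to make the problem visible is to check the SLO per profile instead of looking at an average. The sketch below uses made-up TTFT samples in which 3% of fast-chat requests waited behind long prefills in a shared pool. The profile names, sample values and SLO thresholds are all assumptions.

```python
import math
from statistics import mean


def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = math.ceil(99 * len(ordered) / 100)
    return ordered[rank - 1]


def slo_report(ttft_by_profile: dict[str, list[float]],
               p99_slo_by_profile: dict[str, float]) -> dict[str, dict]:
    report = {}
    for profile, samples in ttft_by_profile.items():
        observed = p99(samples)
        report[profile] = {"p99": observed,
                           "ok": observed <= p99_slo_by_profile[profile]}
    return report


# Hypothetical TTFT samples in seconds, collected from one shared pool.
ttft_by_profile = {
    "fast_chat": [0.2] * 970 + [2.5] * 30,  # 3% stuck behind long prefills
    "batch": [4.0] * 100,
}
p99_slo_by_profile = {"fast_chat": 0.5, "batch": 30.0}  # assumed SLOs

fast_chat_mean = mean(ttft_by_profile["fast_chat"])
report = slo_report(ttft_by_profile, p99_slo_by_profile)
print(f"fast_chat mean TTFT: {fast_chat_mean:.3f}s")
print(report)
```

In this made-up data the mean TTFT for fast chat stays under the 0.5 s target while its p99 does not, which is the failure that an average-utilization view hides.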
Context Window Is Not Working Context
Advertised max context answers only one question: how many tokens the model can accept. In production, three other questions matter more: how much context fits on the chosen hardware, how much prefill latency it creates and whether quality holds at long lengths.
One million tokens in a spec does not mean one million useful tokens in a scenario. Long windows need separate evals: quality by length, prefill latency, KV-cache memory and cost per accepted outcome. See the public note: Why one million tokens do not solve context.
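The hardware-fit question can be estimated before any benchmark. For standard multi-head or grouped-query attention, the KV cache stores one key and one value vector per layer, per KV head, per token. The sketch below applies that formula to a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache). It ignores allocator overhead and does not cover architectures that compress or window the cache, so treat the result as a rough estimate.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """KV-cache bytes per token: keys and values for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(kv_budget_bytes: int, per_token_bytes: int) -> int:
    """Total tokens that fit in the cache budget, summed over all live sequences."""
    return kv_budget_bytes // per_token_bytes


GIB = 2**30

# Hypothetical model shape; not taken from any specific model card.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

one_million_ctx_gib = per_token * 1_000_000 / GIB
tokens_in_40_gib = max_cached_tokens(40 * GIB, per_token)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{one_million_ctx_gib:.1f} GiB for a single 1M-token sequence")
print(f"{tokens_in_40_gib} tokens fit in a 40 GiB KV budget")
```

Under these assumptions a single one-million-token sequence needs roughly 122 GiB of cache, while a 40 GiB budget holds about 328 thousand tokens shared across every concurrent request.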
Trade-offs
Higher throughput can increase latency. Quantization can reduce memory and cost but may degrade quality. Larger context can reduce truncation but increase prefill and cache pressure.
Anti-patterns
- Treating vLLM flags as folklore instead of versioned recipes.
- Choosing hardware by brand debates rather than workload envelope.
- Serving LLMs but ignoring embeddings, STT and rerankers as platform citizens.
Checklist
- ✓ Every serving recipe is versioned and reproducible.
- ✓ Model launch includes hardware, flags, expected throughput and latency.
- ✓ Capacity planning uses real scenario traffic shape.
- ✓ Cache budget is sized for the working set, not only max context.
- ✓ Separate profiles exist for short-context, long-context and agentic workloads.
- ✓ Latency-critical flows are not mixed with batch-heavy and long-context work without an explicit decision.
- ✓ Maximum context window is not used as the default path.
- ✓ Capacity planning includes prefill, KV-cache and expected prefix reuse.
- ✓ Rollout has fallback and quality comparison against the previous alias.
Example
A vLLM recipe should capture model, tensor parallelism, max context, batching limits, tool parser flags, expected VRAM and throughput-vs-latency notes so the next launch is not rebuilt from memory.
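A minimal sketch of such a recipe as a versioned record that renders its own launch command. The `ServingRecipe` class, the model name and all values are hypothetical. The flag names follow the vLLM CLI as commonly documented, but flags change between releases, so verify them against the version the recipe pins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServingRecipe:
    """Hypothetical recipe record; store it in version control next to the eval results."""
    recipe_version: str
    vllm_version: str              # the pinned runtime version the flags were tested on
    model: str
    tensor_parallel_size: int
    max_model_len: int
    max_num_seqs: int
    max_num_batched_tokens: int
    tool_call_parser: Optional[str]
    expected_vram_gib: float
    notes: str                     # throughput-vs-latency observations

    def to_command(self) -> list[str]:
        cmd = [
            "vllm", "serve", self.model,
            "--tensor-parallel-size", str(self.tensor_parallel_size),
            "--max-model-len", str(self.max_model_len),
            "--max-num-seqs", str(self.max_num_seqs),
            "--max-num-batched-tokens", str(self.max_num_batched_tokens),
        ]
        if self.tool_call_parser:
            cmd += ["--enable-auto-tool-choice",
                    "--tool-call-parser", self.tool_call_parser]
        return cmd


# All values below are placeholders for illustration.
recipe = ServingRecipe(
    recipe_version="1",
    vllm_version="X.Y.Z",
    model="example-org/example-model",
    tensor_parallel_size=2,
    max_model_len=32768,
    max_num_seqs=64,
    max_num_batched_tokens=8192,
    tool_call_parser="hermes",
    expected_vram_gib=70.0,
    notes="Raising max_num_seqs improved throughput but hurt TTFT in testing.",
)
print(" ".join(recipe.to_command()))
```

Generating the command from the record keeps the launched flags and the documented flags from drifting apart.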
Decision template
Record model, hardware, runtime, serving flags, workload, latency SLO, quality gate, cache behavior, capacity envelope and rollback path.
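The template is easier to enforce when a launch is blocked until every field is filled. The sketch below is one possible check. The field names are a direct transcription of the list above, and the draft record contents are invented for illustration.

```python
REQUIRED_FIELDS = (
    "model", "hardware", "runtime", "serving_flags", "workload",
    "latency_slo", "quality_gate", "cache_behavior",
    "capacity_envelope", "rollback_path",
)


def missing_fields(record: dict[str, str]) -> list[str]:
    """Return template fields that are absent or blank, in template order."""
    return [name for name in REQUIRED_FIELDS
            if not str(record.get(name, "")).strip()]


# Hypothetical draft record with two gaps.
draft = {
    "model": "example-org/example-model",
    "hardware": "2x 80 GB GPU",
    "runtime": "vLLM, pinned version",
    "serving_flags": "see recipe v1",
    "workload": "fast chat, short context",
    "latency_slo": "p99 TTFT <= 0.5 s",
    "quality_gate": "",
    "cache_behavior": "prefix cache on, shared system prompt",
    "capacity_envelope": "up to 64 concurrent sequences",
}

gaps = missing_fields(draft)
print("missing:", gaps)
```

A check like this can run in CI against the recipe repository so that an incomplete decision record fails review before rollout.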