LLM Observability Checklist

Problem

LLM observability is not "put it in Grafana". It must connect prompt, model, tokens, latency, cost, tool calls, evals, feedback and ownership.

Symptoms

You can see errors but not prompt versions.
You can see total tokens but not cached tokens.
You can see latency but not TTFT (time to first token) or the fallback path.
You can see traces but not scenario success.

Mental model

Every AI call should be debuggable as a product event, model event, cost event and quality event.

At minimum, traces should include request_id, user_id or tenant_id, scenario_id, model, provider, prompt version, system prompt version, input tokens, output tokens, cached tokens, total cost, TTFT, total latency, tool calls, fallback events, safety events, eval score, user feedback and error category.
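A minimal sketch of that record, assuming a Python service that emits one structured record per AI call. The class name LLMTrace, the unit suffixes (_usd, _ms) and the sample values are illustrative assumptions, not a standard schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LLMTrace:
    # Identity and product context
    request_id: str
    tenant_id: str  # or user_id, depending on the product
    scenario_id: str
    # Model context
    model: str
    provider: str
    prompt_version: str
    system_prompt_version: str
    # Token and cost accounting (units are assumptions)
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    total_cost_usd: float
    # Latency
    ttft_ms: float
    total_latency_ms: float
    # Events and quality signals; empty or None when nothing was recorded
    tool_calls: list = field(default_factory=list)
    fallback_events: list = field(default_factory=list)
    safety_events: list = field(default_factory=list)
    eval_score: Optional[float] = None
    user_feedback: Optional[str] = None
    error_category: Optional[str] = None


# Illustrative values only.
trace = LLMTrace(
    request_id="req-1", tenant_id="tenant-a", scenario_id="support_reply",
    model="example-model", provider="example-provider",
    prompt_version="v12", system_prompt_version="v3",
    input_tokens=1500, output_tokens=200, cached_tokens=1200,
    total_cost_usd=0.004, ttft_ms=320.0, total_latency_ms=2100.0,
)
record = asdict(trace)  # plain dict, ready for a log line or span attributes
```

Making the identity, model, token and latency fields required means a call site that forgets one fails at construction time instead of producing a silently incomplete trace.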

For self-hosted deployments and routing debugging, add route_lane, router_version, router_confidence, prompt_family, prefix_hash, tools_hash, tools_count, schema_hash, agent_step, static_tokens, dynamic_tokens, cache_read_input_tokens, cache_creation_input_tokens, cache_eligible_tokens, pool_id, replica_id, environment, queue_time, batch_size, decode_profile, gpu_type and routing_reason.

For self-hosted deployments, model and latency are not enough. You need to know which pool, replica and decode profile served the request, and how long it waited in the queue (queue_time). Otherwise p99 looks like a model problem when the root cause is mixed workload profiles.
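The effect can be sketched with a few lines of Python. The helper names (percentile, p99_by), the _ms suffixes and the sample data are assumptions made for illustration; a real setup would usually compute this in the metrics backend.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_by(traces, keys, metric):
    """Group traces by the given keys and return the p99 of one metric per group."""
    groups = defaultdict(list)
    for t in traces:
        groups[tuple(t[k] for k in keys)].append(t[metric])
    return {group: percentile(vals, 99) for group, vals in groups.items()}


# Illustrative data: one model served by two pools with different workloads.
traces = (
    [{"model": "m1", "pool_id": "chat", "decode_profile": "short",
      "queue_time_ms": 5, "total_latency_ms": 400}] * 10
    + [{"model": "m1", "pool_id": "batch", "decode_profile": "long",
        "queue_time_ms": 900, "total_latency_ms": 5000}] * 10
)

by_model = p99_by(traces, ["model"], "total_latency_ms")
by_pool = p99_by(traces, ["pool_id", "decode_profile"], "total_latency_ms")
queue_by_pool = p99_by(traces, ["pool_id"], "queue_time_ms")
```

Grouped by model alone, p99 is 5000 ms and looks like a model problem. Grouped by pool and decode profile, the chat pool sits at 400 ms, and the queue-time view shows that the batch pool is where requests wait.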

Metrics

Trace completeness, missing field rate, scenario latency, TTFT, TPOT (time per output token), cost per accepted outcome, fallback rate, tool error rate, cache hit rate, safety event rate and feedback score.
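Cost per accepted outcome is the least standard metric in this list, so here is a rough sketch of one way to compute it. The accepted flag and the total_cost_usd field are assumptions: what counts as accepted (a thumbs-up, a passed eval, a resolved ticket) has to be defined per scenario.

```python
from collections import defaultdict


def cost_per_accepted_outcome(traces):
    """Total cost per scenario divided by the number of accepted outcomes.

    Returns None for a scenario with spend but no accepted outcome,
    so that it stands out instead of being hidden as zero.
    """
    cost = defaultdict(float)
    accepted = defaultdict(int)
    for t in traces:
        cost[t["scenario_id"]] += t["total_cost_usd"]
        accepted[t["scenario_id"]] += 1 if t["accepted"] else 0
    return {s: (cost[s] / accepted[s] if accepted[s] else None) for s in cost}


# Illustrative data: rejected calls still cost money.
sample = [
    {"scenario_id": "summarize", "total_cost_usd": 0.25, "accepted": True},
    {"scenario_id": "summarize", "total_cost_usd": 0.25, "accepted": False},
    {"scenario_id": "summarize", "total_cost_usd": 0.5, "accepted": True},
    {"scenario_id": "extract", "total_cost_usd": 0.5, "accepted": False},
]
result = cost_per_accepted_outcome(sample)
```

The rejected call's cost stays in the numerator, which is the point of the metric: a cheaper model that is accepted less often can cost more per useful result.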

Average cached token rate is useful but dangerous, because a healthy average can hide a regression confined to one slice. Slice cache metrics by route, scenario, prompt_family, tools_hash and agent_step.
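A small sketch of why the average misleads. It assumes cached_tokens is reported as a subset of input_tokens, which depends on the provider; some APIs report cache reads separately, in which case the denominator needs adjusting. The function name and sample data are illustrative.

```python
from collections import defaultdict


def cached_token_rate(traces, keys=()):
    """cached_tokens / input_tokens, aggregated per slice."""
    cached = defaultdict(int)
    total = defaultdict(int)
    for t in traces:
        slice_key = tuple(t[k] for k in keys)
        cached[slice_key] += t["cached_tokens"]
        total[slice_key] += t["input_tokens"]
    return {k: cached[k] / total[k] for k in total if total[k]}


# Illustrative data: three healthy slices and one slice with a dead cache.
traces = [
    {"prompt_family": "support", "agent_step": 1, "input_tokens": 1000, "cached_tokens": 900},
    {"prompt_family": "support", "agent_step": 2, "input_tokens": 1000, "cached_tokens": 900},
    {"prompt_family": "search", "agent_step": 1, "input_tokens": 1000, "cached_tokens": 900},
    {"prompt_family": "search", "agent_step": 2, "input_tokens": 1000, "cached_tokens": 0},
]
overall = cached_token_rate(traces)
sliced = cached_token_rate(traces, ["prompt_family", "agent_step"])
```

The overall rate is 0.675, which may not trip an alert, while the slice ("search", 2) sits at 0.0 and the other three at 0.9.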

Trade-offs

Deep traces can expose sensitive data. Production observability needs redaction, retention rules and sanitized examples.
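As a rough illustration of redaction at write time, the sketch below drops raw prompt and completion text and masks email addresses in remaining string fields. The field names in DROP_FIELDS and the single email pattern are hypothetical; a production policy would cover more data types and be reviewed with whoever owns privacy.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Hypothetical policy: fields that must never be stored verbatim.
DROP_FIELDS = {"prompt_text", "completion_text"}


def redact(record):
    """Return a copy of the trace record that is safe to store."""
    clean = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            # Keep only the size, so length-related debugging still works.
            clean[key + "_chars"] = len(value)
        elif isinstance(value, str):
            clean[key] = EMAIL_RE.sub("[EMAIL]", value)
        else:
            clean[key] = value
    return clean


raw = {
    "request_id": "r1",
    "prompt_text": "hello bob@example.com",
    "error_message": "lookup failed for bob@example.com",
    "input_tokens": 5,
}
safe = redact(raw)
```

Redacting before the record leaves the process means retention rules only have to deal with data that was safe to keep in the first place.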

Anti-patterns

Logging full prompts without data boundaries.
Aggregating across scenarios until all useful signals disappear.
Tracking provider cost but not accepted outcomes.
Hiding fallback events inside retry code.
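The last anti-pattern has a simple counterpart: make the retry loop write to the trace. This is a sketch under assumptions; call_with_fallback, the (name, callable) provider pairs and the event shape are placeholders for whatever client code already exists.

```python
def call_with_fallback(providers, request, trace):
    """Try providers in order and record every failed attempt on the trace."""
    for name, call in providers:
        try:
            result = call(request)
        except Exception as exc:  # broad on purpose in this sketch
            trace["fallback_events"].append(
                {"provider": name, "error_category": type(exc).__name__}
            )
            continue
        trace["provider"] = name  # the provider that actually answered
        return result
    trace["error_category"] = "all_providers_failed"
    raise RuntimeError("all providers failed")


# Stand-in providers for illustration.
def primary(request):
    raise TimeoutError("slow upstream")


def secondary(request):
    return "ok:" + request


trace = {"fallback_events": []}
result = call_with_fallback(
    [("primary", primary), ("secondary", secondary)], "hi", trace
)
```

After the call, the trace shows both which provider served the request and that the primary timed out first, so fallback rate can be computed from traces rather than inferred from latency spikes.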

Checklist

✓ request_id, tenant_id and scenario_id are present.
✓ model, provider, prompt version and system prompt version are present.
✓ input, output and cached tokens are separated.
✓ TTFT, TPOT and total latency are separated.
✓ route_lane, router_version, prefix_hash, tools_hash and agent_step exist for agent scenarios.
✓ pool_id, replica_id, queue_time, batch_size and decode_profile exist for self-hosted.
✓ tool calls, fallback events, safety events and eval scores are visible.
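The checklist above can be turned into an automated check, for example in CI or as a sampled production job. The sketch below is one possible shape; the snake_case field names for TTFT, TPOT and latency and the is_agent / is_self_hosted flags are assumptions about how a trace is stored.

```python
BASE_FIELDS = {
    "request_id", "tenant_id", "scenario_id",
    "model", "provider", "prompt_version", "system_prompt_version",
    "input_tokens", "output_tokens", "cached_tokens",
    "ttft", "tpot", "total_latency",
}
AGENT_FIELDS = {"route_lane", "router_version", "prefix_hash", "tools_hash", "agent_step"}
SELF_HOSTED_FIELDS = {"pool_id", "replica_id", "queue_time", "batch_size", "decode_profile"}


def missing_fields(trace, is_agent=False, is_self_hosted=False):
    """Return the sorted names of required fields that are absent or None."""
    required = set(BASE_FIELDS)
    if is_agent:
        required |= AGENT_FIELDS
    if is_self_hosted:
        required |= SELF_HOSTED_FIELDS
    # A value of 0 (for example cached_tokens=0) counts as present.
    return sorted(f for f in required if trace.get(f) is None)


# Illustrative trace with every base field set and nothing else.
base_trace = {name: 0 for name in BASE_FIELDS}
ok = missing_fields(base_trace)
gaps = missing_fields(base_trace, is_self_hosted=True)
```

Dividing the number of traces with a non-empty result by the total gives a missing field rate that can be tracked per route.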

After moving to self-hosted

✓ TTFT is split by scenario.
✓ p95 and p99 latency are visible by pool.
✓ Queue wait time is not mixed with generation time.
✓ GPU utilization is shown next to latency, not alone.
✓ Batch backlog and backlog age are tracked.
✓ Timeouts and errors are split by model.
✓ Canary and stable deltas are visible in one review.
✓ Each model release has an eval run id.
✓ Cost is attributed by scenario, not only by model.

Example

If cache hit rate drops after a deployment, the trace should show whether the cause was prompt version, tools count, schema change, route, replica or traffic mix.
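A first-pass triage for this case might compare which values each candidate field takes before and after the deployment. The helper below is a simplified sketch: changed_fields and the field list are illustrative, and it only detects new or vanished values. A traffic-mix shift, where the same values appear in different proportions, needs a comparison of counts instead.

```python
CANDIDATE_FIELDS = [
    "prompt_version", "tools_count", "schema_hash", "route_lane", "replica_id",
]


def changed_fields(before, after, fields=CANDIDATE_FIELDS):
    """Return fields whose set of observed values differs between two samples."""
    changed = {}
    for name in fields:
        old = {t[name] for t in before}
        new = {t[name] for t in after}
        if old != new:
            changed[name] = (sorted(old), sorted(new))
    return changed


# Illustrative samples: only the number of tools changed with the deployment.
base = {"prompt_version": "v12", "tools_count": 4, "schema_hash": "abc",
        "route_lane": "default", "replica_id": "r-1"}
before = [dict(base), dict(base)]
after = [dict(base, tools_count=5), dict(base, tools_count=5)]
diff = changed_fields(before, after)
```

Here the result points at tools_count alone, which is a plausible cause: adding a tool changes the request prefix and can invalidate a prefix cache.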

Decision template

For every new route, define required trace fields, redaction policy, dashboard owner, alert thresholds and retention.
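One way to keep this template from staying on paper is to store it as a reviewed config object per route. The sketch below is an assumption about shape, not a prescribed format; RouteObservabilitySpec, its field names and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteObservabilitySpec:
    route: str
    required_trace_fields: tuple
    redaction_policy: str
    dashboard_owner: str
    alert_thresholds: dict
    retention_days: int

    def problems(self):
        """Return a list of template items that are still undefined."""
        issues = []
        if not self.required_trace_fields:
            issues.append("no required trace fields")
        if not self.redaction_policy:
            issues.append("no redaction policy")
        if not self.dashboard_owner:
            issues.append("no dashboard owner")
        if not self.alert_thresholds:
            issues.append("no alert thresholds")
        if self.retention_days <= 0:
            issues.append("no retention period")
        return issues


# Illustrative spec with the owner left blank.
spec = RouteObservabilitySpec(
    route="support_reply",
    required_trace_fields=("request_id", "scenario_id", "prompt_version"),
    redaction_policy="drop_raw_text",
    dashboard_owner="",
    alert_thresholds={"ttft_p95_ms": 800},
    retention_days=30,
)
issues = spec.problems()
```

A route whose spec returns a non-empty list, as this one does for the missing owner, can be blocked from rollout until the gap is filled.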