Inference Economics
Why AI cost must be measured by accepted outcomes, not raw token price.
Problem
LLM cost is often reduced to tokens * price. That misses cache hits, retries, fallbacks, failed tasks, latency impact and engineering cost.
Symptoms
- A cheaper model increases retries.
- A shorter prompt loses prefix cache and becomes more expensive.
- Finance sees provider cost but product cannot see cost per scenario.
Mental model
Total AI cost is provider cost plus GPU amortization, engineering cost, observability and evals, quality loss, latency impact, incident cost and opportunity cost.
Larger Models Are Not Always Linearly More Expensive
If a model needs 8x more GPUs, the scenario is not automatically 8x more expensive. Real unit cost depends on throughput, batchability, cache hit rate, answer length, retry count, result quality and accepted outcome rate.
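As a rough illustration of why GPU count alone does not set unit cost, the sketch below divides hourly fleet cost by accepted outcomes per hour. The function name, the GPU price, the throughput figures and the acceptance rates are all hypothetical, and the sketch deliberately ignores cache hit rate, answer length and retries.

```python
def gpu_cost_per_accepted_outcome(gpu_count, gpu_hour_price, requests_per_hour, accept_rate):
    """Hourly fleet cost divided by accepted outcomes per hour."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    if not 0 < accept_rate <= 1:
        raise ValueError("accept_rate must be in (0, 1]")
    hourly_cost = gpu_count * gpu_hour_price
    accepted_per_hour = requests_per_hour * accept_rate
    return hourly_cost / accepted_per_hour


# Hypothetical numbers: the larger model needs 8x the GPUs, but batches
# better (4x throughput) and produces more accepted outcomes.
small = gpu_cost_per_accepted_outcome(
    gpu_count=1, gpu_hour_price=2.0, requests_per_hour=1000, accept_rate=0.5
)
large = gpu_cost_per_accepted_outcome(
    gpu_count=8, gpu_hour_price=2.0, requests_per_hour=4000, accept_rate=0.8
)

print(f"small: ${small:.4f}  large: ${large:.4f}  ratio: {large / small:.2f}x")
```

Under these made-up inputs the 8-GPU deployment costs about 1.25x more per accepted outcome, not 8x, because it serves more requests and more of them are accepted.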
The opposite effect matters too: larger models usually need not only production instances, but also test, debug and staging capacity. These resources do not serve user traffic, but they belong to scenario economics.
Effective cost =
S * ((1 - h) * P_miss + h * P_hit)
+ D * P_miss
+ O * P_out
Where S is the token count of the stable repeated part of the prompt, D is the token count of the dynamic tail, O is output tokens, h is the cache hit rate, P_miss is the regular input price, P_hit is the cached input price and P_out is the output price.
The model price list does not know your hit rate. Comparing models only by input/output price is incomplete math.
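The effective cost formula can be sketched in a few lines of Python. The function mirrors the symbols S, D, O, h, P_miss, P_hit and P_out; the token counts, prices per million tokens and hit rates are invented for illustration and do not come from any real price list.

```python
def effective_cost(stable, dynamic, output, hit_rate, p_miss, p_hit, p_out):
    """Cache-adjusted cost of one request, in dollars.

    Prices are assumed to be quoted in dollars per million tokens.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    stable_cost = stable * ((1 - hit_rate) * p_miss + hit_rate * p_hit)
    dynamic_cost = dynamic * p_miss
    output_cost = output * p_out
    return (stable_cost + dynamic_cost + output_cost) / 1_000_000


# Same request shape for both models: S=8000, D=500, O=300 tokens.
# Model A has the higher list price but a 90% cache hit rate.
cost_a = effective_cost(8000, 500, 300, hit_rate=0.9, p_miss=3.0, p_hit=0.3, p_out=15.0)
# Model B is cheaper on paper but never hits the cache in this scenario.
cost_b = effective_cost(8000, 500, 300, hit_rate=0.0, p_miss=2.0, p_hit=0.2, p_out=10.0)

print(f"model A: ${cost_a:.5f} per request")
print(f"model B: ${cost_b:.5f} per request")
```

With these assumed inputs the model with the cheaper price list ends up roughly twice as expensive per request, which is the gap a price table alone cannot show.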
Architecture
Cost attribution needs gateway metadata, provider pricing, cached token fields, scenario IDs, quality outcomes, retries, fallback events and business acceptance signals.
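One possible shape for a per-call attribution record is sketched below. All field names, the `Pricing` table and the model name are hypothetical; in particular, providers differ on whether reported input tokens already include cached tokens, so the sketch assumes the gateway has split them into uncached and cached counts.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    # Dollars per million tokens (assumed unit).
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


@dataclass(frozen=True)
class CallRecord:
    scenario_id: str
    owner: str
    model: str
    uncached_input_tokens: int
    cached_input_tokens: int
    output_tokens: int
    is_retry: bool = False
    is_fallback: bool = False
    accepted: Optional[bool] = None  # business acceptance signal, if known


def call_cost(rec: CallRecord, pricing: dict[str, Pricing]) -> float:
    """Provider cost of a single call, in dollars."""
    p = pricing[rec.model]
    return (
        rec.uncached_input_tokens * p.input_per_mtok
        + rec.cached_input_tokens * p.cached_input_per_mtok
        + rec.output_tokens * p.output_per_mtok
    ) / 1_000_000


def cost_by_scenario(records, pricing) -> dict[str, float]:
    """Sum call costs per scenario, retries and fallbacks included."""
    totals: dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.scenario_id] += call_cost(rec, pricing)
    return dict(totals)


pricing = {"model-a": Pricing(3.0, 0.3, 15.0)}
records = [
    CallRecord("crm_email", "sales-team", "model-a", 1000, 0, 200),
    CallRecord("crm_email", "sales-team", "model-a", 100, 900, 200, is_retry=True),
]
totals = cost_by_scenario(records, pricing)
print(totals)
```

Keeping retries and fallbacks as ordinary records with flags means they are counted in scenario cost by default and can still be filtered out for analysis.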
Production is not the whole fleet
In self-hosted inference, production GPUs are not the full cost. Staging, test, debug, canary and reserve capacity do not directly serve user requests, but they still belong to the scenario cost.
Scenario cost =
production inference
+ non-production capacity
+ reserve
+ eval and benchmark runs
+ engineering time
+ operations
Compare the cost of a successful answer inside the required SLA, not dollars per GPU or dollars per million tokens.
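A minimal sketch of the scenario cost sum, assuming monthly figures and a count of answers that were both successful and delivered within the SLA. Every dollar amount and the answer count are placeholders, not measurements.

```python
def cost_per_successful_answer(monthly_costs: dict[str, float], answers_within_sla: int) -> float:
    """Total scenario cost divided by successful answers delivered within SLA."""
    if answers_within_sla <= 0:
        raise ValueError("answers_within_sla must be positive")
    return sum(monthly_costs.values()) / answers_within_sla


# Placeholder monthly costs in dollars for one self-hosted scenario.
monthly_costs = {
    "production_inference": 40_000.0,
    "non_production_capacity": 8_000.0,
    "reserve": 4_000.0,
    "eval_and_benchmark_runs": 2_000.0,
    "engineering_time": 20_000.0,
    "operations": 6_000.0,
}
answers_within_sla = 400_000

full = cost_per_successful_answer(monthly_costs, answers_within_sla)
production_only = monthly_costs["production_inference"] / answers_within_sla

print(f"production only: ${production_only:.2f} per answer")
print(f"full scenario:   ${full:.2f} per answer")
```

In this made-up case, counting only production GPUs would understate the cost per successful answer by half.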
Metrics
The strongest metric is cost per accepted outcome: accepted CRM email, correct transcript, useful summary, task passing quality gate or agent loop that completes within budget.
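Cost per accepted outcome can be computed by charging every attempt, including retries and tasks that were never accepted, against the tasks that were. The `Task` type and the per-attempt costs below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    attempt_costs: list[float]  # dollars per attempt, retries included
    accepted: bool              # did the final result pass the quality gate?


def cost_per_accepted_outcome(tasks: list[Task]) -> float:
    """Total spend across all attempts divided by the number of accepted tasks."""
    total = sum(sum(t.attempt_costs) for t in tasks)
    accepted = sum(1 for t in tasks if t.accepted)
    if accepted == 0:
        return float("inf")  # money was spent, nothing was accepted
    return total / accepted


tasks = [
    Task(attempt_costs=[0.01], accepted=True),
    Task(attempt_costs=[0.01, 0.01], accepted=True),         # one retry
    Task(attempt_costs=[0.01, 0.01, 0.01], accepted=False),  # failed after retries
]

total_calls = sum(len(t.attempt_costs) for t in tasks)
cost_per_call = sum(sum(t.attempt_costs) for t in tasks) / total_calls
per_outcome = cost_per_accepted_outcome(tasks)

print(f"cost per API call:         ${cost_per_call:.3f}")
print(f"cost per accepted outcome: ${per_outcome:.3f}")
```

Here each call costs one cent, yet each accepted outcome costs about three cents once retries and the failed task are counted.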
Trade-offs
Optimizing for cost per token can hurt quality. Optimizing for the best model can hurt throughput. Shortening or reshuffling prompts can break prefix caching. Mature economics balances successful outcomes, latency and operating risk.
Anti-patterns
- Choosing models by price table only.
- Ignoring cached input token pricing.
- Treating retries as invisible.
- Counting successful API calls instead of successful user tasks.
Checklist
- ✓ Every route has scenario_id and owner metadata.
- ✓ Cost reports separate input, cached input and output tokens.
- ✓ Retries and fallbacks are included in scenario cost.
- ✓ Quality failures are visible in cost reviews.
- ✓ Unit cost is calculated by scenario, not model size or GPU count.
- ✓ Non-production capacity and rollout/canary reserve are included in cost.
- ✓ The dashboard can show cost per accepted outcome.
Example
Two models can have the same raw input and output token count. If one preserves cached prefixes across agent turns and the other misses cache due to tool drift, their effective cost can diverge sharply.
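The divergence can be simulated with a simplified agent loop in which each turn resends the whole prior context as a prefix. The sketch assumes the prefix is billed at the cached price when caching holds and at the regular price when it does not; it ignores any cache write surcharge, and all token counts and prices are invented.

```python
def agent_run_cost(new_input_per_turn, output_per_turn, p_miss, p_hit, p_out, prefix_cached):
    """Dollar cost of a multi-turn agent run; prices are per million tokens."""
    prefix_tokens = 0  # context carried over from earlier turns
    total = 0.0
    for new_input in new_input_per_turn:
        prefix_price = p_hit if prefix_cached else p_miss
        total += prefix_tokens * prefix_price  # resent context
        total += new_input * p_miss            # tokens new in this turn
        total += output_per_turn * p_out       # generated tokens
        prefix_tokens += new_input + output_per_turn
    return total / 1_000_000


turns = [2000, 500, 500, 500]  # new input tokens per turn
prices = dict(p_miss=3.0, p_hit=0.3, p_out=15.0)

# Identical raw token counts; only cache behavior differs.
stable_prefix = agent_run_cost(turns, 200, prefix_cached=True, **prices)
tool_drift = agent_run_cost(turns, 200, prefix_cached=False, **prices)

print(f"prefix preserved: ${stable_prefix:.5f}")
print(f"cache missed:     ${tool_drift:.5f}")
```

Even over four short turns the run that misses cache costs nearly twice as much under these assumptions, and the gap widens as the context grows.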
Decision template
Write cost decisions as: scenario, model/provider options, raw token cost, cache-adjusted cost, retry rate, quality pass rate, latency, accepted outcome rate and owner.
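The template can also be kept as a small structured record so that decisions are comparable across scenarios. The class names, the latency field and the ranking helper are illustrative assumptions; in particular, the helper approximates cost per accepted outcome as cache-adjusted cost times (1 + retry rate) divided by accepted outcome rate, which is a simplification rather than a standard formula.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str
    raw_token_cost: float        # dollars per request at list price
    cache_adjusted_cost: float   # dollars per request after cache discount
    retry_rate: float            # average extra attempts per request
    quality_pass_rate: float
    latency_s: float
    accepted_outcome_rate: float

    def approx_cost_per_accepted_outcome(self) -> float:
        # Simplifying assumption: retries cost the same as first attempts.
        return self.cache_adjusted_cost * (1 + self.retry_rate) / self.accepted_outcome_rate


@dataclass
class CostDecision:
    scenario: str
    owner: str
    options: list[ModelOption]

    def cheapest_option(self) -> ModelOption:
        return min(self.options, key=lambda o: o.approx_cost_per_accepted_outcome())


decision = CostDecision(
    scenario="crm_email",
    owner="sales-platform",
    options=[
        ModelOption("model-a", 0.020, 0.010, 0.1, 0.90, 2.5, 0.8),
        ModelOption("model-b", 0.008, 0.006, 0.5, 0.60, 1.5, 0.5),
    ],
)
best = decision.cheapest_option()
print(best.name, round(best.approx_cost_per_accepted_outcome(), 5))
```

With these placeholder values the option with the higher per-request cost still wins once retries and acceptance are factored in.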