Semantic Router
How to choose the execution path: small model, large model, RAG, agent or human review.
Problem
Many platforms route only by model_name, which resolves to a model, a provider and a fallback. In production, the expensive decision is not just which model to call but which execution path to allow.
Symptoms
- Every request goes to the agentic loop by default.
- A simple rewrite starts tools, memory and long context.
- The router chooses a model, but not whether RAG, reasoning or agentic execution is needed.
- Prefix cache savings are lost because too much traffic enters the expensive path.
Mental model
Semantic Router is admission control before expensive execution. It decides whether a request should go to direct_small, direct_large, rag, agentic, human_review or policy denial.
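The admission decision can be sketched as a small function that returns the cheapest lane able to serve a request. This is a minimal illustration, not a reference implementation: the `RequestSignals` fields, the `choose_lane` name and the 0.7 confidence threshold are assumptions made for the example, and a real router would obtain these signals from a classifier or from rules.

```python
from dataclasses import dataclass
from enum import Enum


class Lane(str, Enum):
    DIRECT_SMALL = "direct_small"
    DIRECT_LARGE = "direct_large"
    RAG = "rag"
    AGENTIC = "agentic"
    HUMAN_REVIEW = "human_review"
    DENIED = "policy_denial"


@dataclass(frozen=True)
class RequestSignals:
    """Hypothetical signals a classifier might emit for one request."""
    violates_policy: bool = False
    needs_actions: bool = False         # tools with side effects or external state
    needs_retrieval: bool = False       # answer depends on a document corpus
    needs_deep_reasoning: bool = False
    confidence: float = 1.0             # classifier confidence in the signals


def choose_lane(s: RequestSignals, min_confidence: float = 0.7) -> Lane:
    """Return the cheapest lane that can plausibly serve the request."""
    if s.violates_policy:
        return Lane.DENIED
    if s.needs_actions:
        # An uncertain request that may take actions is escalated to a human.
        return Lane.AGENTIC if s.confidence >= min_confidence else Lane.HUMAN_REVIEW
    if s.needs_retrieval:
        return Lane.RAG
    if s.needs_deep_reasoning or s.confidence < min_confidence:
        # When unsure, prefer the larger model over silently degrading quality.
        return Lane.DIRECT_LARGE
    return Lane.DIRECT_SMALL
```

The checks run from the most restrictive outcome to the cheapest one, so direct_small is reached only when nothing argues for a more expensive lane.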
- 01 Product request: scenario, user input and product context.
- 02 AI Gateway: contract, quotas, policy and telemetry.
- 03 Semantic Router: execution path, not only model choice.
- 04 Execution lane: direct_small, direct_large, RAG, agentic or human review.
- 05 Runtime: MaaS, self-hosted pool, batch or fallback.
Architecture
Route policy should be part of the scenario contract: primary lane, self-hosted candidate, fallback, canary share, queue limits, SLO, cost budget and stop criteria.
- lane (route_lane): direct_small, direct_large, rag, agentic, human_review.
- quality (router evals): errors across direct vs agentic and small vs large.
- cost (cost budget): cost cap, agent step count and fallback path.
- latency (latency class): TTFT-critical, interactive, async or batch.
- cache (prefix policy): prompt family, tools_hash and expected hit rate.
- safety (tool policy): allowed tools, human confirmation and action denial.
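One way to make the route policy part of the scenario contract is to store it as a typed record and resolve the lane from it at request time. The sketch below is an assumption-heavy illustration: the `RoutePolicy` field names, the per-request USD budget unit and the hash-based canary bucketing are all choices made for this example, not a prescribed schema.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutePolicy:
    scenario: str
    primary_lane: str
    canary_lane: Optional[str]   # e.g. a self-hosted candidate
    canary_share: float          # fraction of traffic in [0, 1]
    fallback_lane: str
    max_queue_depth: int         # queue limit before falling back
    cost_budget_usd: float       # assumed unit: cost cap per request
    max_agent_steps: int


def lane_for(policy: RoutePolicy, request_id: str, queue_depth: int) -> str:
    """Resolve the lane for one request under the given policy."""
    if queue_depth > policy.max_queue_depth:
        return policy.fallback_lane
    # Stable hash: the same request id always falls into the same bucket.
    digest = hashlib.sha256(f"{policy.scenario}:{request_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32   # value in [0, 1)
    if policy.canary_lane is not None and bucket < policy.canary_share:
        return policy.canary_lane
    return policy.primary_lane
```

Hashing the request id rather than drawing a random number keeps canary assignment reproducible, which makes per-lane comparisons of cost and quality easier to audit.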
Metrics
Track route_lane, router_version, router_confidence, false_direct_rate, false_agentic_rate, wrong_model_lane_rate, cost_saved, quality_delta, latency_delta, fallback_rate_by_lane and cache_hit_rate_by_lane.
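The routing error rates can be computed offline from an eval set in which each request carries a labelled correct lane. The definitions below are assumptions, since the metrics are only named above: false_direct counts agentic requests sent to a direct lane, false_agentic counts direct requests sent to the agentic lane, wrong_model_lane counts a direct request sent to the other direct lane, and every rate is divided by the total number of samples.

```python
from dataclasses import dataclass
from typing import Dict, Sequence

DIRECT_LANES = {"direct_small", "direct_large"}


@dataclass(frozen=True)
class RoutedSample:
    chosen_lane: str    # lane the router picked
    correct_lane: str   # label from the offline eval set


def routing_error_rates(samples: Sequence[RoutedSample]) -> Dict[str, float]:
    """Compute routing error rates as fractions of all evaluated samples."""
    total = len(samples)
    counts = {"false_direct_rate": 0, "false_agentic_rate": 0, "wrong_model_lane_rate": 0}
    for s in samples:
        if s.chosen_lane in DIRECT_LANES and s.correct_lane == "agentic":
            counts["false_direct_rate"] += 1
        elif s.chosen_lane == "agentic" and s.correct_lane in DIRECT_LANES:
            counts["false_agentic_rate"] += 1
        elif (s.chosen_lane in DIRECT_LANES and s.correct_lane in DIRECT_LANES
              and s.chosen_lane != s.correct_lane):
            counts["wrong_model_lane_rate"] += 1
    # An empty eval set yields zero rates instead of dividing by zero.
    return {name: (n / total if total else 0.0) for name, n in counts.items()}
```

Reporting these rates per router_version makes it possible to see whether a router change traded false_agentic errors (wasted cost) for false_direct errors (lost quality).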
Trade-offs
Semantic Router lowers cost when it confidently keeps simple requests away from expensive execution. But it has its own cost: classification latency, routing mistakes and the risk of silently degrading quality.
Anti-patterns
- Every request goes to the agentic lane by default.
- Router chooses the model but not the execution path.
- Route metadata is inserted into the prompt prefix and breaks prefix cache.
- Every intent gets a new system prompt.
- No evals exist for routing mistakes.
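The prefix-cache anti-pattern can be avoided by keeping the prompt prefix identical across routing decisions and carrying route metadata outside the prompt. The sketch below assumes a hypothetical request shape with `prompt`, `prefix_hash` and `metadata` keys; the system prompt text and the tool list are placeholders, and real serving stacks differ in how they identify a cacheable prefix.

```python
import hashlib
import json

# Hypothetical shared prompt family: one system prompt and tool list per scenario.
SYSTEM_PROMPT = "You are an assistant for the support scenario."
TOOLS = [{"name": "crm_lookup"}]


def build_request(user_input: str, route_meta: dict) -> dict:
    """Build a request whose prefix does not depend on routing metadata."""
    prefix = SYSTEM_PROMPT + "\n" + json.dumps(TOOLS, sort_keys=True)
    return {
        # Dynamic content is appended only after the stable prefix.
        "prompt": prefix + "\n" + user_input,
        "prefix_hash": hashlib.sha256(prefix.encode()).hexdigest(),
        # Route metadata travels alongside the prompt, for traces only.
        "metadata": dict(route_meta),
    }
```

Two requests routed by different router versions then share the same prefix_hash, so a change in routing metadata does not by itself invalidate cached prefixes.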
Checklist
- ✓ Scenario route_lane and lane-switching rules are documented.
- ✓ Router does not write dynamic fields into the prompt prefix.
- ✓ false_direct and false_agentic are evaluated.
- ✓ Cost and quality are measured by lane.
- ✓ agentic lane has step, tool and budget limits.
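The last checklist item, limits on the agentic lane, can be enforced with a small guard that is charged before every tool call. This is a sketch under assumptions: `AgentBudget`, `BudgetExceeded` and the USD cost unit are hypothetical names chosen for the example, and the per-call cost is assumed to be known or estimated before the call is made.

```python
from dataclasses import dataclass
from typing import FrozenSet


class BudgetExceeded(Exception):
    """Raised when the agent loop would exceed its step or cost limit."""


@dataclass
class AgentBudget:
    max_steps: int
    max_cost_usd: float
    allowed_tools: FrozenSet[str]
    steps: int = 0
    cost_usd: float = 0.0

    def charge(self, tool: str, cost_usd: float) -> None:
        """Account for one tool call, or raise before the call is made."""
        if tool not in self.allowed_tools:
            raise PermissionError(f"tool not allowed in this lane: {tool}")
        if self.steps + 1 > self.max_steps:
            raise BudgetExceeded("step limit reached")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded("cost limit reached")
        # State is updated only after every check has passed.
        self.steps += 1
        self.cost_usd += cost_usd
```

When `BudgetExceeded` is raised, the caller can stop the loop and take the scenario's fallback path instead of continuing to spend.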
Example
"Rewrite this email shorter" should not start an agent loop with tools and memory. It can go to direct_small. "Find the latest customer actions, compare them with CRM and prepare follow-up" may require agentic or human_review because external state and actions are involved.
Decision template
For each scenario, define route_policy, route_lane, canary_share, fallback_condition, pool_id, queue_priority, latency_class, context_budget, tool_policy and required trace fields.
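A filled-in template can be checked mechanically before a scenario goes live. The sketch below assumes the template is stored as a plain dictionary keyed by the field names listed above, with `trace_fields` standing in for the required trace fields; the `missing_fields` helper and all example values are hypothetical.

```python
from typing import List

REQUIRED_FIELDS = (
    "route_policy", "route_lane", "canary_share", "fallback_condition",
    "pool_id", "queue_priority", "latency_class", "context_budget",
    "tool_policy", "trace_fields",
)


def missing_fields(template: dict) -> List[str]:
    """Return the required fields that are absent or left empty."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = template.get(name)
        # A numeric zero (e.g. canary_share = 0.0) counts as filled in.
        if value is None or value == "" or value == []:
            missing.append(name)
    return missing


# Hypothetical template for the email rewrite scenario.
email_rewrite = {
    "route_policy": "email_rewrite_v1",
    "route_lane": "direct_small",
    "canary_share": 0.0,
    "fallback_condition": "queue_limit_exceeded",
    "pool_id": "small-pool",
    "queue_priority": "normal",
    "latency_class": "interactive",
    "context_budget": 2048,
    "tool_policy": "no_tools",
    "trace_fields": ["route_lane", "router_version", "router_confidence"],
}
```

Calling `missing_fields(email_rewrite)` returns an empty list for this example; a template that omits `pool_id` would return `["pool_id"]`.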