Sergei Notevskii
Gateway
Русская версия

Semantic Router

How to choose the execution path: small model, large model, RAG, agent or human review.

Applied
v0.1
Updated May 23, 2026
AI Platform Leads
Staff Engineers
Backend Engineers
semantic-router
routing
cost
agents
Saved only in this browser.

Problem

Many platforms route only by model_name: model, provider and fallback. In production, the expensive decision is not only which model to call, but which execution path to allow.

Symptoms

  • Every request goes to the agentic loop by default.
  • A simple rewrite starts tools, memory and long context.
  • The router chooses a model, but not whether RAG, reasoning or agentic execution is needed.
  • Prefix cache savings are lost because too much traffic enters the expensive path.

Mental model

Semantic Router is admission control before expensive execution. It decides whether a request should go to direct_small, direct_large, rag, agentic, human_review or policy denial.

Execution path

01

Product request

Scenario, user input and product context.

02

AI Gateway

Contract, quotas, policy and telemetry.

03

Semantic Router

Execution path, not only model choice.

04

Execution lane

direct_small, direct_large, RAG, agentic or human review.

05

Runtime

MaaS, self-hosted pool, batch or fallback.

Architecture

Route policy should be part of the scenario contract: primary lane, self-hosted candidate, fallback, canary share, queue limits, SLO, cost budget and stop criteria.

Route policy

lane

route_lane

direct_small, direct_large, rag, agentic, human_review.

quality

router evals

Errors across direct vs agentic and small vs large.

cost

cost budget

Cost cap, agent step count and fallback path.

latency

latency class

TTFT-critical, interactive, async or batch.

cache

prefix policy

Prompt family, tools_hash and expected hit rate.

safety

tool policy

Allowed tools, human confirmation and action denial.

Metrics

Track route_lane, router_version, router_confidence, false_direct_rate, false_agentic_rate, wrong_model_lane_rate, cost_saved, quality_delta, latency_delta, fallback_rate_by_lane and cache_hit_rate_by_lane.

Trade-offs

Semantic Router lowers cost when it confidently keeps simple requests away from expensive execution. But it has its own cost: classification latency, routing mistakes and the risk of silently degrading quality.

Anti-patterns

  • Every request goes to the agentic lane by default.
  • Router chooses the model but not the execution path.
  • Route metadata is inserted into the prompt prefix and breaks prefix cache.
  • Every intent gets a new system prompt.
  • No evals exist for routing mistakes.

Checklist

  • Scenario route_lane and lane-switching rules are documented.
  • Router does not write dynamic fields into the prompt prefix.
  • false_direct and false_agentic are evaluated.
  • Cost and quality are measured by lane.
  • agentic lane has step, tool and budget limits.

Example

"Rewrite this email shorter" should not start an agent loop with tools and memory. It can go to direct_small. "Find the latest customer actions, compare them with CRM and prepare follow-up" may require agentic or human_review because external state and actions are involved.

Decision template

For each scenario, define route_policy, route_lane, canary_share, fallback_condition, pool_id, queue_priority, latency_class, context_budget, tool_policy and required trace fields.

On this page