AI Quality Gate

Problem

"The model seems good" is not a quality process.

Symptoms

Model changes ship without scenario datasets.
Prompt edits are not regression-tested.
LLM-as-a-judge scores exist, but nobody trusts the failures they flag.
Production feedback never updates the dataset.

Mental model

The Quality Gate is not a single score. It is a loop that keeps the platform from degrading silently.

Architecture

Dataset -> eval suite -> error taxonomy -> prompt/model change -> regression -> canary -> production feedback -> dataset update.

Evals In Self-hosted Migration

When migrating a scenario, evals do not replace initial discovery. First, the team finds a model where the scenario can work at all. Then the platform starts a rough self-hosted version and checks that the success reproduces locally. Only after that do evals become a release gate: comparing quality, latency, cost and regressions against the current route.

Otherwise the team risks buying or renting GPUs out of curiosity rather than for a proven scenario fit.
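The release-gate comparison described above can be sketched as follows. This is a minimal illustration, assuming each route has already been summarized into a pass rate, a p95 latency and a cost per task. The `RouteStats` type, the `release_gate` function and every threshold are hypothetical, not part of any specific platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteStats:
    """Aggregated eval results for one route (hypothetical shape)."""
    pass_rate: float          # share of scenario examples passed, 0..1
    p95_latency_ms: float
    cost_per_task_usd: float


def release_gate(baseline: RouteStats, candidate: RouteStats,
                 max_pass_rate_drop: float = 0.02,
                 max_latency_ratio: float = 1.5) -> list[str]:
    """Return the reasons to block the release; an empty list means pass."""
    reasons = []
    if baseline.pass_rate - candidate.pass_rate > max_pass_rate_drop:
        reasons.append("quality regression")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        reasons.append("latency regression")
    if candidate.cost_per_task_usd >= baseline.cost_per_task_usd:
        reasons.append("no cost saving")
    return reasons


# Illustrative numbers only: the current route vs. a rough self-hosted candidate.
current = RouteStats(pass_rate=0.91, p95_latency_ms=1800, cost_per_task_usd=0.012)
self_hosted = RouteStats(pass_rate=0.84, p95_latency_ms=2100, cost_per_task_usd=0.004)
print(release_gate(current, self_hosted))
```

Returning a list of reasons instead of a single boolean keeps the failure readable: a candidate that is cheaper but loses quality is blocked for a named reason, not for an opaque score.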

Metrics

Measure scenario pass rate, critical error rate, regression delta, judge agreement, manual review load, canary stop events, feedback acceptance and cost per passed task.
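A few of these metrics can be computed directly from per-example eval records. The sketch below assumes a hypothetical `EvalResult` record that stores the final verdict, a critical-error flag, the judge verdict, a human verdict on the same example and the cost. Judge agreement is taken here as the share of examples where judge and human agree, which is one possible definition among several.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """One evaluated scenario example (hypothetical record shape)."""
    passed: bool           # final verdict for the example
    critical_error: bool   # failure falls into a critical category
    judge_passed: bool     # LLM-as-a-judge verdict
    human_passed: bool     # manual review verdict on the same example
    cost_usd: float


def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Aggregate per-example records into gate metrics."""
    n = len(results)
    if n == 0:
        raise ValueError("no eval results")
    passed = sum(r.passed for r in results)
    total_cost = sum(r.cost_usd for r in results)
    agree = sum(r.judge_passed == r.human_passed for r in results)
    return {
        "scenario_pass_rate": passed / n,
        "critical_error_rate": sum(r.critical_error for r in results) / n,
        "judge_agreement": agree / n,
        # Cost is spread over passed tasks only; failures still cost money.
        "cost_per_passed_task": total_cost / passed if passed else float("inf"),
    }


# Illustrative data only.
sample = [
    EvalResult(True, False, True, True, 0.01),
    EvalResult(True, False, True, True, 0.01),
    EvalResult(False, True, False, False, 0.01),
    EvalResult(False, False, True, False, 0.01),  # judge passed, human did not
]
print(summarize(sample))
```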

Trade-offs

Strict gates slow releases but prevent invisible degradation. Loose gates preserve speed but push quality risk to users. The right gate depends on scenario risk.
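One way to make "the right gate depends on scenario risk" concrete is to keep thresholds per risk tier instead of one global threshold. The tier names, the `GatePolicy` type and all numbers below are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    """Thresholds for one risk tier (hypothetical)."""
    min_pass_rate: float
    max_critical_error_rate: float


# Hypothetical tiers and numbers; real values depend on the scenario.
POLICIES = {
    "low": GatePolicy(min_pass_rate=0.80, max_critical_error_rate=0.05),
    "medium": GatePolicy(min_pass_rate=0.90, max_critical_error_rate=0.02),
    "high": GatePolicy(min_pass_rate=0.97, max_critical_error_rate=0.0),
}


def gate_passes(risk: str, pass_rate: float, critical_error_rate: float) -> bool:
    """Apply the thresholds of the scenario's risk tier."""
    policy = POLICIES[risk]
    return (pass_rate >= policy.min_pass_rate
            and critical_error_rate <= policy.max_critical_error_rate)


# The same eval result clears a low-risk gate and fails a high-risk one.
print(gate_passes("low", 0.85, 0.03), gate_passes("high", 0.85, 0.03))
```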

Smaller Models Require More Engineering

When a team moves from a frontier MaaS model to an open-source or self-hosted model, inference may become cheaper while scenario tuning becomes more expensive. Work that the large model used to absorb moves into prompts, context, validation, post-processing and evals.

The team usually has to strengthen:

prompt engineering;
context engineering;
scenario simplification;
post-processing;
validation;
model and quantization selection;
eval datasets;
sometimes fine-tuning.

Core risk

Savings on inference often come back as engineering work for the team.

Long-context And Router Evals

For long context, evaluate not only retrieval but reasoning through noise: conflicting fragments, outdated versions, distractors, multiple relevant documents and synthesis requirements.

For the Semantic Router, quality is not only answer quality but also whether the selected lane was right: direct_small should not answer when an agent is needed, and agentic should not run for a simple rewrite or summary.

Router eval metrics

Area | Metric | Meaning
direct | false_direct_rate | Router chose an undersized path.
agentic | false_agentic_rate | Router started an unnecessary agent loop.
cost | cost_saved_without_quality_loss | Savings without quality degradation.
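The two routing-error metrics can be computed from a labeled router dataset. The sketch below assumes each sample is an `(expected_lane, chosen_lane)` pair using the lane names `direct_small` and `agentic`. It also assumes a denominator the text does not specify: each rate is taken over the requests that needed the other lane.

```python
def _share(chosen_lanes: list[str], lane: str) -> float:
    """Share of routing decisions that picked `lane`; 0.0 for an empty list."""
    if not chosen_lanes:
        return 0.0
    return sum(chosen == lane for chosen in chosen_lanes) / len(chosen_lanes)


def router_error_rates(samples: list[tuple[str, str]]) -> dict[str, float]:
    """samples are (expected_lane, chosen_lane) pairs from a labeled dataset."""
    needed_agent = [chosen for expected, chosen in samples if expected == "agentic"]
    needed_direct = [chosen for expected, chosen in samples if expected == "direct_small"]
    return {
        # Needed an agent, but the router sent it down the small direct path.
        "false_direct_rate": _share(needed_agent, "direct_small"),
        # A direct answer was enough, but the router started an agent loop.
        "false_agentic_rate": _share(needed_direct, "agentic"),
    }


# Illustrative labels only.
labeled = [
    ("direct_small", "direct_small"),
    ("direct_small", "agentic"),
    ("direct_small", "direct_small"),
    ("direct_small", "direct_small"),
    ("agentic", "agentic"),
    ("agentic", "direct_small"),
]
print(router_error_rates(labeled))
```

Tracking the two rates separately matters because they fail in different directions: a high false_direct_rate hurts quality, while a high false_agentic_rate hurts cost and latency.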

Anti-patterns

Golden datasets that never receive production failures.
One aggregate score that hides critical categories.
Judge prompts that change without versioning.
Canary rollout without stop criteria.
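As a counterexample to the last anti-pattern, canary stop criteria can be written down as explicit, checkable values before the rollout starts. The `StopCriteria` type, the `should_stop` function and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StopCriteria:
    """Explicit canary stop conditions (hypothetical shape)."""
    max_critical_errors: int   # stop as soon as this count is exceeded
    max_error_rate: float      # share of failed requests, 0..1
    min_requests: int          # do not judge the rate on too few requests


def should_stop(criteria: StopCriteria, requests: int,
                errors: int, critical_errors: int) -> bool:
    """Decide whether the canary must be halted and rolled back."""
    if critical_errors > criteria.max_critical_errors:
        return True
    if requests >= criteria.min_requests and errors / requests > criteria.max_error_rate:
        return True
    return False


criteria = StopCriteria(max_critical_errors=0, max_error_rate=0.05, min_requests=200)
print(should_stop(criteria, requests=500, errors=40, critical_errors=0))
```

Critical errors are checked regardless of traffic volume, while the error rate is only evaluated once enough requests have been seen to make it meaningful.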

Checklist

✓ Scenario dataset exists and has owners.
✓ Critical failures are separated from cosmetic issues.
✓ Prompt, model and judge versions are tracked.
✓ Canary has stop criteria and rollback path.
✓ Before self-hosted migration, the current route has a quality baseline.
✓ A rough local candidate is compared with baseline before canary.
✓ Semantic Router is checked for false_direct and false_agentic.
✓ Long-context scenarios are checked for distractors, conflicting facts and stale context.
✓ Production feedback updates the dataset.

Example

A new summarization model can pass generic benchmarks and still fail the product scenario: the summary sounds fluent, but loses action items and next steps. The gate should check scenario examples, critical error categories and a task-level acceptance metric.
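A task-level acceptance metric for this example could look like the sketch below. It assumes each scenario example lists the action items a summary must retain and uses a naive case-insensitive substring match. That matching rule is a deliberate simplification for illustration; a real gate would more likely use a judge or structured extraction. All names are hypothetical.

```python
def missing_action_items(summary: str, required_items: list[str]) -> list[str]:
    """Return required action items that do not appear in the summary."""
    text = summary.lower()
    return [item for item in required_items if item.lower() not in text]


def task_acceptance_rate(examples: list[tuple[str, list[str]]]) -> float:
    """Share of summaries that retain every required action item."""
    if not examples:
        raise ValueError("no examples")
    accepted = sum(not missing_action_items(summary, items)
                   for summary, items in examples)
    return accepted / len(examples)


# Illustrative examples only: (model summary, required action items).
examples = [
    ("The team agreed to send the contract by Friday and schedule a follow-up call.",
     ["send the contract", "follow-up call"]),
    ("The meeting covered pricing and went well overall.",  # fluent, but drops the action item
     ["send the contract"]),
]
print(task_acceptance_rate(examples))
```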

Decision template

Every rollout should include dataset, metrics, threshold, regression delta, canary plan, fallback, owner and release note.
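The template can be kept as a small structured record so that an incomplete rollout request is easy to detect. The `RolloutDecision` type, the free-text field format and the example values below are assumptions; only the field names come from the list above.

```python
from dataclasses import dataclass, fields


@dataclass
class RolloutDecision:
    """Decision template as a record; every field is free text (hypothetical)."""
    dataset: str = ""
    metrics: str = ""
    threshold: str = ""
    regression_delta: str = ""
    canary_plan: str = ""
    fallback: str = ""
    owner: str = ""
    release_note: str = ""


def missing_fields(decision: RolloutDecision) -> list[str]:
    """Names of template fields that are still empty."""
    return [f.name for f in fields(decision)
            if not getattr(decision, f.name).strip()]


# Illustrative values only.
draft = RolloutDecision(dataset="support-summaries-v3", owner="ml-platform")
print(missing_fields(draft))
```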
