Embeddings Serving

How to serve embeddings and rerankers as a platform layer with explicit quality, latency and lifecycle controls.

Problem

Embeddings and rerankers often sit in the shadow of LLMs, yet they determine retrieval quality, classification behavior, latency and cost for many AI scenarios.

Symptoms

RAG "answers poorly", but retrieval quality is not measured.
An embedding model changes without a reindex plan.
A classification model lives outside the model lifecycle.
A reranker is added, but the latency budget is not recalculated.

Mental model

An embedding model is not a utility. It is part of the model lifecycle: version, dataset, quality gate, index compatibility, rollback and cost profile.

Architecture

The layer includes an embedding API, batching, index compatibility checks, a reindex pipeline, reranker serving, model aliases, an evaluation dataset, a latency budget and trace fields for the retrieval step.
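
One way to make index compatibility explicit is to refuse an alias switch until a matching index exists. The sketch below is hypothetical: `EmbeddingModel`, `VectorIndex` and `AliasRegistry` are illustrative names, not part of any specific library, and the compatibility rule (same model name, version and dimensions) is an assumption about how a platform might define it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModel:
    name: str
    version: str
    dimensions: int


@dataclass(frozen=True)
class VectorIndex:
    name: str
    model_name: str      # model that produced the stored vectors
    model_version: str
    dimensions: int


def is_compatible(model: EmbeddingModel, index: VectorIndex) -> bool:
    # Assumed rule: vectors are only comparable when they come from
    # the same model, the same version, and have the same size.
    return (
        model.name == index.model_name
        and model.version == index.model_version
        and model.dimensions == index.dimensions
    )


class AliasRegistry:
    """Maps a stable alias (e.g. "docs-search") to a concrete model version."""

    def __init__(self) -> None:
        self._aliases: dict[str, EmbeddingModel] = {}

    def resolve(self, alias: str) -> EmbeddingModel:
        return self._aliases[alias]

    def point(self, alias: str, model: EmbeddingModel, index: VectorIndex) -> None:
        # Refuse to move the alias until an index built with this model exists.
        if not is_compatible(model, index):
            raise ValueError(
                f"index {index.name!r} was not built with {model.name}:{model.version}"
            )
        self._aliases[alias] = model
```

With a guard like this, moving an alias to a new model version fails loudly until the reindex pipeline has produced a matching index, and rollback is a matter of pointing the alias back at the previous model and index pair.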

Metrics

Track recall@k, precision@k, reranker lift, index freshness, embedding latency, reranker latency, batch size, cost per query, model version and reindex duration.
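
The first three metrics can be computed from a labeled evaluation set. The sketch below uses the standard set-based definitions of recall@k and precision@k; the definition of reranker lift as the recall@k difference between reranked and baseline order is an assumption, since the term has no single agreed formula. Function names and document IDs are illustrative.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of all relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k


def reranker_lift(
    baseline: list[str], reranked: list[str], relevant: set[str], k: int
) -> float:
    """Assumed definition: recall@k after reranking minus recall@k before."""
    return recall_at_k(reranked, relevant, k) - recall_at_k(baseline, relevant, k)


# Illustrative query: three relevant documents, five candidates.
relevant_ids = {"d1", "d4", "d8"}
baseline_order = ["d7", "d2", "d1", "d4", "d9"]
reranked_order = ["d1", "d4", "d2", "d7", "d9"]

lift = reranker_lift(baseline_order, reranked_order, relevant_ids, k=3)
```

In practice these values would be averaged over the whole evaluation dataset and reported per model version, so that a change in the embedding model or reranker shows up as a shift in the retrieval metrics rather than only in the final answer.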

Trade-offs

A stronger embedding model can require reindexing. A reranker improves quality, but adds latency and cost. A small model is faster, but can break long-tail retrieval.
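
The reranker trade-off becomes concrete once the latency budget is written down. The sketch below is a rough calculation only: the stage names and millisecond values are invented for illustration, and adding per-stage p95 values is an approximation, since percentiles do not add exactly.

```python
def remaining_budget_ms(slo_ms: float, stage_p95_ms: dict[str, float]) -> float:
    """Budget left after subtracting each stage's p95 latency from the SLO."""
    return slo_ms - sum(stage_p95_ms.values())


# Hypothetical numbers, not measurements.
scenario_slo_ms = 800.0
stages = {"embedding": 20.0, "vector_search": 40.0, "llm": 600.0}

before = remaining_budget_ms(scenario_slo_ms, stages)

# Adding a reranker changes the budget and must be re-checked against the SLO.
stages_with_reranker = {**stages, "reranker": 180.0}
after = remaining_budget_ms(scenario_slo_ms, stages_with_reranker)

fits_slo = after >= 0
```

With these illustrative values the scenario has 140 ms of headroom before the reranker and is 40 ms over budget after it, which is the kind of result that should trigger a decision: a smaller reranker, fewer candidates to rerank, or a revised SLO.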

Semantic Routing As The Next Optimization Layer

A semantic router can reduce cost when part of the traffic confidently goes to a cheaper model or a cached path. But it is not a free optimization: the router adds latency overhead, introduces classification errors and risks breaking the agent flow. Evaluate it as a model: route accuracy, added latency, fallback rate and cost per accepted outcome.
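
The four evaluation metrics named above can be computed from labeled routing logs. This is a minimal sketch under assumptions: `RoutedRequest` and its fields are hypothetical, "accepted" is whatever acceptance signal the scenario defines, and cost is expressed in arbitrary units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutedRequest:
    predicted_route: str      # route chosen by the router
    expected_route: str       # labeled correct route
    fell_back: bool           # request was retried on the fallback path
    accepted: bool            # final outcome was accepted
    cost: float               # total cost of serving the request
    router_latency_ms: float  # latency added by the routing step


def evaluate_router(requests: list[RoutedRequest]) -> dict[str, float]:
    if not requests:
        raise ValueError("no requests to evaluate")
    n = len(requests)
    accepted = sum(r.accepted for r in requests)
    total_cost = sum(r.cost for r in requests)
    return {
        "route_accuracy": sum(
            r.predicted_route == r.expected_route for r in requests
        ) / n,
        "fallback_rate": sum(r.fell_back for r in requests) / n,
        "mean_added_latency_ms": sum(r.router_latency_ms for r in requests) / n,
        # Cost of all traffic, including failures, per accepted outcome.
        "cost_per_accepted_outcome": (
            total_cost / accepted if accepted else float("inf")
        ),
    }
```

Dividing total cost by accepted outcomes, rather than by all requests, keeps misrouted and retried traffic visible: a router that sends too much traffic to the cheap path can lower cost per request while raising cost per accepted outcome.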

Anti-patterns

Changing an embedding model without a reindex strategy.
Evaluating only the final LLM answer, not the retrieval step.
Not logging retrieved document IDs.
Treating classifier serving as "not a platform concern".

Checklist

✓ Embedding model has an alias and a version.
✓ Reindex plan and rollback path exist.
✓ Retrieval quality is measured separately from the LLM answer.
✓ Reranker latency is part of the scenario SLO.
✓ Trace shows query, model version, retrieved IDs and reranker score.
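
The trace item in the checklist can be sketched as a structured record emitted once per retrieval step. The field names below are assumptions chosen to mirror the checklist, not a standard schema, and the JSON log line is only one possible output format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievalTrace:
    trace_id: str
    query: str
    embedding_model_alias: str
    embedding_model_version: str
    index_version: str
    retrieved_ids: list[str]
    reranker_version: Optional[str]   # None when no reranker ran
    reranker_scores: list[float]      # aligned with retrieved_ids


def to_log_line(trace: RetrievalTrace) -> str:
    """Serialize the trace as one JSON line with stable key order."""
    return json.dumps(asdict(trace), sort_keys=True)


# Hypothetical values for illustration.
example = RetrievalTrace(
    trace_id="t-001",
    query="how do I rotate an API key",
    embedding_model_alias="docs-search",
    embedding_model_version="2",
    index_version="docs-v2",
    retrieved_ids=["d1", "d4"],
    reranker_version="1",
    reranker_scores=[0.91, 0.47],
)
line = to_log_line(example)
```

Logging the retrieved IDs and the model and index versions together is what makes a failed query reproducible later: the same query can be replayed against the same index version and compared with the recorded result.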

Example

If a RAG scenario degrades after an embedding model change, do not check only the prompt. You need recall@k, index version, reindex completeness, reranker score distribution and traces for failed queries.
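
Reindex completeness from the list above can be checked mechanically if every stored vector records the model version that produced it. The sketch below assumes exactly that; the function names and the mapping from document ID to model version are hypothetical.

```python
def reindex_completeness(doc_versions: dict[str, str], target_version: str) -> float:
    """Fraction of documents whose vectors were produced by target_version."""
    if not doc_versions:
        return 1.0  # nothing to reindex
    done = sum(1 for v in doc_versions.values() if v == target_version)
    return done / len(doc_versions)


def stale_documents(doc_versions: dict[str, str], target_version: str) -> list[str]:
    """IDs still embedded with an older model version, sorted for stable output."""
    return sorted(
        doc_id for doc_id, v in doc_versions.items() if v != target_version
    )


# Hypothetical index state after a partial reindex to version "2".
index_state = {"d1": "2", "d2": "2", "d3": "1", "d4": "2"}

completeness = reindex_completeness(index_state, "2")
stale = stale_documents(index_state, "2")
```

An index that mixes vectors from two model versions is a plausible cause of degraded recall@k after a model change, so a completeness value below 1.0 is worth ruling out before looking at the prompt.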

Decision template

For an embedding route, document: model alias, index compatibility, reindex plan, eval dataset, latency SLO, rollback path, trace fields and owner.
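
The template can also be kept as a structured record so that missing entries are detectable in review or CI. This is a sketch under assumptions: `EmbeddingRouteDecision` and `missing_fields` are hypothetical names, and free-text strings stand in for whatever richer types a platform would use.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EmbeddingRouteDecision:
    model_alias: str = ""
    index_compatibility: str = ""
    reindex_plan: str = ""
    eval_dataset: str = ""
    latency_slo: str = ""
    rollback_path: str = ""
    trace_fields: str = ""
    owner: str = ""


def missing_fields(decision: EmbeddingRouteDecision) -> list[str]:
    """Names of template entries that are still empty or whitespace."""
    return [
        f.name for f in fields(decision) if not getattr(decision, f.name).strip()
    ]


# Hypothetical, partially filled decision record.
draft = EmbeddingRouteDecision(model_alias="docs-search", owner="search-platform")
todo = missing_fields(draft)
```

A non-empty result from `missing_fields` can block a route from going live until every item in the template has an answer.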
