Sergei Notevskii
Inference

STT Serving

How to treat speech-to-text as part of the AI platform, not a side service.

Applied
v0.1
Updated May 23, 2026
AI Platform Leads
ML Platform Engineers
Backend Engineers
stt
inference
latency

Problem

STT is often treated as preprocessing before an LLM. In production, it is a separate inference workload with its own queues, latency budget, quality metrics and cost profile.

Symptoms

  • Call summaries are poor, but the root cause is transcript quality, not the LLM.
  • Batch throughput is high, but interactive latency is unusable.
  • Audio metadata, transcript quality and downstream AI outcome are not linked.

Mental model

STT is first-class model serving. It needs its own SLOs, quality checks, telemetry, capacity and fallback.

Architecture

The layer includes audio intake, normalization, chunking, model workers, language hints, diarization or speaker labels, transcript storage boundary, quality checks and trace linking to downstream LLM scenarios.
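As a minimal sketch of the trace-linking part, the record below carries audio metadata and serving details under the same trace ID as the downstream LLM request. The class name, field names and the confidence-based quality proxy are assumptions made for illustration, not a prescribed schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SttTraceRecord:
    """Hypothetical per-request record emitted by the STT layer."""

    trace_id: str         # shared with the downstream LLM scenario
    audio_seconds: float  # duration of the normalized audio
    sample_rate_hz: int
    language_hint: str
    model_version: str
    chunk_count: int
    quality_proxy: float  # assumed here: mean token confidence in 0..1


def to_log_line(record: SttTraceRecord) -> str:
    """Serialize the record as one JSON line for the tracing backend."""
    return json.dumps(asdict(record), sort_keys=True)


record = SttTraceRecord(
    trace_id="t-1",
    audio_seconds=312.4,
    sample_rate_hz=16000,
    language_hint="en",
    model_version="stt-2026-05",
    chunk_count=11,
    quality_proxy=0.87,
)
log_line = to_log_line(record)
```

Keeping the record flat and JSON-serializable makes it easy to join with LLM-side traces on `trace_id`.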

Metrics

Track audio duration, realtime factor (processing time divided by audio duration), queue time, segment latency, a word error rate proxy, language mix, retry rate, model version, GPU/CPU utilization and cost per accepted transcript.
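Two of these metrics are simple ratios, sketched below. The code assumes the common convention that realtime factor is processing time divided by audio duration, so values below 1 mean faster than realtime. The function names are illustrative, and what counts as an "accepted" transcript is left to the team's own quality gate.

```python
def realtime_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than realtime."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def cost_per_accepted_transcript(total_cost: float, accepted_count: int) -> float:
    """Total serving cost divided by transcripts that passed quality checks."""
    if accepted_count <= 0:
        raise ValueError("accepted_count must be positive")
    return total_cost / accepted_count


rtf = realtime_factor(processing_seconds=30.0, audio_seconds=120.0)
unit_cost = cost_per_accepted_transcript(total_cost=12.0, accepted_count=48)
```

Dividing by accepted transcripts rather than all transcripts makes rejected or retried audio visible as a cost increase.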

Trade-offs

Batch throughput and interactive latency conflict: larger batches use the hardware better, but each request waits longer. A stronger model can break the cost profile. Aggressive chunking improves speed, but can hurt punctuation, context and downstream summaries.
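One common way to soften the chunking trade-off is to overlap adjacent chunks so that words at a boundary appear in both. The sketch below only computes chunk boundaries in seconds. The function name and the example sizes are assumptions, and merging the overlapping transcripts is out of scope here.

```python
def chunk_bounds(
    total_seconds: float, chunk_seconds: float, overlap_seconds: float
) -> list[tuple[float, float]]:
    """Return (start, end) pairs covering the audio with overlapping chunks."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    bounds = []
    step = chunk_seconds - overlap_seconds  # how far each chunk start advances
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


bounds = chunk_bounds(total_seconds=70.0, chunk_seconds=30.0, overlap_seconds=5.0)
```

Smaller chunks with less overlap reduce segment latency. Larger chunks with more overlap preserve context at the price of repeated compute.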

Anti-patterns

  • Measuring only total processing time.
  • Not linking transcript quality to downstream task quality.
  • Ignoring language mix and noisy audio.
  • Treating STT as "not part of the AI platform".

Checklist

  • STT has its own latency and throughput budget.
  • Transcript quality is linked to downstream outcome.
  • Audio metadata enters the trace.
  • Model version and serving config are logged.
  • Fallback path exists for noisy or long audio.
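The fallback item can be as simple as an explicit routing rule. The sketch below is one hypothetical policy: the route names, the signal-to-noise estimate and both thresholds are assumptions to be replaced with values measured on real traffic.

```python
def pick_route(
    audio_seconds: float,
    snr_db: float,
    *,
    max_interactive_seconds: float = 60.0,  # assumed threshold
    min_snr_db: float = 10.0,               # assumed threshold
) -> str:
    """Choose an STT route from audio length and estimated noise level."""
    if snr_db < min_snr_db:
        return "robust"       # noisy audio goes to a slower, more robust path
    if audio_seconds > max_interactive_seconds:
        return "batch"        # long audio leaves the interactive queue
    return "interactive"


short_clean = pick_route(audio_seconds=12.0, snr_db=25.0)
long_clean = pick_route(audio_seconds=1800.0, snr_db=25.0)
short_noisy = pick_route(audio_seconds=12.0, snr_db=4.0)
```

Logging the chosen route alongside the model version keeps fallback decisions visible in the trace.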

Example

If a call summary degrades, do not start by rewriting the LLM prompt. First check the STT route, language hints, chunking, diarization, realtime factor and transcript quality on the failed cases.
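A first pass over failed cases can be a plain count per STT attribute, which shows whether failures cluster on one route or language hint. The sketch assumes failed cases are available as dictionaries with hypothetical keys such as `route` and `language_hint`.

```python
from collections import Counter


def failures_by(field: str, failed_cases: list[dict]) -> list[tuple[str, int]]:
    """Count failed cases per value of one STT attribute, most frequent first."""
    return Counter(case[field] for case in failed_cases).most_common()


# Illustrative data only; real cases would come from the trace store.
failed_cases = [
    {"route": "interactive", "language_hint": "en"},
    {"route": "interactive", "language_hint": "auto"},
    {"route": "interactive", "language_hint": "auto"},
    {"route": "batch", "language_hint": "auto"},
]
by_route = failures_by("route", failed_cases)
by_language = failures_by("language_hint", failed_cases)
```

If one value dominates, inspect that STT configuration before changing anything on the LLM side.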

Self-hosting STT provides data control and also makes it possible to add domain signals: extra classes, quality features, emotions or specialized hints. Evaluate such experiments on quality, and also on latency regression, realtime factor and downstream LLM impact.

Decision template

For an STT route, document: model, hardware, chunking, language policy, latency SLO, quality proxy, fallback, trace fields and owner.
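The template can be kept as a small typed record so that incomplete decisions are easy to spot in review. This is a sketch: the class name, the field types and the example values are assumptions, and only the field list comes from the template above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SttRouteDecision:
    """One documented STT route; every field should be filled in."""

    model: str
    hardware: str
    chunking: str
    language_policy: str
    latency_slo: str
    quality_proxy: str
    fallback: str
    trace_fields: tuple[str, ...]
    owner: str


def missing_fields(decision: SttRouteDecision) -> list[str]:
    """Names of fields left empty, in declaration order."""
    return [f.name for f in fields(decision) if not getattr(decision, f.name)]


# Example values are placeholders, not recommendations.
draft = SttRouteDecision(
    model="example-stt-model",
    hardware="1x GPU",
    chunking="30 s chunks, 5 s overlap",
    language_policy="hint from call metadata",
    latency_slo="",
    quality_proxy="mean token confidence",
    fallback="batch route",
    trace_fields=(),
    owner="platform-team",
)
gaps = missing_fields(draft)
```

A check like this can run in CI against the route's config file so that a route cannot ship without an owner or an SLO.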
