MaaS vs Self-hosted

A strategy decision for managed APIs, self-hosted inference and hybrid AI platforms.

Problem

MaaS vs self-hosted is usually framed as a religious choice: a closed API or your own GPUs. In production it is a strategy decision, and each option has different failure modes.

Symptoms

The team wants cheaper tokens but has no capacity model.
Leadership wants data control but underestimates reliability work.
Product teams need model choice but lack a gateway contract.

Mental model

Use managed APIs for speed, breadth and research loops. Use self-hosted when control, data boundary, economics, latency or product strategy justify owning inference operations.

The main distinction: MaaS vs self-hosted is not a provider choice. It is an operating-model choice for a specific AI scenario.

Do not start a migration by buying GPUs. First, the product team finds a model where the scenario works at all, through MaaS, OpenRouter or another external provider. Then the platform team stands up a rough self-hosted candidate, checks that quality is reproducible, builds latency and cost profiles and runs evals. Only then does it start a canary or provision a dedicated pool.

Migration playbook

Discovery: MaaS, OpenRouter or another external provider.
Draft self-hosted: rough launch without production promises.
Baseline: quality, latency and cost of the current route.
Evals: baseline comparison and stop criteria.
Canary: traffic share, fallback and incident owner.

A good migration is not replacing model_name. It proves that the scenario keeps quality, SLO and economics inside the new boundary.
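The "keeps quality, SLO and economics" check can be written down as an explicit gate that compares a candidate route against the measured baseline. The following is a minimal sketch, not a prescribed implementation: the `RouteProfile` fields, the function name and every threshold are hypothetical and should come from the scenario's own stop criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteProfile:
    """Measured profile of one route for one scenario (illustrative fields)."""
    quality_pass_rate: float   # share of eval cases passing the quality gate, 0..1
    p95_latency_ms: float      # 95th percentile end-to-end latency
    cost_per_accepted: float   # currency units per accepted outcome


def migration_blockers(baseline, candidate, *, max_quality_drop=0.01,
                       latency_slo_ms=2000.0, max_cost_ratio=1.0):
    """Return the reasons the candidate must not take traffic (empty list = go)."""
    blockers = []
    if candidate.quality_pass_rate < baseline.quality_pass_rate - max_quality_drop:
        blockers.append("quality")
    if candidate.p95_latency_ms > latency_slo_ms:
        blockers.append("latency")
    if candidate.cost_per_accepted > baseline.cost_per_accepted * max_cost_ratio:
        blockers.append("cost")
    return blockers


# Hypothetical numbers: current MaaS route vs a self-hosted candidate.
baseline = RouteProfile(quality_pass_rate=0.94, p95_latency_ms=1800.0, cost_per_accepted=0.012)
candidate = RouteProfile(quality_pass_rate=0.935, p95_latency_ms=1500.0, cost_per_accepted=0.009)
print(migration_blockers(baseline, candidate))  # [] means the canary may start
```

Returning the list of failed dimensions, rather than a single boolean, keeps the stop criteria visible to whoever owns the canary.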

Architecture

Option	Use when	Trade-off
MaaS	Fast iteration, broad model access, uncertain demand.	Less control over routing, cache internals and provider economics.
Self-hosted	Stable demand, data boundary, latency/cost control or custom serving needs.	You own GPU capacity, uptime, upgrades and incidents.
Hybrid	Production needs control but research still needs model breadth.	Requires gateway, routing policy and clear model lifecycle.

Scenario-level Decision

The mistake is deciding "we are self-hosted now" or "we are MaaS now." The right level is a concrete AI scenario: data, SLA, volume, model quality, engineering cost and processing mode.

Condition	Usually better
Spiky demand	MaaS
Frontier model required	MaaS
Data can be de-identified	MaaS or hybrid
Data cannot leave the boundary	self-hosted or on-premise
Stable high-volume workload	self-hosted
Task is not urgent	batch or deferred processing
Model customization is required	self-hosted
No evals and MLOps yet	be careful: self-hosted is premature
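Part of the table above can be encoded as a per-scenario function, which makes the rule order explicit and reviewable. This is a sketch under assumptions: the `Scenario` fields are hypothetical names, the precedence (data boundary first, then frontier model or spiky demand, then volume or customization) is one possible ordering that the table itself does not prescribe, the MaaS default is also an assumption, and the de-identification and batch rows are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Conditions for one AI scenario (hypothetical field names)."""
    spiky_demand: bool = False
    needs_frontier_model: bool = False
    data_must_stay_in_boundary: bool = False
    stable_high_volume: bool = False
    needs_customization: bool = False
    has_evals_and_mlops: bool = True


def recommend(s: Scenario):
    """Return (recommendation, warnings) for a single scenario."""
    # Assumed precedence: a hard data boundary outranks every other condition.
    if s.data_must_stay_in_boundary:
        choice = "self-hosted or on-premise"
    elif s.needs_frontier_model or s.spiky_demand:
        choice = "MaaS"
    elif s.stable_high_volume or s.needs_customization:
        choice = "self-hosted"
    else:
        choice = "MaaS"  # assumed default when no condition applies

    warnings = []
    if choice.startswith("self-hosted") and not s.has_evals_and_mlops:
        warnings.append("no evals or MLOps yet: self-hosted is premature")
    return choice, warnings


print(recommend(Scenario(stable_high_volume=True, has_evals_and_mlops=False)))
```

The function takes one scenario at a time on purpose: the decision is never made for the whole company at once.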

Metrics

Compare cost per accepted outcome, latency distribution, availability, quality gate pass rate, cache hit rate, utilization, engineering load and incident risk.
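Cost per accepted outcome divides total spend by the number of results that pass the quality gate, not by the number of requests. A minimal sketch with hypothetical numbers shows why a cheaper route can lose on this metric:

```python
def cost_per_accepted_outcome(total_cost: float, requests: int, pass_rate: float) -> float:
    """Total spend divided by the number of outcomes that passed the quality gate."""
    accepted = requests * pass_rate
    if accepted <= 0:
        raise ValueError("no accepted outcomes: cost per accepted outcome is undefined")
    return total_cost / accepted


# Hypothetical routes: the first is cheaper in total but fails the gate more often.
cheap_route = cost_per_accepted_outcome(total_cost=80.0, requests=10_000, pass_rate=0.70)
pricey_route = cost_per_accepted_outcome(total_cost=100.0, requests=10_000, pass_rate=0.95)
print(cheap_route > pricey_route)  # True
```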

Trade-offs

Self-hosted can reduce marginal cost but increase fixed cost. MaaS can accelerate evaluation but hide cache and routing behavior. Hybrid can be the best option, but only if the gateway hides the complexity from product teams.
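The fixed-versus-marginal trade-off has a simple break-even volume: the monthly fixed cost divided by the saving per unit of traffic. The sketch below assumes a flat MaaS price and a flat self-hosted marginal cost per million tokens, which real pricing rarely matches exactly. All names and numbers are hypothetical.

```python
def break_even_mtok(fixed_monthly_cost: float, maas_price_per_mtok: float,
                    self_hosted_marginal_per_mtok: float):
    """Monthly volume (millions of tokens) above which self-hosted is cheaper.

    Returns None when self-hosted has no marginal saving and never breaks even.
    """
    saving_per_mtok = maas_price_per_mtok - self_hosted_marginal_per_mtok
    if saving_per_mtok <= 0:
        return None
    return fixed_monthly_cost / saving_per_mtok


# Hypothetical: 20,000 per month fixed, 2.5 per Mtok on MaaS, 0.5 per Mtok marginal self-hosted.
print(break_even_mtok(20_000.0, 2.5, 0.5))  # 10000.0
```

Below that volume the fixed cost dominates, which is why uncertain demand favors MaaS.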

Moving between MaaS, self-hosted and hybrid changes more than token price. Cache semantics change too: TTL, write/read pricing, cache locality, routing affinity, eviction and the available observability fields. Cost the migration with a newly estimated hit rate, not only with the new model price.
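One way to see the effect is to blend cached and uncached input prices by the expected hit rate on each route. The sketch below is deliberately simplified: it ignores cache write pricing and output tokens, and all prices and hit rates are hypothetical.

```python
def effective_input_price(price_uncached: float, price_cached_read: float,
                          hit_rate: float) -> float:
    """Blended input price per million tokens for a given cache hit rate (0..1)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * price_cached_read + (1.0 - hit_rate) * price_uncached


# Hypothetical: the new route has a lower list price but worse cache affinity.
current_route = effective_input_price(price_uncached=3.0, price_cached_read=0.3, hit_rate=0.8)
new_route = effective_input_price(price_uncached=2.0, price_cached_read=0.5, hit_rate=0.3)
print(round(current_route, 2), round(new_route, 2))  # 0.84 1.55
```

In this example the route with the lower list price ends up nearly twice as expensive per input token once the lower hit rate is counted.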

Anti-patterns

Moving to GPUs because a spreadsheet says tokens are cheaper.
Counting only production GPUs and forgetting that staging, test and debug instances cost money while serving no user traffic.
Keeping every product team on direct provider integrations.
Running self-hosted models without model release notes, evals or rollback.

Checklist

✓ Demand is predictable enough for capacity planning.
✓ Quality is measured by scenario, not only model benchmark.
✓ Gateway can route between providers and self-hosted aliases.
✓ Fallback behavior is defined before migration.
✓ Self-hosted cost includes production, stage, test/debug, canary/rollout capacity and peak reserve.
✓ Cost comparison includes engineering, observability, evals and incidents.

Example

A customer support summarization flow might stay on MaaS during research, move heavy stable traffic to self-hosted inference, and keep MaaS fallback for spikes or quality regressions. That is not a permanent migration; it is a routing policy.
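Expressed as a routing policy, the example might look like the sketch below. It is illustrative only: the signal names, the thresholds and the route labels are hypothetical, and a real gateway would read them from its own metrics and configuration.

```python
def choose_route(self_hosted_load: float, recent_quality_pass_rate: float, *,
                 max_load: float = 0.85, min_quality: float = 0.90) -> str:
    """Pick a route for the next request in the summarization scenario."""
    # Spike: the self-hosted pool is near capacity, so overflow goes to MaaS.
    if self_hosted_load > max_load:
        return "maas"
    # Quality regression: the self-hosted model is failing the quality gate.
    if recent_quality_pass_rate < min_quality:
        return "maas"
    # Stable traffic with healthy quality stays on self-hosted inference.
    return "self-hosted"


print(choose_route(self_hosted_load=0.60, recent_quality_pass_rate=0.95))  # self-hosted
print(choose_route(self_hosted_load=0.95, recent_quality_pass_rate=0.95))  # maas
```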

Decision template

Document scenario, data boundary, demand shape, model candidates, cost model, expected cache hit rate, cache TTL, affinity strategy, prefix-aware routing, cache observability fields, capacity assumptions, quality gate, fallback and owner.
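The template can also live as a structured record, so that an incomplete decision is easy to detect. A minimal sketch: the field names mirror the list above, while the class name, the types and the helper are assumptions.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DecisionRecord:
    """One record per AI scenario; None means the item is not documented yet."""
    scenario: Optional[str] = None
    data_boundary: Optional[str] = None
    demand_shape: Optional[str] = None
    model_candidates: Optional[List[str]] = None
    cost_model: Optional[str] = None
    expected_cache_hit_rate: Optional[float] = None
    cache_ttl: Optional[str] = None
    affinity_strategy: Optional[str] = None
    prefix_aware_routing: Optional[bool] = None
    cache_observability_fields: Optional[List[str]] = None
    capacity_assumptions: Optional[str] = None
    quality_gate: Optional[str] = None
    fallback: Optional[str] = None
    owner: Optional[str] = None


def missing_fields(record: DecisionRecord) -> List[str]:
    """Names of template items that are still undocumented."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


record = DecisionRecord(scenario="support summarization", owner="platform team")
print(missing_fields(record))
```

A review can then refuse to approve a migration while `missing_fields` returns anything.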