MaaS vs Self-hosted
A strategy decision for managed APIs, self-hosted inference and hybrid AI platforms.
Problem
MaaS vs self-hosted is usually framed as a matter of religion: a closed API or your own GPUs. In production it is a strategy decision, and each option has different failure modes.
Symptoms
- The team wants cheaper tokens but has no capacity model.
- Leadership wants data control but underestimates reliability work.
- Product teams need model choice but lack a gateway contract.
Mental model
Use managed APIs for speed, breadth and research loops. Use self-hosted when control, data boundary, economics, latency or product strategy justify owning inference operations.
The main distinction: MaaS vs self-hosted is not a provider choice. It is an operating-model choice for a specific AI scenario.
Safe Scenario Migration
Do not start a migration by buying GPUs. First, the product team finds a model on which the scenario works at all, through MaaS, OpenRouter or another external provider. Then the platform team stands up a rough self-hosted candidate, checks that quality is reproducible, builds latency and cost profiles, runs evals, and only then starts a canary or provisions a dedicated pool.
- 01 Discovery: MaaS, OpenRouter or another external provider.
- 02 Draft self-hosted: rough launch without production promises.
- 03 Baseline: quality, latency and cost of the current route.
- 04 Evals: baseline comparison and stop criteria.
- 05 Canary: traffic share, fallback and incident owner.
A good migration is more than swapping model_name. It proves that the scenario still meets its quality, SLO and cost targets inside the new boundary.
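The Baseline, Evals and Canary steps can be sketched as a gate that compares the candidate route with the current one. This is a minimal sketch, not a prescribed implementation: `RouteProfile`, `StopCriteria`, `canary_blockers` and every threshold are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RouteProfile:
    """Measured profile of one route for one scenario (hypothetical fields)."""
    quality_pass_rate: float      # share of eval cases that pass the quality gate, 0..1
    p95_latency_ms: float
    cost_per_accepted_usd: float


@dataclass(frozen=True)
class StopCriteria:
    """Limits agreed before the canary starts (hypothetical fields)."""
    max_quality_drop: float       # tolerated absolute drop in pass rate
    max_p95_latency_ms: float     # latency SLO ceiling
    max_cost_ratio: float         # candidate cost / baseline cost ceiling


def canary_blockers(baseline: RouteProfile,
                    candidate: RouteProfile,
                    stop: StopCriteria) -> List[str]:
    """Return the reasons the candidate must not receive canary traffic."""
    blockers = []
    if baseline.quality_pass_rate - candidate.quality_pass_rate > stop.max_quality_drop:
        blockers.append("quality")
    if candidate.p95_latency_ms > stop.max_p95_latency_ms:
        blockers.append("latency")
    if candidate.cost_per_accepted_usd > stop.max_cost_ratio * baseline.cost_per_accepted_usd:
        blockers.append("cost")
    return blockers


# Illustrative numbers only.
baseline = RouteProfile(quality_pass_rate=0.92, p95_latency_ms=1800, cost_per_accepted_usd=0.040)
good = RouteProfile(quality_pass_rate=0.90, p95_latency_ms=1500, cost_per_accepted_usd=0.025)
bad = RouteProfile(quality_pass_rate=0.85, p95_latency_ms=2500, cost_per_accepted_usd=0.050)
stop = StopCriteria(max_quality_drop=0.03, max_p95_latency_ms=2000, max_cost_ratio=1.0)

print(canary_blockers(baseline, good, stop))
print(canary_blockers(baseline, bad, stop))
```

An empty list means the candidate may take a canary share. Any blocker keeps traffic on the current route.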
Architecture
| Option | Use when | Trade-off |
|---|---|---|
| MaaS | Fast iteration, broad model access, uncertain demand. | Less control over routing, cache internals and provider economics. |
| Self-hosted | Stable demand, data boundary, latency/cost control or custom serving needs. | You own GPU capacity, uptime, upgrades and incidents. |
| Hybrid | Production needs control but research still needs model breadth. | Requires gateway, routing policy and clear model lifecycle. |
Scenario-level Decision
The mistake is deciding "we are self-hosted now" or "we are MaaS now." The right level is a concrete AI scenario: data, SLA, volume, model quality, engineering cost and processing mode.
| Condition | Usually better |
|---|---|
| Spiky demand | MaaS |
| Frontier model required | MaaS |
| Data can be de-identified | MaaS or hybrid |
| Data cannot leave the boundary | self-hosted or on-premise |
| Stable high-volume workload | self-hosted |
| Task is not urgent | batch or deferred processing |
| Model customization is required | self-hosted |
| No evals and MLOps yet | caution: self-hosted is premature |
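The table can be read as a per-scenario rule set. The sketch below encodes it under assumptions the table itself does not state: the precedence between conditions (a hard data boundary wins, then frontier or spiky demand, then customization or stable volume) and the MaaS default are illustrative choices, and `Scenario` and `suggest` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Scenario:
    """Facts about one AI scenario (hypothetical fields mirroring the table)."""
    data_must_stay_inside: bool = False
    data_can_be_deidentified: bool = False
    needs_frontier_model: bool = False
    needs_customization: bool = False
    spiky_demand: bool = False
    stable_high_volume: bool = False
    urgent: bool = True
    has_evals_and_mlops: bool = False


def suggest(s: Scenario) -> Tuple[str, List[str]]:
    """Return (suggested route, notes). The precedence is an assumption."""
    notes = []
    if not s.urgent:
        notes.append("consider batch or deferred processing")

    if s.data_must_stay_inside:
        route = "self-hosted or on-premise"
    elif s.needs_frontier_model or s.spiky_demand:
        route = "MaaS"
    elif s.needs_customization or s.stable_high_volume:
        route = "self-hosted"
    elif s.data_can_be_deidentified:
        route = "MaaS or hybrid"
    else:
        route = "MaaS"  # assumed default while the scenario is still in discovery

    if route.startswith("self-hosted") and not s.has_evals_and_mlops:
        notes.append("self-hosted is premature without evals and MLOps")
    return route, notes


print(suggest(Scenario(spiky_demand=True)))
print(suggest(Scenario(data_must_stay_inside=True, urgent=False)))
```

The output is a starting point for discussion per scenario, not a platform-wide verdict.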
Metrics
Compare cost per accepted outcome, latency distribution, availability, quality gate pass rate, cache hit rate, utilization, engineering load and incident risk.
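Cost per accepted outcome divides total spend by the number of results that passed the quality gate, not by the number of requests. A minimal sketch, assuming the acceptance rate comes from your own evals or user feedback; the function name and the numbers are hypothetical.

```python
def cost_per_accepted_outcome(total_cost_usd: float,
                              requests: int,
                              acceptance_rate: float) -> float:
    """Total spend divided by the number of outcomes that passed the quality gate."""
    accepted = requests * acceptance_rate
    if accepted <= 0:
        raise ValueError("no accepted outcomes: the cost is undefined")
    return total_cost_usd / accepted


# Illustrative numbers: the cheaper route per request is not the cheaper
# route per accepted outcome once its lower acceptance rate is counted.
route_a = cost_per_accepted_outcome(total_cost_usd=1000.0, requests=100_000, acceptance_rate=0.80)
route_b = cost_per_accepted_outcome(total_cost_usd=800.0, requests=100_000, acceptance_rate=0.50)
print(f"route A: {route_a:.4f} USD per accepted outcome")
print(f"route B: {route_b:.4f} USD per accepted outcome")
```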
Trade-offs
Self-hosted can reduce marginal cost but increase fixed cost. MaaS can accelerate evaluation but hide cache and routing behavior. Hybrid can be the best option, but only if the gateway hides the complexity from product teams.
Moving between MaaS, self-hosted and hybrid changes more than token price. Cache semantics change too: TTL, write/read pricing, cache locality, routing affinity, eviction and the observability fields available. Cost the migration with a new expected hit rate, not only with a new model price.
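The effect of hit rate on input cost can be sketched with a simple blended price. This is a deliberately simplified model: it ignores cache write surcharges, TTL and eviction, and all prices and hit rates are hypothetical.

```python
def blended_input_price(hit_rate: float,
                        cached_price: float,
                        uncached_price: float) -> float:
    """Expected price per million input tokens at a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cached_price + (1.0 - hit_rate) * uncached_price


# Hypothetical numbers, USD per million input tokens.
# The candidate has a lower list price but loses most of the cache hits.
current = blended_input_price(hit_rate=0.70, cached_price=0.30, uncached_price=3.00)
candidate = blended_input_price(hit_rate=0.20, cached_price=0.50, uncached_price=2.00)
print(f"current route:   {current:.2f}")
print(f"candidate route: {candidate:.2f}")
```

With these assumed numbers the route with the lower list price ends up more expensive per input token, which is why the expected hit rate belongs in the migration estimate.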
Anti-patterns
- Moving to GPUs because a spreadsheet says tokens are cheaper.
- Counting only production GPUs and forgetting that staging, test and debug instances cost money while serving no user traffic.
- Keeping every product team on direct provider integrations.
- Running self-hosted models without model release notes, evals or rollback.
Checklist
- ✓ Demand is predictable enough for capacity planning.
- ✓ Quality is measured by scenario, not only model benchmark.
- ✓ Gateway can route between providers and self-hosted aliases.
- ✓ Fallback behavior is defined before migration.
- ✓ Self-hosted cost includes production, stage, test/debug, canary/rollout capacity and peak reserve.
- ✓ Cost comparison includes engineering, observability, evals and incidents.
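The capacity items in the checklist can be sketched as a fleet model that counts every GPU pool, not only production. `GpuFleet`, the pool sizes, the hourly rate and the 730 hours per month are assumptions for illustration. Engineering, observability, evals and incident costs are not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuFleet:
    """GPU counts per pool (hypothetical sizes)."""
    production: int
    stage: int
    test_debug: int
    canary_rollout: int
    peak_reserve: int

    def total(self) -> int:
        return (self.production + self.stage + self.test_debug
                + self.canary_rollout + self.peak_reserve)


def monthly_gpu_cost(gpus: int, usd_per_gpu_hour: float, hours_per_month: int = 730) -> float:
    """Cost of keeping `gpus` GPUs allocated for a whole month."""
    return gpus * usd_per_gpu_hour * hours_per_month


fleet = GpuFleet(production=8, stage=2, test_debug=2, canary_rollout=2, peak_reserve=2)
naive = monthly_gpu_cost(fleet.production, usd_per_gpu_hour=2.0)   # production only
full = monthly_gpu_cost(fleet.total(), usd_per_gpu_hour=2.0)       # every pool
outside_production = 1 - fleet.production / fleet.total()

print(f"production-only estimate: {naive:,.0f} USD/month")
print(f"full fleet estimate:      {full:,.0f} USD/month")
print(f"share outside production: {outside_production:.0%}")
```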
Example
A customer support summarization flow might stay on MaaS during research, move heavy stable traffic to self-hosted inference, and keep MaaS fallback for spikes or quality regressions. That is not a permanent migration; it is a routing policy.
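The routing policy in this example can be sketched as a small per-request decision. The signal names, the quality threshold and the route labels are hypothetical. A real gateway would read them from health checks, eval dashboards and load metrics.

```python
def choose_route(self_hosted_healthy: bool,
                 quality_pass_rate: float,
                 in_flight: int,
                 capacity: int,
                 min_quality: float = 0.90) -> str:
    """Prefer self-hosted inference; fall back to MaaS on outage, regression or spike."""
    if not self_hosted_healthy:
        return "maas"            # incident on the self-hosted pool
    if quality_pass_rate < min_quality:
        return "maas"            # quality regression on the self-hosted model
    if in_flight >= capacity:
        return "maas"            # spike beyond provisioned capacity
    return "self-hosted"


print(choose_route(True, quality_pass_rate=0.95, in_flight=40, capacity=64))
print(choose_route(True, quality_pass_rate=0.95, in_flight=64, capacity=64))
print(choose_route(True, quality_pass_rate=0.80, in_flight=10, capacity=64))
```

Keeping the decision in one function at the gateway means product teams call a stable alias and never choose the backend themselves.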
Decision template
Document scenario, data boundary, demand shape, model candidates, cost model, expected cache hit rate, cache TTL, affinity strategy, prefix-aware routing, cache observability fields, capacity assumptions, quality gate, fallback and owner.
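One way to keep the template honest is to store it as a record and list what is still blank before the decision is reviewed. A minimal sketch: the field names follow the template, while the types, `DecisionRecord` and `missing_fields` are assumptions.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class DecisionRecord:
    """One record per AI scenario; every field starts unfilled."""
    scenario: Optional[str] = None
    data_boundary: Optional[str] = None
    demand_shape: Optional[str] = None
    model_candidates: Optional[List[str]] = None
    cost_model: Optional[str] = None
    expected_cache_hit_rate: Optional[float] = None
    cache_ttl_seconds: Optional[int] = None
    affinity_strategy: Optional[str] = None
    prefix_aware_routing: Optional[bool] = None
    cache_observability_fields: Optional[List[str]] = None
    capacity_assumptions: Optional[str] = None
    quality_gate: Optional[str] = None
    fallback: Optional[str] = None
    owner: Optional[str] = None


def missing_fields(record: DecisionRecord) -> List[str]:
    """Names of fields that are None or empty (empty string or empty list)."""
    missing = []
    for f in fields(record):
        value = getattr(record, f.name)
        if value is None or (isinstance(value, (str, list)) and len(value) == 0):
            missing.append(f.name)
    return missing


record = DecisionRecord(scenario="customer support summarization", owner="platform team")
print(missing_fields(record))
```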