Ownership and Operating Model
Who owns production AI scenarios, platform layers, cost, quality and incidents.
Problem
Production AI fails when ownership is split across model research, product teams, infra and support without an operating model.
Symptoms
- Product owns UX but not model quality.
- Platform owns serving but not scenario outcomes.
- Nobody owns cost review.
- Incidents are debugged by whoever knows the prompt.
Mental model
The AI platform should enable product teams without becoming a bottleneck. That requires a clear RACI (responsible, accountable, consulted, informed), golden paths and escalation rules.
During a migration to self-hosted models, the product team owns the scenario, the success criteria and the initial model discovery. The platform team owns the self-hosted candidate, capacity, latency profile, eval gate, fallback and operations.
Without this split, migration becomes either research without production responsibility or an infrastructure project without product success criteria.
Architecture
The operating model includes scenario intake, architecture review, the model release process, quality review, cost review, the incident process, hiring profiles, the support model, onboarding and platform DevEx.
A production AI scenario needs owners for route policy, context budget, tool policy and cache strategy. Otherwise cost and quality become nobody's job: product sees UX, platform sees GPUs, but nobody owns the full agent trajectory.
| Area | Owner | Scope |
|---|---|---|
| quality | scenario quality | Product success criterion and dataset. |
| route | route policy | Direct, RAG, agentic, human review and fallback. |
| prompt | prompt family | Prompt versions and model compatibility. |
| tools | tool registry | Tool stability, allowed tools and audit. |
| cache | cache hit SLO | Prefix stability, route affinity and regression. |
| cost | cost review | Scenario cost and accepted outcome. |
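One way to keep the ownership areas above from silently going unowned is to record them per scenario and check for gaps. The following is a minimal sketch, not a prescribed schema: the `ScenarioOwnership` class, the area identifiers and the team names are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field

# Ownership areas taken from the text above; the identifiers are illustrative.
OWNERSHIP_AREAS = ("quality", "route", "prompt", "tools", "cache", "cost")


@dataclass
class ScenarioOwnership:
    """Hypothetical record mapping each ownership area to a named owner."""

    scenario: str
    owners: dict[str, str] = field(default_factory=dict)

    def unowned_areas(self) -> list[str]:
        """Return areas with a missing or blank owner, in canonical order."""
        return [a for a in OWNERSHIP_AREAS if not self.owners.get(a, "").strip()]


# Example scenario with assumed team names; "cost" is present but left blank.
support_bot = ScenarioOwnership(
    scenario="support-bot",
    owners={
        "quality": "product-support",
        "route": "ai-platform",
        "tools": "ai-platform",
        "cost": "",
    },
)

gaps = support_bot.unowned_areas()
print(gaps)  # ['prompt', 'cache', 'cost']
```

A check like this could run at scenario intake or in CI, so that a scenario with an empty owner slot is flagged before it reaches production.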
What MaaS Used To Hide
MaaS (model as a service) removes part of the operating responsibility: scaling, access to new models, part of the guardrails, runtime operations and upgrades. In self-hosted setups, these responsibilities move back to the team.
| MaaS hides | Self-hosted returns |
|---|---|
| scale-out | capacity planning |
| runtime updates | vLLM and runtime upgrades |
| part of guardrails | owned policies |
| model lifecycle | release process |
| provider SLA | owned SLO |
| token bill | people, hardware and process |
Principle
Self-hosted gives control, but returns responsibility.
Metrics
Track scenario onboarding time, self-service adoption, incident MTTR, cost review coverage, eval coverage, number of undocumented exceptions and platform support load.
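Several of these metrics reduce to simple ratios and averages once the underlying records exist. The sketch below computes incident MTTR, cost review coverage and eval coverage from in-memory sample data. The record shapes, field names (`cost_reviewed`, `has_eval`) and scenario names are assumptions made for illustration, not an existing schema.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average (resolved - opened) over resolved incidents; None if there are none."""
    durations = [resolved - opened for opened, resolved in incidents if resolved is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def coverage(scenarios, flag):
    """Share of scenarios for which `flag` is truthy, as a fraction in [0, 1]."""
    if not scenarios:
        return 0.0
    return sum(1 for s in scenarios if s.get(flag)) / len(scenarios)


# Hypothetical incidents as (opened, resolved) pairs; the last one is still open.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 30)),
    (datetime(2024, 1, 3, 9, 0), None),
]

# Hypothetical scenario registry entries.
scenarios = [
    {"name": "support-bot", "cost_reviewed": True, "has_eval": True},
    {"name": "search-summary", "cost_reviewed": False, "has_eval": True},
    {"name": "code-review", "cost_reviewed": True, "has_eval": False},
    {"name": "doc-qa", "cost_reviewed": False, "has_eval": True},
]

mttr = mean_time_to_restore(incidents)
cost_review_coverage = coverage(scenarios, "cost_reviewed")
eval_coverage = coverage(scenarios, "has_eval")

print(mttr)                  # 0:45:00
print(cost_review_coverage)  # 0.5
print(eval_coverage)         # 0.75
```

Open incidents are excluded from the average here. Whether to count them (for example, against the current time) is a policy choice the operating model should state explicitly.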
Trade-offs
Central platform ownership improves consistency but can slow teams. Product ownership improves speed but can fragment safety, cost and telemetry. Mature organizations split responsibilities explicitly.
Anti-patterns
- Platform team approves every prompt manually.
- Product teams own provider keys directly.
- Incidents have no model, prompt or scenario owner.
- Hiring profiles ignore evaluation and platform DevEx skills.
Checklist
- ✓ Every scenario has a product owner and a platform owner.
- ✓ Model releases have quality and cost review.
- ✓ The incident process includes prompt/model/tool rollback.
- ✓ Product teams have SDKs, templates and playgrounds.
- ✓ Scenario owners exist for route policy, context budget, tool policy and cache strategy.
- ✓ Exceptions to the platform path are tracked and reviewed.
Example
A platform team can own the gateway, model aliases, eval tooling and observability. Product teams own scenario acceptance, UX, business feedback and product-specific risk. Both sign off on rollout.
Decision template
For each scenario, document RACI, platform contract, product owner, quality owner, cost owner, incident owner and escalation path.
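The template can be kept as a structured record so that incomplete entries are easy to detect. The following is a minimal sketch under stated assumptions: the `ScenarioDecision` class, its field names and the example values are hypothetical and simply mirror the items listed above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ScenarioDecision:
    """Hypothetical per-scenario decision record mirroring the template items."""

    scenario: str
    raci: dict[str, str] = field(default_factory=dict)  # RACI letter -> team
    platform_contract: str = ""
    product_owner: str = ""
    quality_owner: str = ""
    cost_owner: str = ""
    incident_owner: str = ""
    escalation_path: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """Names of template items that are still empty, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Draft entry with assumed team names; cost owner and escalation path are not set.
draft = ScenarioDecision(
    scenario="support-bot",
    raci={"R": "product-support", "A": "product-support", "C": "ai-platform", "I": "finops"},
    platform_contract="gateway, model alias and eval gate",
    product_owner="product-support",
    quality_owner="product-support",
    incident_owner="ai-platform",
)

missing = draft.missing_fields()
print(missing)  # ['cost_owner', 'escalation_path']
```

Treating a non-empty `missing_fields()` result as a blocker for rollout sign-off is one possible way to enforce the template. The threshold for what counts as complete is a team decision.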