Ownership and Operating Model

Problem

Production AI fails when ownership is split across model research, product teams, infrastructure and support with no operating model to connect them.

Symptoms

Product owns UX but not model quality.
Platform owns serving but not scenario outcomes.
Nobody owns cost review.
Incidents are debugged by whoever knows the prompt.

Mental model

The AI platform should enable product teams without becoming a bottleneck. That requires a clear RACI, golden paths and escalation rules.

During self-hosted migration, the product team owns the scenario, success criteria and initial model discovery. The platform owns the self-hosted candidate, capacity, latency profile, eval gate, fallback and operations.

Without this split, migration becomes either research without production responsibility or an infrastructure project without product success criteria.

Architecture

The operating model covers scenario intake, architecture review, the model release process, quality review, cost review, the incident process, hiring profiles, the support model, onboarding and platform DevEx.

A production AI scenario needs owners for route policy, context budget, tool policy and cache strategy. Otherwise cost and quality become nobody's job: product sees UX, platform sees GPUs, but nobody owns the full agent trajectory.

AI scenario RACI

quality

Owner: scenario quality

Product success criterion and dataset.

route

Owner: route policy

Direct, RAG, agentic, human review and fallback routes.

prompt

Owner: prompt family

Prompt versions and model compatibility.

tools

Owner: tool registry

Tool stability, allowed tools and audit.

cache

Owner: cache hit SLO

Prefix stability, route affinity and regression.

cost

Owner: cost review

Scenario cost and accepted outcome.
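
The ownership areas above can be kept as data and checked automatically. The following is a minimal sketch, assuming each area should have exactly one accountable ("A") team; the `Assignment` type, the `raci_gaps` helper and the team names are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass

# Ownership areas taken from the AI scenario RACI above.
REQUIRED_AREAS = ("quality", "route", "prompt", "tools", "cache", "cost")


@dataclass(frozen=True)
class Assignment:
    area: str  # one of REQUIRED_AREAS
    team: str  # hypothetical team name
    role: str  # "R", "A", "C" or "I"


def raci_gaps(assignments):
    """Map each area to a problem if it lacks exactly one accountable team."""
    gaps = {}
    for area in REQUIRED_AREAS:
        accountable = [a.team for a in assignments if a.area == area and a.role == "A"]
        if not accountable:
            gaps[area] = "no accountable owner"
        elif len(accountable) > 1:
            gaps[area] = "multiple accountable owners: " + ", ".join(sorted(accountable))
    return gaps


# Illustrative assignments: "cache" has two accountable teams, "cost" has none.
example = [
    Assignment("quality", "product-search", "A"),
    Assignment("route", "ai-platform", "A"),
    Assignment("prompt", "product-search", "A"),
    Assignment("tools", "ai-platform", "A"),
    Assignment("cache", "ai-platform", "A"),
    Assignment("cache", "product-search", "A"),
]
print(raci_gaps(example))
```

In this example the check reports the cache area as doubly owned and the cost area as unowned, which is the "nobody's job" situation described earlier.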

What MaaS Used To Hide

MaaS (model as a service) removes part of the operating responsibility: scaling, access to new models, part of the guardrails, runtime operations and upgrades. In self-hosted setups, these responsibilities move back to the team.

MaaS hides	Self-hosted returns
scale-out	capacity planning
runtime updates	vLLM and runtime upgrades
part of guardrails	owned policies
model lifecycle	release process
provider SLA	owned SLO
token bill	people, hardware and process

Principle

Self-hosted gives control, but returns responsibility.

Metrics

Track scenario onboarding time, self-service adoption, incident MTTR, cost review coverage, eval coverage, number of undocumented exceptions and platform support load.
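
Several of these metrics reduce to simple ratios and averages over records the team already keeps. The sketch below assumes incidents and scenarios are available as plain dictionaries; the field names (`opened_at`, `resolved_at`, `cost_reviewed`, `has_eval`) and the sample data are illustrative assumptions, not a defined schema.

```python
from datetime import datetime
from statistics import mean


def mttr_hours(incidents):
    """Mean time to resolve, in hours, over resolved incidents only."""
    durations = [
        (i["resolved_at"] - i["opened_at"]).total_seconds() / 3600
        for i in incidents
        if i.get("resolved_at") is not None
    ]
    return mean(durations) if durations else None


def coverage(scenarios, flag):
    """Share of scenarios for which the boolean field `flag` is true."""
    if not scenarios:
        return 0.0
    return sum(1 for s in scenarios if s.get(flag)) / len(scenarios)


# Illustrative records; the last incident is still open and is excluded from MTTR.
incidents = [
    {"opened_at": datetime(2025, 3, 1, 10, 0), "resolved_at": datetime(2025, 3, 1, 12, 0)},
    {"opened_at": datetime(2025, 3, 2, 9, 0), "resolved_at": datetime(2025, 3, 2, 13, 0)},
    {"opened_at": datetime(2025, 3, 3, 8, 0), "resolved_at": None},
]
scenarios = [
    {"name": "search-answers", "cost_reviewed": True, "has_eval": True},
    {"name": "ticket-triage", "cost_reviewed": True, "has_eval": False},
    {"name": "doc-summary", "cost_reviewed": True, "has_eval": True},
    {"name": "sales-assistant", "cost_reviewed": False, "has_eval": False},
]

print("incident MTTR (h):", mttr_hours(incidents))
print("cost review coverage:", coverage(scenarios, "cost_reviewed"))
print("eval coverage:", coverage(scenarios, "has_eval"))
```

Whether open incidents should count toward MTTR is a policy choice; this sketch excludes them.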

Trade-offs

Central platform ownership improves consistency but can slow teams. Product ownership improves speed but can fragment safety, cost and telemetry. Mature organizations split responsibilities explicitly.

Anti-patterns

Platform team approves every prompt manually.
Product teams own provider keys directly.
Incidents have no model, prompt or scenario owner.
Hiring profile ignores evaluation and platform DevEx skills.

Checklist

✓ Every scenario has a product owner and a platform owner.
✓ Model releases have quality and cost review.
✓ The incident process includes prompt, model and tool rollback.
✓ Product teams have SDKs, templates and playgrounds.
✓ Scenario owners exist for route policy, context budget, tool policy and cache strategy.
✓ Exceptions to the platform path are tracked and reviewed.
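
The last checklist item is easy to automate once exceptions live in a list. This is a rough sketch, assuming a 90-day review cadence and a record shape with `id` and `last_reviewed` fields; both the cadence and the field names are assumptions chosen for illustration.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # assumed review cadence, adjust to local policy


def overdue_exceptions(exceptions, today):
    """Return ids of exceptions never reviewed or last reviewed too long ago."""
    overdue = []
    for exc in exceptions:
        last = exc.get("last_reviewed")
        if last is None or today - last > REVIEW_INTERVAL:
            overdue.append(exc["id"])
    return overdue


# Illustrative exception log with hypothetical ids and reasons.
exceptions = [
    {"id": "EXC-1", "reason": "direct provider key", "last_reviewed": date(2025, 5, 1)},
    {"id": "EXC-2", "reason": "custom gateway route", "last_reviewed": date(2025, 1, 15)},
    {"id": "EXC-3", "reason": "unversioned prompt", "last_reviewed": None},
]
print(overdue_exceptions(exceptions, today=date(2025, 6, 1)))
```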

Example

A platform team can own the gateway, model aliases, eval tooling and observability. Product teams own scenario acceptance, UX, business feedback and product-specific risk. Both sign off on rollout.
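
The joint sign-off can be enforced as a small gate in the release tooling. The sketch below assumes exactly two required sides, platform and product; the `Rollout` class, the scenario name and the model alias are hypothetical.

```python
from dataclasses import dataclass, field

# Both sides named in the example must approve before a rollout ships.
REQUIRED_SIGNOFFS = frozenset({"platform", "product"})


@dataclass
class Rollout:
    scenario: str
    model_alias: str
    signoffs: set = field(default_factory=set)

    def sign_off(self, side):
        """Record approval from one side; reject unknown sides."""
        if side not in REQUIRED_SIGNOFFS:
            raise ValueError(f"unknown sign-off side: {side}")
        self.signoffs.add(side)

    def missing(self):
        """Sides that have not approved yet, in alphabetical order."""
        return sorted(REQUIRED_SIGNOFFS - self.signoffs)

    def can_ship(self):
        return not self.missing()


rollout = Rollout(scenario="support-assistant", model_alias="chat-default")
rollout.sign_off("platform")
print(rollout.can_ship(), rollout.missing())
rollout.sign_off("product")
print(rollout.can_ship(), rollout.missing())
```

Keeping the gate this small avoids the anti-pattern of the platform team approving every prompt manually: only rollouts need both approvals.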

Decision template

For each scenario, document RACI, platform contract, product owner, quality owner, cost owner, incident owner and escalation path.
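
One way to keep the template honest is to store it as a structured record and list whatever is still blank. This is a minimal sketch; the `ScenarioDecision` type, the sample scenario, the owner names and the contract path are all hypothetical placeholders.

```python
from dataclasses import dataclass, fields


@dataclass
class ScenarioDecision:
    scenario: str
    raci: str = ""               # reference to the scenario's RACI
    platform_contract: str = ""  # reference to the platform contract
    product_owner: str = ""
    quality_owner: str = ""
    cost_owner: str = ""
    incident_owner: str = ""
    escalation_path: str = ""


def missing_fields(decision):
    """Names of template fields that are still empty, in declaration order."""
    return [
        f.name
        for f in fields(decision)
        if not str(getattr(decision, f.name)).strip()
    ]


# Illustrative, partly filled record.
draft = ScenarioDecision(
    scenario="invoice-extraction",
    platform_contract="docs/contracts/invoice-extraction.md",
    product_owner="alice",
    quality_owner="alice",
)
print(missing_fields(draft))
```

A scenario intake step could refuse to proceed while this list is non-empty.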
