MLOps & DevOps for AI Systems

Cost & Governance for AI Infrastructure

18 min Lesson 9 of 28

At every other point in a typical DevOps career, you optimize for latency, reliability, and velocity. In ML infrastructure those three still matter, but a fourth dimension dominates the operational agenda: cost. A single A100 GPU hour typically costs roughly $3 to $6 on demand on the major clouds. A 500-node training run can invoice $200,000 before the model has shipped a single prediction. Inference fleets for a large LLM serving 50 million daily active users can run on the order of $10M/month. Governance (knowing which team spent what, on which model, approved by whom, with what compliance trail) is the discipline that keeps that spend defensible and the organization legally protected.

This lesson covers the three operational pillars that separate a mature ML platform from an uncontrolled research playground: GPU cost control at the infrastructure layer, model governance across the model lifecycle, and responsible rollout practices that reduce both technical and organizational risk when a new model touches production users.

GPU Cost Control

GPU waste is the ML equivalent of idle EC2 instances — except the per-unit cost is 30x higher. The four highest-leverage controls are right-sizing, spot/preemptible instances, bin-packing, and idle detection.

Right-sizing. Model developers routinely request A100s out of habit even when a T4 or A10G would suffice. Enforce a tiered request form: developers must specify why the requested SKU is necessary. Use GPU memory profiling (nvidia-smi --query-gpu=memory.used --format=csv or the dcgm-exporter Prometheus integration) to build a heat map of actual peak utilization by team. Jobs that peak below 40% of GPU memory should be migrated down a tier. At Google, this is enforced via Borg quotas; at AWS and Azure, it maps to instance-family quotas per IAM role.
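The 40% rule above can be expressed as a small recommendation function. This is a minimal sketch, not a production policy: the tier ladder, the JobPeak record, and the recommend_sku name are assumptions made for illustration, and it looks only at memory, whereas a real decision would also weigh compute utilization.

```python
from dataclasses import dataclass

# Hypothetical tier ladder, ordered from smallest to largest GPU memory (GiB).
GPU_TIERS = [("T4", 16), ("A10G", 24), ("A100", 40)]
DOWNSIZE_THRESHOLD = 0.40  # from the text: jobs peaking below 40% move down a tier


@dataclass
class JobPeak:
    job: str
    team: str
    sku: str             # SKU the job currently requests
    peak_mem_gib: float  # peak memory used, e.g. exported by dcgm-exporter


def recommend_sku(job: JobPeak) -> str:
    """Return the SKU one tier down when the job peaks below 40% of its
    current SKU's memory and still fits the smaller SKU; otherwise keep it."""
    names = [name for name, _ in GPU_TIERS]
    idx = names.index(job.sku)
    capacity = GPU_TIERS[idx][1]
    if idx == 0 or job.peak_mem_gib / capacity >= DOWNSIZE_THRESHOLD:
        return job.sku
    smaller_name, smaller_capacity = GPU_TIERS[idx - 1]
    return smaller_name if job.peak_mem_gib <= smaller_capacity else job.sku
```

Running this over a week of per-job peaks gives the platform team a concrete migration list to discuss with each owning team, rather than a blanket mandate.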

Spot and preemptible instances. Training is typically the largest GPU bill, and it is inherently resumable via checkpointing. On AWS, Spot is often 60–80% cheaper than On-Demand for P4d/P3 instances, though the discount varies by region and over time. The operational requirement is a clean fault-tolerance contract: your training code must checkpoint to S3 every N steps and resume from the latest checkpoint on restart. A Karpenter NodePool on Kubernetes or a Spot-aware SageMaker training job handles the interruption (a two-minute notice on AWS, which reaches the container as SIGTERM when the node drains) and re-queues the job. Inference on Spot is viable only for stateless, sharded models behind a load balancer, and even then you should maintain 30% On-Demand base capacity to absorb reclamation spikes.
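The fault-tolerance contract described above (checkpoint every N steps, resume on restart, flush on SIGTERM) can be sketched in a few lines. This is an illustration under stated assumptions: ResumableTrainer is a hypothetical class, the "training step" is a counter increment, and the checkpoint goes to a local JSON file rather than S3.

```python
import json
import os
import signal


class ResumableTrainer:
    """Toy training loop that checkpoints periodically and resumes on restart."""

    def __init__(self, ckpt_path: str, checkpoint_every: int = 100):
        self.ckpt_path = ckpt_path
        self.checkpoint_every = checkpoint_every
        self.step = 0
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None) -> None:
        # Signal-handler signature; only sets a flag so the loop exits cleanly.
        self.stop_requested = True

    def save(self) -> None:
        # Write to a temp file, then rename atomically, so an interruption
        # mid-write never leaves a corrupt checkpoint behind.
        tmp_path = self.ckpt_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"step": self.step}, f)
        os.replace(tmp_path, self.ckpt_path)

    def load(self) -> None:
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.step = json.load(f)["step"]

    def train(self, total_steps: int) -> int:
        self.load()  # resume from the last checkpoint, if any
        while self.step < total_steps and not self.stop_requested:
            self.step += 1  # placeholder for one real optimizer step
            if self.step % self.checkpoint_every == 0:
                self.save()
        self.save()  # final flush, also reached after a stop request
        return self.step


def install_sigterm_handler(trainer: ResumableTrainer) -> None:
    """Route SIGTERM (sent when the node drains) to the trainer's stop flag."""
    signal.signal(signal.SIGTERM, trainer.request_stop)
```

The handler only sets a flag instead of saving directly, which keeps the checkpoint write out of signal context and lets the loop finish its current step first.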

Bin-packing with MIG and MPS. NVIDIA Multi-Instance GPU (MIG) partitions an A100 into up to 7 independent GPU instances, each with its own memory slice and execution engines. Multi-Process Service (MPS) lets multiple CUDA processes share one GPU through a single shared context, so their kernels can run concurrently instead of being time-sliced; unlike MIG, it does not provide hardware-level isolation. For inference workloads where no single model saturates the GPU, MIG and MPS can achieve 4–6x better hardware utilization. Kubernetes exposes MIG slices as extended resources (for example nvidia.com/mig-3g.20gb); request them in your Pod spec exactly as you would a full GPU.
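To make the bin-packing idea concrete, here is a first-fit-decreasing sketch that packs MIG slice requests onto as few A100 40GB cards as it can. It is a simplification under stated assumptions: it models each card as 7 compute slices and 8 memory slices and ignores NVIDIA's stricter placement rules, and the pack_mig_requests name is hypothetical.

```python
# (compute slices, memory slices) consumed by each MIG profile on an A100 40GB.
MIG_PROFILES = {
    "1g.5gb": (1, 1),
    "2g.10gb": (2, 2),
    "3g.20gb": (3, 4),
    "4g.20gb": (4, 4),
    "7g.40gb": (7, 8),
}
GPU_COMPUTE_SLICES = 7
GPU_MEMORY_SLICES = 8


def pack_mig_requests(requests: list[str]) -> list[list[str]]:
    """First-fit decreasing: place the largest profiles first, opening a new
    GPU only when no existing GPU has enough compute and memory slices left."""
    gpus: list[dict] = []
    for profile in sorted(requests, key=lambda p: MIG_PROFILES[p], reverse=True):
        compute, memory = MIG_PROFILES[profile]
        for gpu in gpus:
            fits_compute = gpu["compute"] + compute <= GPU_COMPUTE_SLICES
            fits_memory = gpu["memory"] + memory <= GPU_MEMORY_SLICES
            if fits_compute and fits_memory:
                gpu["compute"] += compute
                gpu["memory"] += memory
                gpu["profiles"].append(profile)
                break
        else:
            gpus.append({"compute": compute, "memory": memory, "profiles": [profile]})
    return [gpu["profiles"] for gpu in gpus]
```

Note that two 3g.20gb instances exhaust the memory slices of a card even though one compute slice remains, which is why the sketch tracks both dimensions.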

Idle detection and auto-shutdown. Jupyter notebook servers left running overnight are the classic waste sink. A platform-level idle watcher (the jupyterhub-idle-culler service, or a custom controller that polls the Kernel Gateway /api/kernels endpoint) should terminate any session idle for more than 30 minutes during business hours and 60 minutes overnight. Extend this to training jobs: if a job has not consumed GPU cycles in the last 10 minutes (detectable via the dcgm-exporter metric DCGM_FI_DEV_GPU_UTIL dropping to zero), emit an alert and optionally cancel the job. A hung training loop silently billing $500/hour is a common production incident.
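The two idle rules above reduce to simple timestamp comparisons. The sketch below assumes the watcher has already collected last-activity times and GPU utilization samples; the 09:00 to 18:00 business-hours window and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta

BUSINESS_HOURS = range(9, 18)  # assumption: 09:00 to 17:59 local time
HUNG_JOB_WINDOW = timedelta(minutes=10)


def idle_limit(now: datetime) -> timedelta:
    """30 minutes during business hours, 60 minutes otherwise (per the text)."""
    return timedelta(minutes=30 if now.hour in BUSINESS_HOURS else 60)


def sessions_to_cull(last_activity: dict[str, datetime], now: datetime) -> list[str]:
    """Return the notebook sessions whose idle time exceeds the current limit."""
    limit = idle_limit(now)
    return sorted(name for name, seen in last_activity.items() if now - seen > limit)


def job_is_hung(samples: list[tuple[datetime, float]], now: datetime) -> bool:
    """True when every GPU utilization sample in the last 10 minutes is zero.
    With no recent samples we return False rather than guess."""
    recent = [util for ts, util in samples if now - ts <= HUNG_JOB_WINDOW]
    return bool(recent) and all(util == 0 for util in recent)
```

Treating "no recent samples" as not hung avoids cancelling a job just because the metrics pipeline itself is lagging.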

[Figure: ML Platform Cost Governance Architecture. Cost Control Layer: Spot/Preemptible (60-80% savings), MIG/MPS Bin-Pack (4-6x utilization), Right-Sizing (dcgm-exporter metrics), Idle Detection (auto-shutdown hooks), Chargeback/Showback Engine (Kubernetes labels → team cost allocation), Budget Alerts & Quotas (threshold → Slack / block new jobs). Governance Layer: Model Registry + Approval Gate (MLflow / SageMaker Model Registry), Lineage Tracking (data → model → serving), Bias/Fairness Checks (CI gate pre-promotion), Canary/Shadow Rollout (traffic split + business metric SLOs), Audit Log & RBAC (who promoted what, when, approved by).]
ML Platform cost control and governance layers — cost visibility on the left feeds approval and rollout controls on the right.

Chargeback and Showback

Cost control without attribution is invisible. The standard approach is to label every GPU-using Kubernetes workload with team, project, model-name, and env labels, then join node cost data from the cloud billing API with Prometheus metrics by node and namespace. Tools like Kubecost or OpenCost perform this join automatically and expose a per-namespace cost breakdown. Pipe that into your internal finance dashboard for showback (teams see their bill but are not charged back), or hook it into a quota system for true chargeback.
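The join that Kubecost or OpenCost performs can be illustrated with plain dictionaries. This is a minimal sketch, not how either tool is implemented: the usage rows, the per-node hourly rates, and the allocate_cost name are all assumptions, and unlabeled workloads are collected under an "unallocated" bucket so they stay visible.

```python
from collections import defaultdict


def allocate_cost(usage: list[dict], hourly_rates: dict[str, float]) -> dict[str, float]:
    """Join GPU-hours (from workload metrics, keyed by the `team` label) with
    the per-GPU-hour price of the node each workload ran on.

    usage rows look like {"team": "search", "node": "n1", "gpu_hours": 10}.
    """
    totals: dict[str, float] = defaultdict(float)
    for row in usage:
        team = row.get("team") or "unallocated"  # missing label: keep it visible
        totals[team] += row["gpu_hours"] * hourly_rates[row["node"]]
    return dict(totals)
```

For example, 10 GPU-hours on a $4.00 node plus 5 GPU-hours on a $1.00 node attributes $45.00 to the owning team. A growing "unallocated" total is itself a useful signal that labeling discipline is slipping.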

Budget alerts are the operational enforcement mechanism. Configure a cloud billing alert at 80% of the monthly GPU budget; at 100%, the platform controller should stop accepting new training job submissions from that team until a budget exception is approved. This is not punitive — it is the only mechanism that makes ML spend a first-class engineering concern rather than a retrospective accounting surprise.
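The two thresholds above amount to a three-state policy that a platform controller could evaluate before admitting a job. A minimal sketch, assuming a hypothetical budget_action function and an explicit flag for an approved budget exception:

```python
def budget_action(spend_usd: float, budget_usd: float, exception_approved: bool = False) -> str:
    """Map month-to-date GPU spend to a platform action.

    Returns "ok" below 80% of budget, "alert" from 80%, and "block" from 100%
    unless a budget exception has been approved (then it stays at "alert").
    """
    if budget_usd <= 0:
        raise ValueError("budget_usd must be positive")
    ratio = spend_usd / budget_usd
    if ratio >= 1.0 and not exception_approved:
        return "block"
    if ratio >= 0.8:
        return "alert"
    return "ok"
```

Keeping the exception as an explicit input means the override is recorded alongside the decision instead of being an out-of-band quota edit.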

# terraform/modules/mlplatform/budget_alert.tf
# Hard budget guard per team: blocks new jobs when monthly GPU spend crosses threshold.
resource "aws_budgets_budget" "ml_team_gpu" {
  for_each = var.ml_teams # { "search" = 40000, "recommendations" = 60000 }

  name         = "ml-gpu-${each.key}"
  budget_type  = "COST"
  limit_amount = tostring(each.value)
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name = "TagKeyValue"
    # AWS Budgets expects "user:<tag-key>$<tag-value>". format() is used because
    # a literal "$" directly before an interpolation would be parsed as an escape.
    values = [format("user:team$%s", each.key)]
  }

  # Early warning at 80% of the monthly budget.
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["ml-platform-oncall@example.com"]
  }

  # Hard stop at 100%: publish to SNS so automation can act.
  notification {
    comparison_operator       = "GREATER_THAN"
    threshold                 = 100
    threshold_type            = "PERCENTAGE"
    notification_type         = "ACTUAL"
    subscriber_sns_topic_arns = [aws_sns_topic.budget_hard_stop.arn]
  }
}

# SNS Lambda patches the team Kubernetes ResourceQuota to 0 GPU when over budget
resource "aws_sns_topic" "budget_hard_stop" {
  name = "ml-budget-hard-stop"
}
Production pitfall — untagged managed training jobs: Managed training jobs on SageMaker or Vertex AI (Spot/preemptible or not) run on service-owned instances, so tags you apply to your own nodes never reach them. If your chargeback system relies purely on resource tags and the jobs themselves are submitted without team tags, you will systematically under-attribute managed training costs. Tag every job at submission, and cross-reference the billing data's UsageType dimension (which embeds the instance type, e.g., ml.p3.2xlarge) with the SageMaker job metadata to map any remaining costs back to teams.

Model Governance

Governance is the organizational contract around a model: what data was it trained on, who approved it for production, what fairness and bias checks did it pass, and what is the rollback plan if it causes harm? These are not bureaucratic formalities. In an expanding set of jurisdictions they are legal or regulatory expectations: the EU AI Act imposes documentation and record-keeping obligations on high-risk systems, the NIST AI RMF (a voluntary framework) recommends comparable documentation and auditability practices, and US Executive Order 14110 set similar expectations before it was rescinded in January 2025.

A model registry is the technical anchor for governance. Every model artifact that reaches staging or production should be registered with at minimum: the training dataset lineage hash (a content-addressed pointer to the exact data version), the experiment run ID from your tracking tool, the performance metrics from the evaluation suite, the approval status (pending / approved / rejected), the name of the approving engineer or committee, and the timestamp. MLflow Model Registry and SageMaker Model Registry can both carry these fields, either as built-in attributes (such as SageMaker's model approval status) or as tags and custom metadata; you wire them into your CI/CD pipeline via an approval gate that blocks promotion to production unless status == "Approved".
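The approval gate can be sketched as a pure function over a registry entry, independent of which registry stores it. The field names and the promotion_blockers function below are assumptions for illustration, not the schema of MLflow or SageMaker.

```python
# Hypothetical field names for the minimum registry record described above.
REQUIRED_FIELDS = (
    "dataset_lineage_hash",
    "run_id",
    "metrics",
    "approval_status",
    "approved_by",
    "approved_at",
)


def promotion_blockers(entry: dict) -> list[str]:
    """Return the reasons a registry entry may not be promoted to production.
    An empty list means the gate passes."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    status = entry.get("approval_status")
    if status != "Approved":
        problems.append(f"approval_status is {status!r}, expected 'Approved'")
    return problems
```

Returning reasons rather than a bare boolean lets the CI job print exactly what is missing, which shortens the loop between a blocked promotion and a fixed registry entry.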

Fairness and bias checks are a CI gate, not a post-deployment review. Run them in your model CI pipeline using tools like Fairlearn (demographic parity, equalized odds) or Aequitas, and record the results with Google's Model Card Toolkit. If the model's performance or outcome rate for a subgroup defined by a protected attribute (age, gender, race, geography) diverges beyond an agreed disparity threshold, the CI run fails and the model cannot be promoted. Define these thresholds explicitly in a model_governance.yaml configuration file that lives in the model's Git repository alongside the training code, which makes the policy reviewable and version-controlled.
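To show what such a gate computes, here is a dependency-free sketch of the demographic parity difference: the gap between the highest and lowest positive-prediction rate across groups. In practice you would more likely call Fairlearn; the function names here are illustrative, and the threshold is whatever your governance file specifies.

```python
from collections import defaultdict


def selection_rates(predictions: list[int], groups: list[str]) -> dict[str, float]:
    """Fraction of positive (1) predictions per group."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # [positives, total]
    for pred, group in zip(predictions, groups):
        counts[group][0] += int(pred == 1)
        counts[group][1] += 1
    return {group: pos / total for group, (pos, total) in counts.items()}


def demographic_parity_difference(predictions: list[int], groups: list[str]) -> float:
    """Largest group selection rate minus the smallest."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())


def fairness_gate(predictions: list[int], groups: list[str], threshold: float) -> bool:
    """True when the disparity is within the agreed threshold (CI passes)."""
    return demographic_parity_difference(predictions, groups) <= threshold
```

With predictions [1, 0, 1, 0] for one group and [1, 1, 1, 0] for another, the selection rates are 0.50 and 0.75, so a 0.05 threshold fails the gate.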

# model_governance.yaml — committed alongside training code in the model repo.
# Evaluated by the CI fairness gate before any production promotion.
model_name: credit-risk-v4
owner_team: financial-ml
compliance_tags: [eu-ai-act-high-risk, gdpr-article-22]

fairness_constraints:
  protected_attributes: [age_group, gender, country_region]
  metrics:
    demographic_parity_difference:
      threshold: 0.05 # max allowed gap across groups
      fail_on_breach: true
    equalized_odds_difference:
      threshold: 0.08
      fail_on_breach: true

performance_floor:
  auc_roc: 0.82 # model must not degrade below this globally
  precision_at_k: 0.75

data_lineage:
  training_dataset_sha256: "a3f8c21..."
  feature_store_version: "v2.7.3"
  training_cutoff_date: "2025-01-01"

approval_required:
  staging: ["ml-lead"]
  production: ["ml-lead", "risk-compliance"]
  channels: ["#ml-approvals"]
Key idea — model cards as living documents: A model card is a one-page structured summary of a model's intended use, performance across subgroups, limitations, and ethical considerations. Google popularized them in 2019; they are now the de facto standard artifact for high-risk model governance. Automate model card generation from your registry metadata and evaluation results so it is always current — a stale model card is worse than none because it creates false confidence.
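Automating the card can be as simple as rendering registry metadata and subgroup evaluation results into text on every promotion. The sketch below is an assumption-laden illustration: the entry keys, the plain-text layout, and the render_model_card name are hypothetical and are not the Model Card Toolkit's schema.

```python
def render_model_card(entry: dict, subgroup_metrics: dict[str, dict[str, float]]) -> str:
    """Render a plain-text model card from registry metadata and per-subgroup
    evaluation metrics, so the card is regenerated rather than hand-edited."""
    lines = [
        f"Model card: {entry['model_name']} v{entry['version']}",
        f"Intended use: {entry['intended_use']}",
        f"Training data: sha256 {entry['training_dataset_sha256']}",
        f"Approved by: {', '.join(entry['approved_by'])}",
        "Performance by subgroup:",
    ]
    for group in sorted(subgroup_metrics):
        metrics = subgroup_metrics[group]
        rendered = ", ".join(f"{name}={value:.3f}" for name, value in sorted(metrics.items()))
        lines.append(f"  {group}: {rendered}")
    limitations = entry.get("limitations") or ["none recorded"]
    lines.append("Limitations: " + "; ".join(limitations))
    return "\n".join(lines)
```

Because the card is a pure function of registry and evaluation data, regenerating it in CI is cheap, and a diff between two versions of the card doubles as a change summary for reviewers.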

Responsible Rollout

Even a well-governed model that passed every offline evaluation can cause harm in production if deployed carelessly. Responsible rollout means treating a model promotion with the same rigour as a service deployment: canary stages, business-metric SLOs, automated rollback, and shadow mode validation.

Shadow mode runs the new model in parallel with the current production model, logging both outputs without serving the new model to users. This is the safest validation step: you can compare output distributions, latency profiles, and compute costs at real traffic volume with zero user impact. Shadow mode is mandatory for any model with a direct user-facing effect at significant scale. Implement it by routing a copy of every inference request to a shadow replica and storing both responses in a comparison log table.
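The shadow pattern can be sketched as a thin router around two model callables. This is a simplified, synchronous illustration: ShadowRouter and its in-memory comparison log are hypothetical, and a real deployment would usually mirror traffic asynchronously so the shadow model adds no latency to the user path.

```python
from typing import Any, Callable


class ShadowRouter:
    """Serve the production model; run the shadow model on the same request
    and log both outputs. The shadow result is never returned to the caller."""

    def __init__(self, production: Callable[[Any], Any], shadow: Callable[[Any], Any]):
        self.production = production
        self.shadow = shadow
        self.comparison_log: list[dict] = []  # stand-in for a comparison table

    def predict(self, request: Any) -> Any:
        prod_out = self.production(request)
        try:
            shadow_out, shadow_error = self.shadow(request), None
        except Exception as exc:  # a broken shadow must never affect users
            shadow_out, shadow_error = None, repr(exc)
        self.comparison_log.append({
            "request": request,
            "production": prod_out,
            "shadow": shadow_out,
            "shadow_error": shadow_error,
            "match": shadow_error is None and shadow_out == prod_out,
        })
        return prod_out
```

The share of rows where match is false, broken down by request segment, is usually the first thing to inspect before moving on to a canary.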

Canary rollout exposes the new model to a small percentage of live traffic (typically 1% → 5% → 20% → 100% in a staged progression) while monitoring business metrics — not just technical metrics. For a recommendation model, the business SLO might be: click-through rate must not drop more than 0.5% relative, and add-to-cart conversion must not drop more than 1% relative, measured over a minimum 24-hour window at each stage. If any SLO breaches, the rollout pauses and the platform pages on-call. This progression is encoded as an Argo Rollout or a Flagger canary object in Kubernetes — the same progressive delivery tooling you use for application services, applied to model inference deployments.
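The gate logic for the recommendation example above fits in a few lines: compute each metric's relative change between canary and stable, and advance only if none falls below its floor. The stage list and SLO values are taken from the paragraph; the function names are assumptions, and the 24-hour measurement window is assumed to be handled by whatever produces the metrics.

```python
STAGES = [1, 5, 20, 100]  # percent of traffic, per the progression above

# Maximum tolerated relative drop per business metric.
BUSINESS_SLOS = {
    "click_through_rate": -0.005,      # no more than 0.5% relative drop
    "add_to_cart_conversion": -0.01,   # no more than 1% relative drop
}


def relative_delta(canary: float, stable: float) -> float:
    """Relative change of the canary metric against the stable baseline."""
    return canary / stable - 1.0


def breached_slos(canary: dict[str, float], stable: dict[str, float]) -> list[str]:
    """Names of the business SLOs whose relative drop exceeds the floor."""
    return [
        name for name, floor in BUSINESS_SLOS.items()
        if relative_delta(canary[name], stable[name]) < floor
    ]


def next_weight(current: int, canary: dict[str, float], stable: dict[str, float]):
    """Return the next traffic weight, or None to pause and page on-call."""
    if breached_slos(canary, stable):
        return None
    idx = STAGES.index(current)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

Returning None instead of rolling back automatically mirrors the text: a breach pauses the rollout and brings a human into the loop.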

# kubernetes/rollouts/fraud-model-canary.yaml
# Argo Rollouts canary for a model inference Deployment.
# Promotes through 5% → 20% → 50% → 100% with automated analysis gates.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: fraud-model-v4
  namespace: ml-inference
spec:
  replicas: 20
  selector:
    matchLabels:
      app: fraud-model
  template:
    metadata:
      labels:
        app: fraud-model
    spec:
      containers:
        - name: model-server
          image: registry.example.com/fraud-model:v4.2.1
          resources:
            limits:
              nvidia.com/gpu: "1"
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: {duration: 1h}
        - analysis:
            templates:
              - templateName: model-business-slo
        - setWeight: 20
        - pause: {duration: 2h}
        - analysis:
            templates:
              - templateName: model-business-slo
        - setWeight: 50
        - pause: {duration: 4h}
        - setWeight: 100
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: model-business-slo
  namespace: ml-inference
spec:
  metrics:
    - name: click-through-rate-delta
      interval: 10m
      # Assumed value: an inline analysis step needs a finite count to complete;
      # with an interval and no count the metric would be measured indefinitely.
      count: 3
      successCondition: "result[0] >= -0.005" # no more than 0.5% relative CTR drop
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          # Relative CTR change: (canary CTR / stable CTR) - 1
          query: |
            (
              sum(rate(ml_prediction_accepted_total{variant="canary"}[10m]))
              /
              sum(rate(ml_prediction_served_total{variant="canary"}[10m]))
            )
            /
            (
              sum(rate(ml_prediction_accepted_total{variant="stable"}[10m]))
              /
              sum(rate(ml_prediction_served_total{variant="stable"}[10m]))
            )
            - 1
Pro practice — separate business SLOs from technical SLOs for canary gates: Technical SLOs (latency p99, error rate) should be a pre-condition for starting the canary at all — if the new model cannot serve requests cleanly, do not expose it to users. Business SLOs (conversion rate, engagement, fraud catch rate) are the canary gate conditions: these detect the subtler model quality regressions that pass all technical checks. Mixing the two in a single analysis template makes it hard to diagnose which category failed and slows remediation.

Governance at the LLM Scale

Large language models introduce governance challenges that do not exist for classical ML. The primary concerns are prompt injection and jailbreak (input governance), output filtering (response governance), data residency (legal compliance for user inputs sent to a model API), and cost attribution for token spend.

At the infrastructure layer, route every LLM call through a gateway (examples: Portkey, the LiteLLM proxy, or a thin proxy combined with Amazon Bedrock Guardrails) that enforces per-team token budgets, content filtering rules, PII detection and redaction in both request and response, and an immutable audit log of every inference call. Token spend should feed back into your chargeback engine: input_tokens * $X + output_tokens * $Y per API call, attributed via the team tag in the request header. Large AI platform teams reportedly make this gateway pattern mandatory, so that no team calls the model API directly.
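The token cost formula and per-team budget check can be sketched as a small ledger that a gateway consults before forwarding a call and updates afterwards. The TokenLedger class and the prices passed to it are hypothetical; real per-token prices depend on the provider and model.

```python
from collections import defaultdict


class TokenLedger:
    """Tracks per-team LLM spend as input_tokens * X + output_tokens * Y."""

    def __init__(self, budgets_usd: dict[str, float], price_in: float, price_out: float):
        self.budgets_usd = budgets_usd
        self.price_in = price_in      # USD per input token (X)
        self.price_out = price_out    # USD per output token (Y)
        self.spent_usd: dict[str, float] = defaultdict(float)

    def call_cost(self, input_tokens: int, output_tokens: int) -> float:
        return input_tokens * self.price_in + output_tokens * self.price_out

    def allow(self, team: str) -> bool:
        """Checked before forwarding: teams without a budget are denied."""
        return self.spent_usd[team] < self.budgets_usd.get(team, 0.0)

    def record(self, team: str, input_tokens: int, output_tokens: int) -> float:
        """Called after the response, once the output token count is known."""
        cost = self.call_cost(input_tokens, output_tokens)
        self.spent_usd[team] += cost
        return cost
```

The check happens before the call and the charge after it because the output token count is only known once the response arrives, so a team can overshoot its budget by at most one call.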

Model watermarking and output provenance are emerging governance requirements. Watermarking embeds a detectable statistical signal in generated content, while provenance signing cryptographically attests that an output came from a specific model version. The main drivers are the EU AI Act's transparency obligations for AI-generated content and the direction set by NIST's AI RMF. Tools like SynthID (Google) and open-source watermarking libraries for HuggingFace are the current state of practice, though this space is evolving rapidly.

Key idea — governance debt compounds: Every model deployed without a registry entry, lineage record, and approval trail is governance debt. Unlike technical debt, governance debt does not just slow you down. It can result in regulatory fines, reputational damage from a discriminatory model surfacing publicly, or an inability to answer a data subject access request (GDPR Article 15) or to explain an automated decision (GDPR Article 22). Build the governance pipeline before you scale the model count, not after.