MLOps & DevOps for AI Systems

Project: An ML Platform Design

18 min Lesson 10 of 28

The nine preceding lessons equipped you with every major building block: feature pipelines with training-serving skew detection, experiment tracking and model registries, GPU cluster scheduling, CI/CD gates for model promotion, canary and shadow deployment patterns, multi-dimensional production monitoring, LLM-specific operations, and cost/governance controls. This project lesson asks you to assemble those blocks into a coherent, opinionated architecture for a mid-sized team (twelve ML engineers, six MLOps engineers, and forty data scientists) shipping thirty active models into production, with a mandate to grow to one hundred models within eighteen months. That is a realistic constraint set for a Series-C company, a large-enterprise ML team, or a platform team at a mid-tier tech company.

Designing an ML platform is not a tooling decision — it is an organizational and operational contract. Every architectural choice creates a set of abstractions that either accelerates or blocks the teams building on top of you. The best internal ML platforms at Airbnb, Lyft, Spotify, and DoorDash share a common trait: they standardize the boring parts (data access, compute, experiment tracking, deployment) so data scientists can focus entirely on model quality. We will design with that principle as the north star.

Platform Decomposition: Six Planes

A production ML platform decomposes cleanly into six functional planes. Each plane has a primary owner, a set of contracts it exposes to downstream consumers, and a set of SLOs it must meet. Designing plane-by-plane, then wiring the contracts together, prevents the most common failure mode of ML platform projects: building a monolithic toolkit that no team can operate independently when it breaks.

Data Plane: Raw data ingestion, feature computation, the feature store, and point-in-time correct dataset generation. SLO: online features served with p99 latency under 10 ms; training dataset generation completes within four hours of request. Owner: data engineering team jointly with MLOps.
Experiment Plane: Notebook environments, experiment tracking (MLflow or Weights & Biases), hyperparameter sweep orchestration (Optuna, Ray Tune), and the model registry. SLO: 99.9% of experiment metadata writes succeed; any registered model artifact is retrievable within thirty seconds. Owner: MLOps platform team.
Training Plane: GPU/TPU cluster, job scheduler (Kubeflow Pipelines or Argo Workflows), job queue, spot-instance preemption handling, distributed training coordination (PyTorch DDP or Horovod). SLO: training job launch latency under two minutes; preempted jobs resume within ten minutes. Owner: infra/MLOps joint SRE team.
Deployment Plane: Model serving runtime (Triton, TorchServe, or vLLM for LLMs), the promotion pipeline (staging → canary → production), traffic-split controller, rollback automation. SLO: production deployment completes within thirty minutes of promotion approval; rollback triggers within two minutes of SLO breach. Owner: MLOps deployment team.
Monitoring Plane: Data drift detection, prediction drift, business-metric correlation, retraining triggers, alerting integration with PagerDuty/Opsgenie. SLO: drift reports published within fifteen minutes of inference window close; critical-drift alert fires within five minutes. Owner: MLOps observability team.
Governance Plane: Cost attribution, quota enforcement, audit logging, model cards, explainability artifacts (SHAP values stored per model version), data lineage graph. SLO: cost reports available within one hour of end-of-day; audit log retention seven years (financial-services requirement). Owner: ML platform + security jointly.

Key design principle: Each plane exposes its contract as a versioned API, not a shared library that every team forks. When the feature store schema changes, it publishes a v2 API alongside v1; downstream consumers migrate on their own schedule, and v1 is retired only once nothing depends on it. This is the parallel-versioning approach you know from microservices, close in spirit to a strangler-fig migration, applied to internal platform surfaces.
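One way to picture a versioned plane contract is a small registry that serves several schema versions side by side, so that publishing v2 never breaks a v1 consumer. The sketch below is illustrative only: `FeatureContract`, `ContractRegistry`, the feature names, and the dtypes are all hypothetical and do not correspond to any particular feature store's API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureContract:
    """One published version of a plane's contract (hypothetical shape)."""
    version: str
    features: dict[str, str]  # feature name -> dtype


@dataclass
class ContractRegistry:
    """Keeps every published version so consumers can migrate on their own schedule."""
    _versions: dict[str, FeatureContract] = field(default_factory=dict)

    def publish(self, contract: FeatureContract) -> None:
        # Published versions are immutable: a schema change means a new version.
        if contract.version in self._versions:
            raise ValueError(f"{contract.version} is already published")
        self._versions[contract.version] = contract

    def get(self, version: str) -> FeatureContract:
        return self._versions[version]

    def breaking_changes(self, old: str, new: str) -> list[str]:
        """Features a consumer of `old` would lose or see retyped in `new`."""
        old_f, new_f = self.get(old).features, self.get(new).features
        return sorted(name for name, dtype in old_f.items() if new_f.get(name) != dtype)


registry = ContractRegistry()
registry.publish(FeatureContract("v1", {"user_age": "int64", "txn_count_7d": "int64"}))
registry.publish(FeatureContract(
    "v2", {"user_age": "int64", "txn_count_7d": "float64", "txn_amount_30d": "float64"}
))

# v1 stays readable after v2 ships; the diff tells consumers what to fix first.
print(registry.get("v1").features)
print(registry.breaking_changes("v1", "v2"))
```

Added features are not reported as breaking, since an existing consumer can ignore them; only removed or retyped features force a migration.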

Reference Architecture Diagram

The diagram below shows the complete platform with all six planes and the critical data flows between them. Read it left-to-right for the training path and bottom-to-top for the serving path. The governance plane sits as a cross-cutting concern across all others.

ML Platform reference architecture: six planes with the governance layer as a cross-cutting concern. The dashed lines show the drift-triggered retraining loop and the online feature path to serving.

Infrastructure-as-Code Skeleton

The platform lives on Kubernetes. The pattern that scales from thirty to one hundred models is a shared cluster with namespace-per-team isolation, not a cluster-per-model. Each team gets a Kubernetes namespace with ResourceQuotas for GPU, CPU, and memory. A shared node pool of spot A100 instances handles training jobs; an on-demand node pool of g5.2xlarge instances handles serving. Terraform manages the cluster and node pools; Helm charts or Kustomize layers manage the platform components.

# terraform/modules/ml-platform/main.tf  — EKS cluster with two node pools

module "eks" {
  source          = "terraform-aws-modules/eks/aws"
  version         = "~> 20.0"
  cluster_name    = "ml-platform-prod"
  cluster_version = "1.30"

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnets

  eks_managed_node_groups = {

    # Training: spot A100 instances — preemptible, cheaper
    training = {
      instance_types  = ["p4d.24xlarge"]
      capacity_type   = "SPOT"
      min_size        = 0
      max_size        = 20
      desired_size    = 2
      labels = { role = "training" }
      taints = [{ key = "nvidia.com/gpu", value = "true", effect = "NO_SCHEDULE" }]
    }

    # Serving: on-demand GPU for low-latency inference
    serving = {
      instance_types  = ["g5.2xlarge"]
      capacity_type   = "ON_DEMAND"
      min_size        = 2
      max_size        = 40
      desired_size    = 4
      labels = { role = "serving" }
      taints = [{ key = "nvidia.com/gpu", value = "true", effect = "NO_SCHEDULE" }]
    }

    # CPU workers: feature pipelines, drift jobs, orchestration
    cpu-workers = {
      instance_types = ["m6i.4xlarge"]
      capacity_type  = "ON_DEMAND"
      min_size       = 3
      max_size       = 30
      desired_size   = 6
    }
  }
}

# Per-team namespace with GPU quota  (repeat for each team)
resource "kubernetes_namespace" "team_recommendations" {
  metadata { name = "team-recommendations" }
}

resource "kubernetes_resource_quota" "team_recommendations" {
  metadata {
    name      = "gpu-quota"
    namespace = kubernetes_namespace.team_recommendations.metadata[0].name
  }
  spec {
    hard = {
      # Extended resources such as GPUs accept only the "requests." prefix in a quota.
      "requests.nvidia.com/gpu" = "8"
      "requests.cpu"            = "128"
      "requests.memory"         = "512Gi"
    }
  }
}
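
A sanity check worth running whenever a team quota changes is whether the GPU quotas across namespaces add up to more than the training pool can ever supply. The sketch below assumes 8 GPUs per p4d.24xlarge node and the `max_size = 20` of the training node group; every team and quota other than `team-recommendations` is made up for illustration.

```python
GPUS_PER_P4D_NODE = 8     # a p4d.24xlarge carries 8 A100 GPUs
TRAINING_MAX_NODES = 20   # max_size of the training node group

# requests.nvidia.com/gpu per namespace; only the first entry comes from the lesson.
team_gpu_quotas = {
    "team-recommendations": 8,
    "team-fraud": 16,   # hypothetical
    "team-search": 8,   # hypothetical
}


def quota_report(quotas: dict[str, int], max_nodes: int, gpus_per_node: int) -> dict:
    """Compare the sum of per-team GPU quotas with the pool's ceiling."""
    capacity = max_nodes * gpus_per_node
    committed = sum(quotas.values())
    return {
        "capacity": capacity,
        "committed": committed,
        "headroom": capacity - committed,
        "oversubscribed": committed > capacity,
    }


report = quota_report(team_gpu_quotas, TRAINING_MAX_NODES, GPUS_PER_P4D_NODE)
print(report)
```

A namespace quota counts serving and training pods alike, so if teams serve from the same namespace the comparison should probably use the combined ceiling of both GPU node groups. Deliberate oversubscription can also be a reasonable policy when teams rarely peak together; the point is to make that choice visible.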

The Promotion Pipeline as Code

The promotion pipeline is the contractual boundary between experimentation and production. It must be triggered from the model registry (not from a human opening a PR), it must be fully automated up to the canary gate, and it must be blocked by any of the following: evaluation metrics below threshold, data drift score above 0.2, schema mismatch between registry artifact and serving contract, or missing model card. Encode all of this as a GitHub Actions workflow that fires when a model is moved to Staging in the registry.

# .github/workflows/model-promote.yml
# Triggered when MLflow registry webhook fires on transition to Staging

name: Model Promotion Gate

on:
  repository_dispatch:
    types: [mlflow-model-staging]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install evaluation deps
        run: pip install mlflow great_expectations evidently pandas

      - name: Pull candidate model metadata
        id: meta
        env:
          # Pass payload fields through env rather than inlining them into the script.
          MODEL_NAME: ${{ github.event.client_payload.model_name }}
          MODEL_VERSION: ${{ github.event.client_payload.version }}
        run: |
          python scripts/fetch_model_meta.py \
            --model-name "$MODEL_NAME" \
            --version "$MODEL_VERSION" \
            --out model_meta.json
          echo "accuracy=$(jq -r .metrics.accuracy model_meta.json)" >> "$GITHUB_OUTPUT"
          echo "f1=$(jq -r .metrics.f1 model_meta.json)" >> "$GITHUB_OUTPUT"

      - name: Fail if accuracy below threshold
        env:
          ACCURACY: ${{ steps.meta.outputs.accuracy }}
        run: |
          python -c "
          import os, sys
          acc = float(os.environ['ACCURACY'])
          if acc < 0.92:
              print(f'FAIL: accuracy {acc} < 0.92')
              sys.exit(1)
          print(f'OK: accuracy {acc}')
          "

      - name: Check feature schema against serving contract
        run: |
          python scripts/schema_check.py \
            --candidate model_meta.json \
            --contract contracts/fraud_model_v3.yaml

      - name: Run drift gate against last-7-day production data
        env:
          MODEL_NAME: ${{ github.event.client_payload.model_name }}
        run: |
          python scripts/promotion_drift_check.py \
            --model-name "$MODEL_NAME"

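      - name: Promote to Production (canary 5%)
        if: success()
        env:
          MODEL_NAME: ${{ github.event.client_payload.model_name }}
          MODEL_VERSION: ${{ github.event.client_payload.version }}
        run: |
          # The mlflow CLI has no stage-transition command, so call the Python client.
          python -c "
          import os
          from mlflow.tracking import MlflowClient
          MlflowClient().transition_model_version_stage(
              os.environ['MODEL_NAME'], os.environ['MODEL_VERSION'], 'Production')
          "
          kubectl apply -f k8s/canary/fraud-canary-5pct.yaml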
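
The four blocking conditions listed before the workflow (metrics below threshold, drift score above 0.2, schema mismatch, missing model card) can be collected into a single gate function. The following is a minimal sketch under stated assumptions: the `Candidate` fields and the dictionary-based schema comparison are hypothetical, and a real gate would read these values from the registry and the contract file rather than from literals.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Inputs to the gate; field names are illustrative, not an MLflow schema."""
    accuracy: float
    drift_score: float
    feature_schema: dict[str, str]  # feature name -> dtype
    has_model_card: bool


def gate_failures(
    candidate: Candidate,
    serving_contract: dict[str, str],
    min_accuracy: float = 0.92,
    max_drift: float = 0.2,
) -> list[str]:
    """Return every reason the candidate is blocked; an empty list means promote."""
    failures = []
    if candidate.accuracy < min_accuracy:
        failures.append(f"accuracy {candidate.accuracy} < {min_accuracy}")
    if candidate.drift_score > max_drift:
        failures.append(f"drift score {candidate.drift_score} > {max_drift}")
    if candidate.feature_schema != serving_contract:
        failures.append("feature schema does not match serving contract")
    if not candidate.has_model_card:
        failures.append("model card missing")
    return failures


contract = {"amount": "float64", "merchant_id": "int64"}  # made-up contract
ok = Candidate(0.95, 0.08, dict(contract), True)
bad = Candidate(0.90, 0.31, {"amount": "float64"}, False)

print(gate_failures(ok, contract))
for reason in gate_failures(bad, contract):
    print("BLOCKED:", reason)
```

Collecting all failures instead of stopping at the first one is a deliberate choice here: the model owner sees the full list in one run rather than fixing issues one promotion attempt at a time.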

Day-Two Operations: Failure Modes to Design For

A platform design is only as good as its failure modes. The ones that kill ML platforms at scale are not the obvious hardware failures — Kubernetes handles those. They are the subtle coupling failures that the platform architect must explicitly design against:

Feature store cold-start on rollback: You roll back a model from v4 to v3. But v3 was trained on features that no longer exist in the feature store because the feature pipeline was also upgraded. Solution: the model registry stores a feature_spec.yaml alongside every artifact, and the feature store enforces that every named feature has a retention policy of at least ninety days. Rollback CI checks feature availability before flipping traffic.
Experiment tracker as a single point of failure: If the MLflow tracking server goes down, training jobs fail or, worse, succeed silently without logging. Solution: run several stateless MLflow tracking-server replicas behind a load balancer, backed by a PostgreSQL primary with a hot standby for failover, and add a circuit breaker in the training job wrapper that degrades gracefully to local SQLite logging if the tracking server is unreachable, then syncs on job completion.
GPU quota starvation: One team submits a sweep of five hundred hyperparameter trials on spot instances and exhausts spot capacity. If serving shares that spot pool, serving nodes are evicted when AWS reclaims instances. Solution: hard separation of training and serving node groups. Serving runs on on-demand instances only; training runs on spot with interruption handlers that checkpoint and requeue. Training jobs have priorityClassName: batch-low, so the scheduler never preempts serving pods to make room for them.
Stale model card in production: A regulatory audit asks for the training data provenance and fairness evaluation for a model that went to production fourteen months ago. The model card was filed but references an S3 path that was cleaned up. Solution: model cards are immutable objects stored in the registry itself (not object storage), with SHA hashes of the full training and evaluation datasets, and deletion is blocked while any production deployment references that model version. A Kubernetes admission webhook can enforce this if model versions are represented as cluster objects; otherwise the guard belongs in the registry's delete path.
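
The circuit breaker described for the experiment tracker can be sketched with the standard library alone. This is an illustration under stated assumptions, not a drop-in component: `FallbackMetricLogger` and `flaky_remote` are hypothetical names, the remote sink is any callable (in practice it might wrap `mlflow.log_metric`), and the failure threshold of three is arbitrary.

```python
import sqlite3
from typing import Callable

MetricSink = Callable[[str, float, int], None]  # (key, value, step)


class FallbackMetricLogger:
    """Send metrics to a remote sink; buffer in SQLite while the sink is down."""

    def __init__(self, remote: MetricSink, db_path: str = ":memory:", max_failures: int = 3):
        self.remote = remote
        self.max_failures = max_failures
        self.failures = 0  # consecutive remote failures
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS buffered (key TEXT, value REAL, step INTEGER)"
        )

    def log_metric(self, key: str, value: float, step: int) -> None:
        # Once the circuit is open, skip the remote so training is not slowed by timeouts.
        if self.failures < self.max_failures:
            try:
                self.remote(key, value, step)
                self.failures = 0
                return
            except Exception:
                self.failures += 1
        self.db.execute("INSERT INTO buffered VALUES (?, ?, ?)", (key, value, step))
        self.db.commit()

    def sync(self) -> int:
        """Replay buffered rows to the remote at job completion; returns rows sent."""
        rows = self.db.execute(
            "SELECT rowid, key, value, step FROM buffered ORDER BY rowid"
        ).fetchall()
        sent = 0
        for rowid, key, value, step in rows:
            self.remote(key, value, step)  # raises if the server is still down
            self.db.execute("DELETE FROM buffered WHERE rowid = ?", (rowid,))
            self.db.commit()  # commit per row so a crash cannot resend it
            sent += 1
        self.failures = 0
        return sent


# Demo with a fake tracking server that starts out unreachable.
received = []
server = {"up": False}


def flaky_remote(key: str, value: float, step: int) -> None:
    if not server["up"]:
        raise ConnectionError("tracking server unreachable")
    received.append((key, value, step))


logger = FallbackMetricLogger(flaky_remote)
for step in range(5):
    logger.log_metric("loss", 1.0 / (step + 1), step)

server["up"] = True
print("synced rows:", logger.sync())
```

In a real job the SQLite file would live on a persistent volume rather than in memory, so that buffered metrics survive a spot preemption.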

Pro practice — the platform contract document: Before writing a single line of Terraform, write a one-page platform contract for each plane: what it promises, what it does not promise, and what teams must do themselves. Distribute it to every ML engineer who will use the platform. The most expensive ML platform failures happen when data scientists assume the platform handles something (like point-in-time correctness in dataset generation) that the platform team silently skipped. The contract document makes implicit assumptions explicit.

Production pitfall — the shadow mode gap: Shadow mode testing routes live traffic to the new model without affecting user-facing outputs. Teams frequently skip it to meet a deadline, then discover in canary that the model has a 40× latency regression on a rare input shape that never appeared in offline evaluation. At big-tech companies, shadow mode is non-negotiable for any model with p99 latency SLOs. Build it into your promotion pipeline as a mandatory twenty-four-hour stage that cannot be bypassed without an explicit incident-commander override.
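
A shadow stage is only useful if something compares the two latency distributions at the tail rather than at the mean. The sketch below shows one possible gate; the nearest-rank percentile, the 1.2x tolerance, and the synthetic latencies are assumptions chosen for illustration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; enough for a gate, not a statistics library."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def shadow_latency_gate(
    prod_ms: list[float],
    shadow_ms: list[float],
    max_ratio: float = 1.2,  # assumed tolerance, tune per SLO
    pct: float = 99.0,
) -> dict:
    """Pass only if the shadow model's tail latency stays within max_ratio of production."""
    prod_p = percentile(prod_ms, pct)
    shadow_p = percentile(shadow_ms, pct)
    ratio = shadow_p / prod_p
    return {"prod": prod_p, "shadow": shadow_p, "ratio": ratio, "passed": ratio <= max_ratio}


# Synthetic data: 2% of shadow requests hit a rare, slow input shape.
prod = [10.0] * 100
shadow = [10.0] * 98 + [400.0] * 2

result = shadow_latency_gate(prod, shadow)
print(result)
print("shadow mean:", sum(shadow) / len(shadow))
```

In this synthetic case the shadow mean rises only from 10 ms to 17.8 ms while the p99 ratio is 40, which is why a mean-based comparison would let the regression through.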

Scaling from 30 to 100 Models

The thirty-model platform you design today will stress in predictable ways as it grows. Three platform changes matter most at the hundred-model mark: (1) a model metadata catalog, meaning a searchable index of all models, their owners, their input/output schemas, their SLOs, and their cost per inference, because finding who owns a model becomes a real operational problem; (2) a self-service training workflow template that data scientists can parameterize without touching YAML, because the MLOps team becomes a bottleneck if every new model requires bespoke pipeline work; and (3) automated cost attribution by model and team exposed as a weekly digest, because GPU costs that were acceptable at thirty models are budget-threatening at one hundred. Design those three capabilities into the initial architecture even if you do not implement them on day one. Retrofitting them onto an existing platform is usually far more expensive than building the hooks for them upfront.
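
The cost-attribution hook can start as a simple roll-up of usage records keyed by team and model. The sketch below is a minimal illustration: `UsageRecord`, the model names, the GPU hours, and the hourly rates are all invented, and real rates would come from billing data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageRecord:
    team: str
    model: str
    gpu_hours: float
    hourly_rate_usd: float  # blended rate for the node type (assumed input)


def attribute_costs(records: list[UsageRecord]) -> tuple[dict, dict]:
    """Roll usage up to (team, model) pairs and to team totals."""
    by_model: dict = defaultdict(float)
    by_team: dict = defaultdict(float)
    for r in records:
        cost = r.gpu_hours * r.hourly_rate_usd
        by_model[(r.team, r.model)] += cost
        by_team[r.team] += cost
    return dict(by_model), dict(by_team)


# Made-up weekly usage; rates are placeholders, not real prices.
records = [
    UsageRecord("recommendations", "ranker-v7", 120.0, 4.0),  # training
    UsageRecord("recommendations", "ranker-v7", 30.0, 1.0),   # serving
    UsageRecord("fraud", "fraud-gbm", 50.0, 4.0),
]

by_model, by_team = attribute_costs(records)

# Weekly digest, most expensive team first.
for team, cost in sorted(by_team.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{team}: ${cost:,.2f}")
```

The same `(team, model)` key is a natural join point for the metadata catalog, so owner, SLO, and cost per inference can sit in one record.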