Project: An ML Platform Design
The nine preceding lessons equipped you with every major building block: feature pipelines with training-serving skew detection, experiment tracking and model registries, GPU cluster scheduling, CI/CD gates for model promotion, canary and shadow deployment patterns, multi-dimensional production monitoring, LLM-specific operations, and cost/governance controls. This final lesson asks you to assemble those blocks into a coherent, opinionated architecture for a mid-sized team — twelve ML engineers, six MLOps engineers, and forty data scientists — shipping thirty active models into production, with a mandate to grow to one hundred models within eighteen months. That is the realistic constraint set for a Series-C company, a large-enterprise ML team, or a platform team at a mid-tier tech company.
Designing an ML platform is not a tooling decision; it is an organizational and operational contract. Every architectural choice creates a set of abstractions that either accelerates or blocks the teams building on top of the platform. The best internal ML platforms at Airbnb, Lyft, Spotify, and DoorDash share a common trait: they standardize the boring parts (data access, compute, experiment tracking, deployment) so data scientists can focus on model quality. We will design with that principle as the north star.
Platform Decomposition: Six Planes
A production ML platform decomposes cleanly into six functional planes. Each plane has a primary owner, a set of contracts it exposes to downstream consumers, and a set of SLOs it must meet. Designing plane-by-plane, then wiring the contracts together, prevents the most common failure mode of ML platform projects: building a monolithic toolkit that no team can operate independently when it breaks.
- Data Plane: Raw data ingestion, feature computation, the feature store, and point-in-time correct dataset generation. SLO: features available at serving time with p99 latency under 10 ms for online features; training dataset generation completes within four hours of request. Owner: data engineering team jointly with MLOps.
- Experiment Plane: Notebook environments, experiment tracking (MLflow or Weights & Biases), hyperparameter sweep orchestration (Optuna, Ray Tune), and the model registry. SLO: experiment metadata writes succeed at 99.9%; any registered model artifact is retrievable within thirty seconds. Owner: MLOps platform team.
- Training Plane: GPU/TPU cluster, job scheduler (Kubeflow Pipelines or Argo Workflows), job queue, spot-instance preemption handling, distributed training coordination (PyTorch DDP / Horovod). SLO: training job launch latency under two minutes; preempted jobs resume within ten minutes. Owner: infra/MLOps joint SRE team.
- Deployment Plane: Model serving runtime (Triton, TorchServe, or vLLM for LLMs), the promotion pipeline (staging → canary → production), traffic-split controller, rollback automation. SLO: production deployment completes within thirty minutes of promotion approval; rollback triggers within two minutes of SLO breach. Owner: MLOps deployment team.
- Monitoring Plane: Data drift detection, prediction drift, business-metric correlation, retraining triggers, alerting integration with PagerDuty/Opsgenie. SLO: drift reports published within fifteen minutes of inference window close; critical-drift alert fires within five minutes. Owner: MLOps observability team.
- Governance Plane: Cost attribution, quota enforcement, audit logging, model cards, explainability artifacts (SHAP values stored per model version), data lineage graph. SLO: cost reports available within one hour of end-of-day; audit log retention seven years (financial-services requirement). Owner: ML platform + security jointly.
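One way to keep these SLOs honest is to hold them as data that dashboards and alerting can share. The sketch below is illustrative only: the metric names, the `Slo` class, and the `breached` helper are hypothetical, and a measurement that lands exactly on the target is treated as meeting it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    plane: str
    metric: str                      # hypothetical metric name, not from any tool
    target: float
    unit: str
    higher_is_better: bool = False   # True for success rates and retention

    def is_met(self, observed: float) -> bool:
        # Assumption: landing exactly on the target counts as meeting the SLO.
        if self.higher_is_better:
            return observed >= self.target
        return observed <= self.target


# Targets transcribed from the six-plane list above.
PLATFORM_SLOS = [
    Slo("data", "online_feature_p99_latency", 10, "ms"),
    Slo("data", "training_dataset_generation", 4, "hours"),
    Slo("experiment", "metadata_write_success", 99.9, "percent", True),
    Slo("experiment", "artifact_retrieval", 30, "seconds"),
    Slo("training", "job_launch_latency", 2, "minutes"),
    Slo("training", "preemption_resume", 10, "minutes"),
    Slo("deployment", "deploy_after_approval", 30, "minutes"),
    Slo("deployment", "rollback_trigger", 2, "minutes"),
    Slo("monitoring", "drift_report_delay", 15, "minutes"),
    Slo("monitoring", "critical_alert_delay", 5, "minutes"),
    Slo("governance", "cost_report_delay", 1, "hours"),
    Slo("governance", "audit_log_retention", 7, "years", True),
]


def breached(observations: dict) -> list:
    """Return the SLOs violated by observations keyed on (plane, metric)."""
    return [
        slo
        for slo in PLATFORM_SLOS
        if (slo.plane, slo.metric) in observations
        and not slo.is_met(observations[(slo.plane, slo.metric)])
    ]
```

Passing `{("data", "online_feature_p99_latency"): 14.0}` to `breached` would return the data-plane latency SLO; metrics with no observation are skipped rather than counted as failures.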
Reference Architecture Diagram
The diagram below shows the complete platform with all six planes and the critical data flows between them. Read it left-to-right for the training path and bottom-to-top for the serving path. The governance plane sits as a cross-cutting concern across all others.
Infrastructure-as-Code Skeleton
The platform lives on Kubernetes. The pattern that scales from thirty to one hundred models is a shared cluster with namespace-per-team isolation, not a cluster-per-model. Each team gets a Kubernetes namespace with ResourceQuotas for GPU, CPU, and memory. A shared node pool of spot A100 instances handles training jobs; an on-demand node pool of g5.2xlarge instances handles serving. Terraform manages the cluster and node pools; Helm charts or Kustomize layers manage the platform components.
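As a rough illustration of namespace-per-team isolation, the following sketch generates a Namespace and a ResourceQuota manifest for one team. The `ml-` namespace prefix, the `team-quota` name, the function name, and the example numbers are assumptions for illustration; in practice these manifests would more likely live in a Helm chart or Kustomize overlay.

```python
import json


def team_namespace_manifests(team: str, gpus: int, cpus: int, memory_gib: int) -> dict:
    """Build a Kubernetes List holding one Namespace and its ResourceQuota."""
    namespace = f"ml-{team}"  # hypothetical naming convention
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [
            {
                "apiVersion": "v1",
                "kind": "Namespace",
                "metadata": {"name": namespace, "labels": {"team": team}},
            },
            {
                "apiVersion": "v1",
                "kind": "ResourceQuota",
                "metadata": {"name": "team-quota", "namespace": namespace},
                "spec": {
                    "hard": {
                        # Extended resources such as GPUs are quota'd via the
                        # "requests." prefix.
                        "requests.nvidia.com/gpu": str(gpus),
                        "requests.cpu": str(cpus),
                        "requests.memory": f"{memory_gib}Gi",
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    # Example values are made up; size quotas from real team demand.
    print(json.dumps(team_namespace_manifests("search-ranking", 8, 64, 512), indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed output could be piped into `kubectl apply -f -` in a test cluster to try the quota out.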
The Promotion Pipeline as Code
The promotion pipeline is the contractual boundary between experimentation and production. It must be triggered from the model registry (not from a human opening a PR), it must be fully automated up to the canary gate, and it must be blocked by any of the following: evaluation metrics below threshold, data drift score above 0.2, schema mismatch between registry artifact and serving contract, or missing model card. Encode all of this as a GitHub Actions workflow that fires when a model is moved to Staging in the registry.
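The four blocking conditions can be expressed as a small gate script that the workflow runs as one of its steps. This is a minimal sketch, not the workflow itself: the `Candidate` fields and the `blocking_reasons` function are hypothetical, and it assumes every evaluation metric is higher-is-better.

```python
from dataclasses import dataclass
from typing import Optional

DRIFT_THRESHOLD = 0.2  # from the gate definition above


@dataclass
class Candidate:
    metrics: dict            # e.g. {"auc": 0.91}
    metric_thresholds: dict  # minimum acceptable value per metric
    drift_score: float
    artifact_schema: dict    # input/output schema recorded in the registry
    serving_schema: dict     # schema the serving contract expects
    model_card_uri: Optional[str]


def blocking_reasons(candidate: Candidate) -> list:
    """Return every reason the candidate must not proceed to canary."""
    reasons = []
    for name, floor in candidate.metric_thresholds.items():
        value = candidate.metrics.get(name)
        # A metric that was never reported blocks just like a low one.
        if value is None or value < floor:
            reasons.append(f"metric {name}={value} is below threshold {floor}")
    if candidate.drift_score > DRIFT_THRESHOLD:
        reasons.append(f"drift score {candidate.drift_score} exceeds {DRIFT_THRESHOLD}")
    if candidate.artifact_schema != candidate.serving_schema:
        reasons.append("registry artifact schema does not match serving contract")
    if not candidate.model_card_uri:
        reasons.append("model card is missing")
    return reasons
```

A CI step could call `blocking_reasons` and exit non-zero when the list is non-empty. Returning all reasons at once, rather than stopping at the first, saves the model owner a round trip per failure.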
Day-Two Operations: Failure Modes to Design For
A platform design is only as good as its failure modes. The ones that kill ML platforms at scale are not the obvious hardware failures — Kubernetes handles those. They are the subtle coupling failures that the platform architect must explicitly design against:
- Feature store cold-start on rollback: You roll back a model from v4 to v3, but v3 was trained on features that no longer exist in the feature store because the feature pipeline was also upgraded. Solution: the model registry stores a feature_spec.yaml alongside every artifact, and the feature store enforces that every named feature has a retention policy of at least ninety days. Rollback CI checks feature availability before flipping traffic.
- Experiment tracker as a single point of failure: If the MLflow tracking server goes down, training jobs fail or, worse, succeed silently without logging. Solution: run several stateless MLflow tracking server replicas behind a load balancer, backed by a PostgreSQL primary with a hot standby, and add a circuit breaker in the training job wrapper that degrades gracefully to local SQLite logging if the tracking server is unreachable, then syncs on job completion.
- GPU quota starvation: One team submits a sweep of five hundred hyperparameter trials on spot instances. Spot capacity is exhausted, and serving nodes are evicted when AWS reclaims the pool. Solution: hard separation of training and serving node groups. Training runs on its own spot pools with interruption handlers; serving runs on on-demand instances only. Training jobs have priorityClassName: batch-low, so the scheduler will not preempt serving pods to accommodate them.
- Stale model card in production: A regulatory audit asks for the training data provenance and fairness evaluation for a model that went to production fourteen months ago. The model card was filed but references an S3 path that was cleaned up. Solution: model cards are immutable objects stored in the registry itself (not object storage), with full training dataset SHA hashes and evaluation dataset hashes, and deletion is blocked by a Kubernetes admission webhook while any production deployment references that model version.
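The circuit breaker from the experiment-tracker failure mode can be sketched with the standard library alone. Everything here is an assumption for illustration: the `FallbackMetricLogger` class, the `pending` table, and the `send_remote` callable, which in a real job might wrap a call such as `mlflow.log_metric(key, value, step=step)`.

```python
import sqlite3
from typing import Callable


class FallbackMetricLogger:
    """Log to a remote tracker; buffer in local SQLite while it is unreachable."""

    def __init__(self, send_remote: Callable[[str, float, int], None],
                 db_path: str = "fallback_metrics.db", failure_limit: int = 3):
        self._send_remote = send_remote      # hypothetical remote logging call
        self._failure_limit = failure_limit
        self._consecutive_failures = 0
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS pending "
            "(id INTEGER PRIMARY KEY, key TEXT, value REAL, step INTEGER)"
        )

    @property
    def circuit_open(self) -> bool:
        # Once open, remote calls are skipped so training is not slowed by timeouts.
        return self._consecutive_failures >= self._failure_limit

    def log_metric(self, key: str, value: float, step: int) -> None:
        if not self.circuit_open:
            try:
                self._send_remote(key, value, step)
                self._consecutive_failures = 0
                return
            except Exception:  # broad on purpose: any tracker error falls back
                self._consecutive_failures += 1
        self._db.execute(
            "INSERT INTO pending (key, value, step) VALUES (?, ?, ?)",
            (key, value, step),
        )
        self._db.commit()

    def sync(self) -> int:
        """Replay buffered metrics in order; return how many were sent."""
        rows = self._db.execute(
            "SELECT id, key, value, step FROM pending ORDER BY id"
        ).fetchall()
        synced = 0
        for row_id, key, value, step in rows:
            try:
                self._send_remote(key, value, step)
            except Exception:
                break  # tracker still down; keep the remaining rows for later
            self._db.execute("DELETE FROM pending WHERE id = ?", (row_id,))
            synced += 1
        self._db.commit()
        if synced == len(rows):
            self._consecutive_failures = 0  # close the circuit again
        return synced
```

The training wrapper would call `sync()` on job completion. This sketch has no half-open state, so the circuit stays open until a full sync succeeds; a long-running job might want to retry the remote periodically instead.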
Scaling from 30 to 100 Models
The thirty-model platform you design today will stress in predictable ways as it grows. The platform changes that matter most at the hundred-model mark are: (1) a model metadata catalog — a searchable index of all models, their owners, their input/output schemas, their SLOs, and their cost per inference — because finding who owns a model becomes a real operational problem; (2) a self-service training workflow template that data scientists can parameterize without touching YAML, because the MLOps team becomes a bottleneck if every new model requires bespoke pipeline work; and (3) automated cost attribution by model and team exposed as a weekly digest, because GPU costs that were acceptable at thirty models are budget-threatening at one hundred. Design those three capabilities into the initial architecture even if you do not implement them on day one. Retrofitting them onto an existing platform is an order of magnitude more expensive than building the hooks for them upfront.
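To make the first and third hooks concrete, here is a minimal in-memory sketch of a catalog entry and a per-team cost rollup. The field names, the `ModelCatalog` class, and the idea of multiplying cost per inference by weekly volume are illustrative assumptions; a real catalog would sit on a database and pull costs from billing data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModelEntry:
    name: str
    owner_team: str
    input_schema: dict
    output_schema: dict
    slo_p99_latency_ms: float
    cost_per_inference_usd: float
    weekly_inferences: int


class ModelCatalog:
    """Toy searchable index of models, keyed by model name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry: ModelEntry) -> None:
        self._entries[entry.name] = entry

    def owner_of(self, model_name: str) -> str:
        return self._entries[model_name].owner_team

    def models_owned_by(self, team: str) -> list:
        return sorted(e.name for e in self._entries.values() if e.owner_team == team)

    def weekly_cost_by_team(self) -> dict:
        # Serving cost only; training cost would need a separate source.
        totals = defaultdict(float)
        for entry in self._entries.values():
            totals[entry.owner_team] += entry.cost_per_inference_usd * entry.weekly_inferences
        return dict(totals)
```

Even this small shape answers the two questions that become painful at one hundred models: who owns a given model, and which team is spending what.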