MLOps & DevOps for AI Systems

MLOps: DevOps Meets Machine Learning

18 min Lesson 1 of 28

You have spent the previous 48 tutorials building the disciplines that make software systems production-grade: CI/CD pipelines, container orchestration, GitOps, observability stacks, SRE practices, and security postures. Every principle you internalized — deploy small, automate everything, measure before and after, own your incidents — still applies here. But machine learning systems impose a set of additional concerns that traditional DevOps tooling was not designed to handle.

MLOps is the operational discipline that extends DevOps to cover the full lifecycle of machine learning models: data ingestion, feature engineering, experiment tracking, model training, evaluation, deployment, and ongoing monitoring. At top-tier companies (Google, Meta, Amazon, Uber), dedicated ML platform teams of 10–50 engineers build and operate internal MLOps infrastructure that their data science and ML engineering organizations depend on. Understanding why that infrastructure exists — and what would break without it — is the foundation of this tutorial.

How ML Systems Differ from Conventional Software

A conventional service has a deterministic relationship between code and behavior: change the code, change the behavior, ship it, verify it. An ML system introduces another artifact, data, that is equally load-bearing but far harder to version, test, and govern.

The differences manifest in four areas that DevOps pipelines have no native answer for:

Data dependencies: Model behavior is determined not just by code but by the training dataset. Two identical codebases trained on different datasets produce entirely different models. The "source of truth" now includes tens of terabytes of structured and unstructured data, upstream pipeline definitions, feature transformation logic, and schema contracts between teams. A bug in a feature pipeline can silently corrupt a model months before anyone notices.
Experiments, not deployments: Software engineers open a PR and merge a change. ML engineers run hundreds or thousands of training experiments — varying hyperparameters, architectures, feature sets, and loss functions — and select the best result. Without systematic tracking, you cannot reproduce a result from last Tuesday, explain why the model that went to production was chosen, or audit a regulatory decision. This is the experiment-tracking problem.
Non-determinism: Two training runs with the same code and the same data can produce models with measurably different accuracy. Even on the same hardware, random weight initialization, data shuffling order, and non-deterministic GPU kernels introduce variation; across hardware generations, floating-point arithmetic differences add more. "Works on my machine" takes on a whole new meaning when the artifact is a set of 7 billion model weights.
Model drift: Deployed software does not degrade on its own. A model deployed in January will become less accurate over time if the real-world distribution of inputs shifts away from the distribution of the training data. A fraud detection model trained on 2023 transaction patterns can start missing novel 2024 fraud techniques within weeks. Drift of this kind has no direct analogue in traditional software operations, and no alarm fires automatically; you must build the monitoring.
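
The non-determinism item above can be partly tamed by controlling every source of randomness explicitly. The following is a minimal sketch using only the Python standard library; `train_toy_model` is a hypothetical stand-in for a real training loop, not part of any framework. Real projects would also need to seed NumPy and their deep learning framework and enable deterministic GPU kernels, which this sketch does not cover.

```python
import random


def train_toy_model(seed: int, steps: int = 100) -> float:
    """Hypothetical stand-in for a training run; returns one final 'weight'."""
    rng = random.Random(seed)        # isolated generator, seeded explicitly
    weight = rng.uniform(-1.0, 1.0)  # random initialization
    for _ in range(steps):
        noise = rng.gauss(0.0, 0.01)             # stochastic gradient noise
        weight += 0.1 * (0.5 - weight) + noise   # noisy step toward 0.5
    return weight


run_a = train_toy_model(seed=42)
run_b = train_toy_model(seed=42)  # same seed: identical result
run_c = train_toy_model(seed=7)   # different seed: a different model

print("same seed reproduces:", run_a == run_b)
print("different seed differs:", run_a != run_c)
```

Recording the seed alongside the code and data version is what makes a run repeatable later; without it, two "identical" runs cannot be compared fairly.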

Key idea: In conventional software, a passing test suite is strong evidence that a build is safe to ship. In ML, a passing unit test suite tells you almost nothing about whether the model is safe to serve — you need data quality checks, training metrics, evaluation on held-out test sets, and production distribution monitoring layered on top of your existing CI/CD gates.

The ML Lifecycle — Diagrammed

The ML lifecycle is not a pipeline in the traditional sense; it is a loop with feedback cycles that can reset or branch at any stage. Understanding where each tool category fits in this loop is the prerequisite for the rest of this tutorial series.

The ML system lifecycle: a feedback loop across six stages, with drift detection triggering automated retraining.

Data: The First-Class Artifact

In a well-run ML organization, data is versioned, tested, and governed with the same rigor applied to code. The foundational tools are a feature store (a centralized repository of computed features that can be shared across models and served at low latency) and a data pipeline whose outputs can be reproduced exactly. Uber's write-up of its Michelangelo platform, one of the first large-scale internal ML platforms to be described publicly in detail, singles out data management as one of the hardest and most costly layers to get right. The common failure mode: a feature is computed differently in training (batch SQL on historical data) than in serving (real-time application code), producing a silent distributional mismatch known as training-serving skew. Training-serving skew is a frequently cited cause of degraded-model incidents at companies operating ML at scale.
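
To make training-serving skew concrete, here is a small sketch in which the same feature is implemented twice with slightly different handling of missing values. The function names and the feature itself are hypothetical, chosen only for illustration; the parity check at the end is one simple way such a mismatch might be caught.

```python
from typing import Dict, List, Optional, Sequence


def avg_order_value_training(orders: Sequence[Optional[float]]) -> float:
    """Hypothetical batch path: missing amounts are filled with 0.0."""
    filled = [0.0 if o is None else o for o in orders]
    return sum(filled) / len(filled) if filled else 0.0


def avg_order_value_serving(orders: Sequence[Optional[float]]) -> float:
    """Hypothetical online path: missing amounts are dropped instead."""
    present = [o for o in orders if o is not None]
    return sum(present) / len(present) if present else 0.0


def find_skew(records: Dict[str, List[Optional[float]]], tol: float = 1e-9) -> List[str]:
    """Return the record ids where the two feature paths disagree."""
    skewed = []
    for rec_id, orders in records.items():
        train_value = avg_order_value_training(orders)
        serve_value = avg_order_value_serving(orders)
        if abs(train_value - serve_value) > tol:
            skewed.append(rec_id)
    return skewed


records = {
    "user_1": [20.0, 40.0],        # no missing values: both paths agree
    "user_2": [30.0, None, 60.0],  # missing value: 30.0 in training, 45.0 in serving
}
skewed_ids = find_skew(records)
print(skewed_ids)  # ['user_2']
```

A feature store addresses this by having both paths read from one shared definition, so a parity check like this has nothing to find.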

# Data version control with DVC (Data Version Control) — treats datasets like Git treats code.
# Initialize alongside your Git repo:
pip install dvc dvc-s3

git init
dvc init

# Track a large training dataset stored in S3:
dvc remote add -d myremote s3://my-ml-bucket/dvc-store
dvc add data/training_features.parquet
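# dvc add writes the ignore rule to data/.gitignore; the remote is stored in .dvc/config.
git add data/training_features.parquet.dvc data/.gitignore .dvc/config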
git commit -m "track training features v1"
dvc push

# Reproduce exact dataset state on any machine or in CI:
dvc pull
# The .dvc file is committed to Git — just like a lockfile.
# dvc pull fetches the exact content-addressed blob from S3.
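
The idea of a content-addressed blob can be sketched in a few lines of Python. This is an illustration of the general technique, not DVC's actual implementation: the pointer file stores a hash of the data, and the hash alone identifies which bytes to fetch. The helper name `content_address` and the file names are hypothetical; MD5 is used here on the assumption that it matches the hash DVC records by default.

```python
import hashlib
import os
import tempfile


def content_address(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    v1 = os.path.join(tmp, "features_v1.bin")
    v1_copy = os.path.join(tmp, "features_v1_copy.bin")
    v2 = os.path.join(tmp, "features_v2.bin")
    for path, payload in [(v1, b"a,b\n1,2\n"), (v1_copy, b"a,b\n1,2\n"), (v2, b"a,b\n1,3\n")]:
        with open(path, "wb") as f:
            f.write(payload)

    hash_v1 = content_address(v1)
    hash_v1_copy = content_address(v1_copy)
    hash_v2 = content_address(v2)

print(hash_v1 == hash_v1_copy)  # True: same bytes, same address
print(hash_v1 == hash_v2)       # False: one changed value, new address
```

Because the address depends only on the bytes, committing the small pointer file to Git is enough to pin the exact dataset version.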

Experiments: The Gap Between Research and Engineering

Every ML project involves a period of experimentation: which architecture performs best, which features matter, what regularization prevents overfitting on the validation set. This is necessary and healthy. The operational failure mode is running this experimentation informally — notebooks on a local laptop, metrics printed to stdout, model weights saved to model_v3_final_FINAL.pkl. When the time comes to reproduce the experiment that produced the best model, you cannot, because the exact hyperparameters, data version, random seed, and library versions were never recorded.

Experiment tracking tools such as MLflow (open-source, self-hosted), Weights & Biases (managed SaaS), and Neptune solve this by logging every training run's parameters, metrics, artifacts, and environment into a queryable database. A model registry entry can then link the deployed artifact back to the exact experiment that produced it.

# MLflow experiment tracking — minimal instrumentation in a training script:
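import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Assumes X_train, y_train, X_val, y_val are already loaded
# (for example, from the DVC-tracked dataset above).

mlflow.set_experiment("fraud-detection-v2")

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5, "learning_rate": 0.05}
    mlflow.log_params(params)

    model = GradientBoostingClassifier(**params)
    model.fit(X_train, y_train)

    # Score the positive class on the validation set
    val_scores = model.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, val_scores)
    mlflow.log_metric("val_auc", auc)

    # Infer the input/output schema from validation data and log it with the model
    signature = infer_signature(X_val, model.predict(X_val))
    mlflow.sklearn.log_model(model, "model",
        signature=signature,
        registered_model_name="fraud-detection")

# Every run is now queryable: who ran it, when, with what params,
# producing what metrics, saving what artifact. The data version is
# recorded only if you log it yourself (for example, as a run tag).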
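
To show what a tracking tool needs to capture for a run to be reproducible, here is a standard-library sketch of a single run record. It is not how MLflow stores runs internally; `build_run_record` and its field names are hypothetical, and the metric value is a placeholder rather than a real result.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone
from typing import Any, Dict


def build_run_record(params: Dict[str, Any], seed: int,
                     data_bytes: bytes, metrics: Dict[str, float]) -> Dict[str, Any]:
    """Collect the facts needed to reproduce or audit one training run."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "params": params,
        "seed": seed,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),  # pins the data version
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "metrics": metrics,
    }


record = build_run_record(
    params={"n_estimators": 200, "max_depth": 5, "learning_rate": 0.05},
    seed=42,
    data_bytes=b"stand-in for the training file contents",
    metrics={"val_auc": 0.0},  # placeholder value, not a measured result
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Each field corresponds to something the informal notebook workflow tends to lose: the hyperparameters, the seed, the data version, and the environment.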

Drift: The Silent Degradation

This is the concern with no conventional DevOps parallel. After you deploy a model, the world keeps changing. Data drift (also called covariate shift) means the statistical distribution of input features has shifted — a recommendation model trained before a major product redesign sees inputs it was never trained on. Concept drift means the relationship between inputs and the correct output has changed — a churn prediction model trained in 2023 may no longer reflect current user behavior patterns.

Neither of these fires an alert in your Prometheus stack. The model continues to accept requests and return predictions — it just returns increasingly wrong predictions. Detecting drift requires comparing live traffic statistics against training-time statistics on a continuous basis, and defining alert thresholds that trigger investigation or automated retraining.
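
One common way to compare live traffic against training-time statistics is the Population Stability Index (PSI), computed per feature over a fixed set of bins. The sketch below is a plain-Python illustration under stated assumptions: the bin edges, the sample data, and the 0.2 alert threshold (a widely quoted rule of thumb, not a value from this lesson) are all illustrative.

```python
import math
from typing import List, Sequence


def histogram_fractions(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Fraction of values per bin; out-of-range values go to the first or last bin."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = 0
        for i in range(len(edges) - 1):
            if v >= edges[i]:
                idx = i  # last bin whose lower edge is <= v
        counts[idx] += 1
    return [c / len(values) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index: sum of (a - e) * ln(a / e) over bins."""
    total = 0.0
    for e, a in zip(histogram_fractions(expected, edges),
                    histogram_fractions(actual, edges)):
        e, a = max(e, eps), max(a, eps)  # floor to avoid log(0) and division by zero
        total += (a - e) * math.log(a / e)
    return total


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
training = [i / 100 for i in range(100)]            # roughly uniform on [0, 1)
live_stable = list(training)                        # same distribution
live_shifted = [0.5 + i / 200 for i in range(100)]  # mass moved to the upper half

PSI_ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb threshold
psi_stable = psi(training, live_stable, edges)
psi_shifted = psi(training, live_shifted, edges)
print("stable drifted?", psi_stable > PSI_ALERT_THRESHOLD)
print("shifted drifted?", psi_shifted > PSI_ALERT_THRESHOLD)
```

In practice the training-time fractions would be computed once and stored with the model, and the live fractions recomputed on a schedule over a sliding window of requests.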

The Google paper "Hidden Technical Debt in Machine Learning Systems" (Sculley et al., 2015), which expands an earlier workshop paper titled "Machine Learning: The High Interest Credit Card of Technical Debt," argues that ML systems accrue technical debt in ways that are invisible to conventional code review. One of the areas it highlights is coping with changes in the external world, where it calls for comprehensive live monitoring of deployed models. The practical defense is operational instrumentation.

Production practice: Define a model health SLO alongside your service SLO. For a fraud detection model: "precision on the daily fraud cohort must not drop below 0.82." Wire that metric to your alerting stack — the same PagerDuty runbook that handles a latency SLO breach should handle a model precision breach. Treat model degradation as an incident, not a data science problem.
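
The model health SLO described above can be sketched as a small check that an alerting job might run once labels for the daily cohort arrive. The 0.82 threshold comes from the example in the text; the function names, the result shape, and the sample labels are hypothetical.

```python
from typing import Dict, Optional, Sequence

PRECISION_SLO = 0.82  # threshold from the fraud detection example


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> Optional[float]:
    """Precision = TP / (TP + FP); None when the model flagged nothing."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return None
    return tp / (tp + fp)


def check_model_slo(y_true: Sequence[int], y_pred: Sequence[int],
                    threshold: float = PRECISION_SLO) -> Dict[str, object]:
    """Evaluate the cohort and report whether the SLO is breached."""
    value = precision(y_true, y_pred)
    breached = value is not None and value < threshold
    return {"precision": value, "threshold": threshold, "breached": breached}


# Made-up daily cohort: 1 = fraud, 0 = legitimate
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1, 0, 1]

slo_result = check_model_slo(y_true, y_pred)
print(slo_result)  # precision is 4 / 5 = 0.8, below 0.82, so breached is True
```

The `breached` flag is the value to export to the alerting stack, so that a precision breach pages the same way a latency breach does.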

The MLOps Maturity Levels

Google Cloud's MLOps guide ("MLOps: Continuous delivery and automation pipelines in machine learning") defines three maturity levels, and knowing where you are determines which tooling investments pay off first:

Level 0 (manual process): Data scientists train models in notebooks and hand off artifacts to ops teams, who deploy them as static binaries. Retraining happens reactively, often months after drift is noticed. Many organizations that are starting to apply ML operate at this level.
Level 1 (ML pipeline automation): Training is a parameterized pipeline that runs automatically when new data arrives. Experiment tracking is in place. The feature store and model registry exist. Models are retrained on a schedule or on a drift trigger.
Level 2 (CI/CD pipeline automation): The training pipeline itself is code-reviewed, unit-tested, and deployed via a CI/CD system. Adding a new feature or changing a model architecture triggers a full automated pipeline run, including evaluation gates and staged rollout. Large ML organizations typically aim for this level for their most critical ML systems.
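
An evaluation gate of the kind Level 2 relies on can be sketched as a function a CI job calls after training. The thresholds, metric choice, and names below are assumptions made for illustration; a real gate would usually check several metrics and data slices.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    reason: str


def evaluation_gate(candidate_auc: float, production_auc: float,
                    min_auc: float = 0.80, max_regression: float = 0.005) -> GateResult:
    """Block promotion if the candidate is too weak or regresses versus production."""
    if candidate_auc < min_auc:
        return GateResult(False, f"candidate AUC {candidate_auc:.3f} is below floor {min_auc:.3f}")
    if candidate_auc < production_auc - max_regression:
        return GateResult(False, f"candidate AUC {candidate_auc:.3f} regresses versus "
                                 f"production {production_auc:.3f}")
    return GateResult(True, "candidate meets the floor and does not regress")


# Illustrative metric values, not real results
better = evaluation_gate(candidate_auc=0.915, production_auc=0.91)
regressed = evaluation_gate(candidate_auc=0.90, production_auc=0.91)
too_weak = evaluation_gate(candidate_auc=0.75, production_auc=0.70)

for result in (better, regressed, too_weak):
    print(result.passed, "-", result.reason)
```

In a pipeline, a failed gate would typically exit with a non-zero status so that the staged rollout never starts.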

Production pitfall: The most common MLOps mistake at Level 0-to-1 transitions is instrumenting experiment tracking before fixing training-serving skew. You can track a thousand experiments with perfect fidelity, but if your serving pipeline computes features differently than your training pipeline, every deployed model is subtly broken. Fix data pipeline consistency first; add experiment tracking second.

Connecting to Your Existing DevOps Stack

MLOps is not a replacement for DevOps — it is a layer on top of it. The Kubernetes cluster you operate is the compute plane for training jobs (via Kubeflow or Argo Workflows) and for model serving (via KServe). The Prometheus and Grafana observability stack you run monitors model serving latency and throughput. The GitOps workflow you use for application deployments applies equally to ML pipeline definitions stored as code. The security and secrets management practices you have already established protect training data access credentials and model artifact signing.

The next nine lessons in this tutorial drill into each stage of the lifecycle in depth: data and feature pipelines, experiment tracking, GPU training infrastructure, CI/CD for models, serving patterns, production monitoring, LLM operations, cost governance, and a capstone ML platform design. Each lesson builds on one component of the lifecycle diagram above — and builds on the DevOps infrastructure you already know how to operate.