MLOps & DevOps for AI Systems

LLM Operations

18 min Lesson 8 of 28

LLM Operations

Serving a traditional ML model — a gradient-boosted tree, a small neural net — fits comfortably inside the patterns covered in lessons 6 and 7: containerise the model, expose an HTTP endpoint, scale horizontally on CPU, watch latency and drift. Large language models break every one of those assumptions. A single 70-billion-parameter model in 16-bit precision occupies roughly 140 GB of GPU memory and produces tokens one at a time in an autoregressive loop. The inference path is memory-bandwidth-bound rather than compute-bound. Horizontal scaling means routing tokens across multiple GPU servers, not spinning up more CPU pods. The failure modes are wholly different: GPU out-of-memory (OOM) crashes, KV-cache exhaustion, prompt injection, and runaway context lengths that silently 10× your cost.

This lesson covers how production engineering teams at big-tech companies actually operate LLMs: the GPU memory mathematics you must internalise, the serving stacks in wide deployment, caching strategies that cut cost by 30–70%, and the prompt engineering infrastructure needed to ship reliably across model versions.

GPU Memory Mathematics

Before sizing any GPU cluster, work through the memory budget. The formula for a static model load (parameters only) in bytes is:

model_bytes = num_parameters × bytes_per_parameter

Common precision choices and their costs: FP32 = 4 bytes (training only), BF16/FP16 = 2 bytes (standard inference), INT8 = 1 byte (quantised), INT4 = 0.5 bytes (aggressive quantisation, quality loss). A 70B model in BF16 = 70 × 10⁹ × 2 = 140 GB. A single H100 SXM5 has 80 GB HBM3, so the minimum deployment is a 2-GPU tensor-parallel setup.

But parameter weight is only part of the budget. During inference the KV-cache grows dynamically. Each token in the context window reserves memory for the key and value tensors at every attention layer:

kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element

For Llama-3-70B (80 layers, 8 GQA heads, head_dim 128, BF16): 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 0.32 MB per token per sequence. At a batch of 32 concurrent sequences with a 4k context each, that is 32 × 4096 × 0.32 MB ≈ 42 GB — on top of the 140 GB model weight. This is why LLM serving is GPU-memory-bound, not compute-bound: you tune serving parameters to fit sequences into the remaining headroom after model load.

Thumb rule for sizing: reserve ~20% of total GPU memory for the model weight, ~20% for CUDA kernels and framework overhead, and allocate the remaining ~60% to the KV-cache. Divide that by your expected max-context × bytes-per-token to get your maximum concurrent sequences (the effective batch size). vLLM exposes gpu_memory_utilization (default 0.90) to control this split.

Production Serving Stacks

The two dominant open-source LLM inference engines in production are vLLM and TensorRT-LLM. vLLM introduced PagedAttention — borrowing OS paging concepts to manage the KV-cache in non-contiguous blocks, eliminating internal fragmentation and enabling dynamic batching across variable-length sequences. TensorRT-LLM (NVIDIA) compiles models to CUDA kernels with kernel fusion and FP8 quantisation, extracting the last 20–40% of hardware throughput at the cost of a longer build pipeline.

For a production Kubernetes deployment using vLLM behind the OpenAI-compatible API:

# vllm-deployment.yaml — Llama-3-70B on 2x H100 (tensor parallel)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-vllm
  namespace: ml-serving
spec:
  replicas: 1          # scale by adding replicas; each replica owns its GPU pod
  selector:
    matchLabels:
      app: llm-vllm
  template:
    metadata:
      labels:
        app: llm-vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.5.4
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-70B-Instruct"
            - "--tensor-parallel-size"
            - "2"
            - "--dtype"
            - "bfloat16"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.88"
            - "--enable-prefix-caching"    # radix-tree KV cache reuse
            - "--port"
            - "8000"
          resources:
            limits:
              nvidia.com/gpu: 2
              memory: 400Gi
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: hf-model-cache-pvc
      nodeSelector:
        cloud.google.com/gke-accelerator: nvidia-h100-80gb
---
apiVersion: v1
kind: Service
metadata:
  name: llm-vllm
  namespace: ml-serving
spec:
  selector:
    app: llm-vllm
  ports:
    - port: 80
      targetPort: 8000

Key flags to understand: --tensor-parallel-size 2 shards each weight matrix across 2 GPUs using NCCL all-reduce on every forward pass — both GPUs must be on the same NVLink fabric (same node). --enable-prefix-caching activates the radix-tree cache that stores computed KV states for repeated prompt prefixes; this is the single most impactful flag for chatbot and RAG workloads where every request shares a long system prompt.

Caching Strategies

LLM inference cost is dominated by GPU-hours. Caching at multiple levels cuts this cost dramatically in production workloads.

KV-cache prefix sharing (in-engine): Enabled with --enable-prefix-caching in vLLM. The engine stores computed KV states for any prompt prefix in a radix tree. A second request sharing the first 2,000 tokens (e.g. the same system prompt + retrieved context) skips the prefill for those tokens entirely — prefill is the expensive phase. At companies running RAG-heavy pipelines, this reduces effective GPU time per request by 40–70%.

Semantic cache (application layer): An exact-match cache (Redis) is useless for natural language because two queries rarely share a byte-for-byte prefix. A semantic cache embeds each incoming query, searches a vector store (Redis with the Vector Similarity Search module, Qdrant, or Pinecone) for a nearest neighbour within a cosine similarity threshold (typically 0.92–0.97), and returns the cached LLM response if the neighbour is close enough. Libraries such as GPTCache and LangChain's semantic cache implement this pattern. At Google and Meta scale, semantic caching on a high-cardinality support-chat product delivers a 30–50% cache hit rate on typical workloads.

Response cache (CDN/Redis): For deterministic, idempotent prompts — report generation, document summarisation with a fixed template — cache the full response keyed on a hash of the exact prompt + model version + sampling parameters. Temperature 0 with a fixed seed makes this safe.

Measure before optimising: instrument your serving layer to emit a cache_hit_rate metric per cache tier. A semantic cache that fires on 5% of queries but adds 20 ms of embedding overhead on every miss is a net negative. Calibrate similarity thresholds against human-judged answer quality, not just cache hit rate.

LLM Serving Architecture

LLM serving stack: gateway → semantic cache → vLLM scheduler → GPU pods, with a Git-versioned prompt registry feeding the gateway.

Prompt & Version Management

Prompt engineering at production scale is a software engineering problem, not a research problem. A prompt is code: it has a version, a test suite, a deployment pipeline, and a rollback path. The discipline is called prompt management, and every serious LLM platform implements it.

The minimal viable prompt management system has four components:

Git-versioned prompt templates — store prompts as files in a repository (e.g. prompts/support-chat/v3.jinja2). Each commit is a version. Use Jinja2 or Handlebars templates with typed variables so the application layer fills slots, not raw string concatenation.
Prompt registry service — a lightweight API (or a feature of your LLM gateway such as LiteLLM, Portkey, or a home-built service) that serves named prompt versions by tag. Production traffic hits tag: stable; canary traffic hits tag: canary. The application code references prompt names, not prompt text.
Evaluation pipeline — before promoting a new prompt version, run an automated eval suite: exact-match assertions for deterministic outputs, LLM-as-judge for open-ended quality, latency regression tests. Fail the promotion if quality scores drop below threshold.
Rollback — because a prompt change is just a tag pointer update, rollback is instant: flip stable to point at the previous version without touching application code or redeploying pods.

# Minimal prompt registry in Python (FastAPI) — production teams use LiteLLM
# or Langfuse for this, but the pattern is the same.

# prompts/support-chat/v3.jinja2
# ---
# You are a support agent for {{ company_name }}.
# Answer in {{ language }}. Tone: {{ tone }}.
# Context: {{ context }}
# ---

# prompt_registry.py
import yaml, jinja2
from pathlib import Path
from fastapi import FastAPI, HTTPException

app = FastAPI()
PROMPT_DIR = Path("prompts")

# tags.yaml (git-committed, updated by CI)
# support-chat:
#   stable: v3
#   canary: v4

def load_tags():
    with open(PROMPT_DIR / "tags.yaml") as f:
        return yaml.safe_load(f)

@app.get("/prompt/{name}")
def get_prompt(name: str, tag: str = "stable"):
    tags = load_tags()
    if name not in tags:
        raise HTTPException(404, "prompt not found")
    version = tags[name].get(tag, tags[name]["stable"])
    template_path = PROMPT_DIR / name / f"{version}.jinja2"
    if not template_path.exists():
        raise HTTPException(404, "version not found")
    return {"name": name, "version": version, "template": template_path.read_text()}

# Render at call site (application code):
# resp = requests.get("http://prompt-registry/prompt/support-chat?tag=stable")
# tmpl = jinja2.Template(resp.json()["template"])
# prompt = tmpl.render(company_name="Acme", language="en", tone="professional", context=ctx)

Model version management adds another dimension: when the underlying model changes (e.g. upgrading from Llama-3-70B to Llama-3.1-70B), existing prompts may regress. The pattern is to pin prompt versions to model versions in the registry — support-chat@v3 is certified for llama-3-70b; the canary flag for the new model points at v4 which was re-evaluated against the new model checkpoint. Never silently upgrade the model under a prompt that was not re-evaluated.

Key LLM Observability Metrics

In addition to the standard RED metrics (rate, errors, duration), LLM serving requires a second tier of token-level metrics:

Tokens per second (TPS): decode throughput; H100 with vLLM on Llama-3-70B delivers ~400–600 tok/s at batch 32. Falling below baseline indicates KV-cache thrashing or thermal throttling.
Time-to-first-token (TTFT): latency of the prefill phase; dominated by prompt length. P99 TTFT > 3 s is usually unacceptable for interactive use cases.
KV-cache utilisation: exposed by vLLM's /metrics endpoint as vllm:gpu_cache_usage_perc. Sustained > 90% causes request queuing; reduce max-model-len or add GPU capacity.
Token cost per request: (prompt_tokens + completion_tokens) × cost_per_token. Alert on p95 exceeding budget — runaway agentic loops show up here first.

Runaway context attacks: an adversarial user (or a looping agent) that sends a 128k-token prompt on every request can exhaust your KV-cache and starve all other traffic. Enforce max_prompt_tokens at the gateway layer, not only inside the model server. Rate-limit by token count, not just by request count. Alert when any single session exceeds a token budget threshold (e.g. 50k tokens in 5 minutes).

LLM Operations is the highest-leverage area in AI infrastructure right now. The teams that invest in GPU memory budgeting, multi-level caching, and versioned prompt pipelines — rather than treating LLMs as black-box API calls — are the ones delivering reliable, cost-effective AI features at scale.