LLM Operations
Serving a traditional ML model (a gradient-boosted tree, a small neural net) fits comfortably inside the patterns covered in lessons 6 and 7: containerise the model, expose an HTTP endpoint, scale horizontally on CPU, watch latency and drift. Large language models break most of those assumptions. A single 70-billion-parameter model in 16-bit precision occupies roughly 140 GB of GPU memory and produces tokens one at a time in an autoregressive loop. The decode path is memory-bandwidth-bound rather than compute-bound. Horizontal scaling means sharding the model and routing requests across multi-GPU servers, not spinning up more CPU pods. The failure modes are different too: GPU out-of-memory (OOM) crashes, KV-cache exhaustion, prompt injection, and runaway context lengths that silently multiply your cost tenfold.
This lesson covers how production engineering teams at big-tech companies actually operate LLMs: the GPU memory mathematics you must internalise, the serving stacks in wide deployment, caching strategies that cut cost by 30–70%, and the prompt engineering infrastructure needed to ship reliably across model versions.
GPU Memory Mathematics
Before sizing any GPU cluster, work through the memory budget. The formula for a static model load (parameters only) in bytes is:
model_bytes = num_parameters × bytes_per_parameter
Common precision choices and their costs: FP32 = 4 bytes (training only), BF16/FP16 = 2 bytes (standard inference), INT8 = 1 byte (quantised), INT4 = 0.5 bytes (aggressive quantisation, quality loss). A 70B model in BF16 = 70 × 10⁹ × 2 = 140 GB. A single H100 SXM5 has 80 GB HBM3, so the minimum deployment is a 2-GPU tensor-parallel setup.
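The weight-only budget can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, the 80 GB default is the H100 figure quoted above, and the GPU count covers weights only, so it is a lower bound rather than a sizing recommendation.

```python
import math

# Bytes per parameter for the precisions listed above.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def model_bytes(num_parameters: float, precision: str) -> float:
    """Static weight footprint: num_parameters x bytes_per_parameter."""
    return num_parameters * BYTES_PER_PARAM[precision]


def min_gpus_for_weights(num_parameters: float, precision: str, gpu_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory holds the weights alone.

    KV-cache, activations and framework overhead are ignored here.
    """
    return math.ceil(model_bytes(num_parameters, precision) / (gpu_gb * 1e9))


if __name__ == "__main__":
    for precision in ("bf16", "int8", "int4"):
        gb = model_bytes(70e9, precision) / 1e9
        gpus = min_gpus_for_weights(70e9, precision)
        print(f"70B in {precision}: {gb:.0f} GB of weights, at least {gpus} x 80 GB GPU(s)")
```

Quantising to INT8 or INT4 brings the weights under a single 80 GB card, which is the usual motivation for accepting some quality loss.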
But parameter weight is only part of the budget. During inference the KV-cache grows dynamically. Each token in the context window reserves memory for the key and value tensors at every attention layer:
kv_bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × bytes_per_element
For Llama-3-70B (80 layers, 8 KV heads under GQA, head_dim 128, BF16): 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 0.32 MB per token per sequence. At a batch of 32 concurrent sequences with a 4k context each, that is 32 × 4096 × 0.32 MB ≈ 42 GB, on top of the 140 GB of model weights. The total of roughly 180 GB exceeds the 160 GB of a 2 × 80 GB setup, so on two GPUs the batch size or context length has to shrink to fit the roughly 20 GB left after the weights load. This is why LLM serving is GPU-memory-bound, not compute-bound: you tune serving parameters to fit sequences into the remaining headroom after model load.
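The same KV-cache arithmetic as a small Python sketch. The function names are illustrative; the inputs are the Llama-3-70B figures quoted above. Without the rounding to 0.32 MB, the exact product comes out just under 43 GB (exactly 40 GiB).

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """KV-cache bytes one token occupies in one sequence.

    The leading factor 2 accounts for storing both a key and a value
    tensor at every attention layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_bytes(batch_size: int, context_len: int, per_token_bytes: int) -> int:
    """Worst-case KV-cache size when every sequence fills its context."""
    return batch_size * context_len * per_token_bytes


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128,
                                   bytes_per_element=2)
    total = kv_cache_bytes(batch_size=32, context_len=4096, per_token_bytes=per_token)
    print(f"{per_token} bytes per token, {total / 1e9:.1f} GB for the full batch")
```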
vLLM exposes gpu_memory_utilization (default 0.90) to control this split.
Production Serving Stacks
The two dominant open-source LLM inference engines in production are vLLM and TensorRT-LLM. vLLM introduced PagedAttention, which borrows OS paging concepts to manage the KV-cache in fixed-size, non-contiguous blocks. This removes most memory fragmentation and makes continuous batching across variable-length sequences practical. TensorRT-LLM (NVIDIA) compiles models into optimised engines with kernel fusion and FP8 quantisation, extracting the last 20–40% of hardware throughput at the cost of a longer build pipeline.
For a production Kubernetes deployment using vLLM behind the OpenAI-compatible API, two flags matter most. --tensor-parallel-size 2 shards each weight matrix across 2 GPUs and runs NCCL all-reduce operations on every forward pass, so both GPUs should sit on the same NVLink fabric (same node). --enable-prefix-caching activates automatic prefix caching, which reuses the KV-cache blocks already computed for repeated prompt prefixes; this is the single most impactful flag for chatbot and RAG workloads where every request shares a long system prompt.
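The flags above can be collected into the argument list a serving container would run. The sketch below only builds the command. The model identifier, port, and max-model-len value are assumptions chosen for illustration, and build_vllm_command is a hypothetical helper, not part of vLLM.

```python
import shlex


def build_vllm_command(model: str,
                       tensor_parallel_size: int = 2,
                       gpu_memory_utilization: float = 0.90,
                       max_model_len: int = 4096,
                       enable_prefix_caching: bool = True,
                       port: int = 8000) -> list:
    """Return the argv for vLLM's OpenAI-compatible server (`vllm serve`)."""
    args = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--max-model-len", str(max_model_len),
        "--port", str(port),
    ]
    if enable_prefix_caching:
        args.append("--enable-prefix-caching")
    return args


if __name__ == "__main__":
    # Assumed model id; substitute whatever checkpoint you actually serve.
    cmd = build_vllm_command("meta-llama/Meta-Llama-3-70B-Instruct")
    print(shlex.join(cmd))
```

In a Kubernetes pod spec the returned list maps directly onto the container's command and args fields, with the GPU count in the resource limits matching tensor_parallel_size.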
Caching Strategies
LLM inference cost is dominated by GPU-hours. Caching at multiple levels cuts this cost dramatically in production workloads.
KV-cache prefix sharing (in-engine): Enabled with --enable-prefix-caching in vLLM. The engine keeps the computed KV blocks for prompt prefixes it has already seen and looks them up by content hash. A second request sharing the first 2,000 tokens (e.g. the same system prompt + retrieved context) skips the prefill for those tokens entirely, and prefill is the expensive phase for long prompts. At companies running RAG-heavy pipelines, this can reduce effective GPU time per request by 40–70%.
Semantic cache (application layer): An exact-match cache (Redis) rarely helps for natural language because two users almost never phrase the same question byte for byte. A semantic cache embeds each incoming query, searches a vector store (Redis with the Vector Similarity Search module, Qdrant, or Pinecone) for a nearest neighbour within a cosine similarity threshold (typically 0.92–0.97), and returns the cached LLM response if the neighbour is close enough. Libraries such as GPTCache and LangChain's semantic cache implement this pattern. At Google and Meta scale, semantic caching on a high-cardinality support-chat product can deliver a 30–50% cache hit rate on typical workloads.
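The lookup logic of a semantic cache fits in a short sketch. Everything here is illustrative: toy_embed is a bag-of-words stand-in for a real embedding model, the list scan stands in for a vector store, and SemanticCache is a hypothetical class, not the API of GPTCache or LangChain.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.92, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (embedding, cached response)

    def get(self, query: str):
        """Return the response of the nearest cached query, or None on a miss."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for emb, response in self.entries:
            score = cosine(q, emb)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("How do I reset my password", "Use the reset link on the login page.")
    print(cache.get("how do I reset my password?"))  # hit: same words, different bytes
    print(cache.get("What is the refund policy"))    # miss: no shared words
```

The threshold is the only tuning knob in this sketch; lowering it raises the hit rate and the risk of returning an answer to a different question.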
Response cache (CDN/Redis): For deterministic, idempotent prompts — report generation, document summarisation with a fixed template — cache the full response keyed on a hash of the exact prompt + model version + sampling parameters. Temperature 0 with a fixed seed makes this safe.
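A cache key of this kind can be derived as follows. This is a sketch under the assumption that sampling parameters are JSON-serialisable; response_cache_key is an illustrative name, and the model version string in the usage example is made up.

```python
import hashlib
import json


def response_cache_key(prompt: str, model_version: str, sampling_params: dict) -> str:
    """SHA-256 over the exact prompt, model version and sampling parameters.

    Keys are sorted before hashing so that two dicts with the same
    parameters in a different order produce the same key.
    """
    payload = json.dumps(
        {"prompt": prompt, "model": model_version, "sampling": sampling_params},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    key = response_cache_key(
        "Summarise the attached report.",
        "llama-3-70b",  # hypothetical version label
        {"temperature": 0, "seed": 1234},
    )
    print(key)
```

Including the model version in the key means a model upgrade naturally invalidates old entries instead of serving responses generated by the previous checkpoint.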
Track a cache_hit_rate metric per cache tier. A semantic cache that fires on 5% of queries but adds 20 ms of embedding overhead on every miss can be a net negative. Calibrate similarity thresholds against human-judged answer quality, not just cache hit rate.
[Figure: LLM Serving Architecture]
Prompt & Version Management
Prompt engineering at production scale is a software engineering problem, not a research problem. A prompt is code: it has a version, a test suite, a deployment pipeline, and a rollback path. The discipline is called prompt management, and every serious LLM platform implements it.
The minimal viable prompt management system has four components:
- Git-versioned prompt templates: store prompts as files in a repository (e.g. prompts/support-chat/v3.jinja2). Each commit is a version. Use Jinja2 or Handlebars templates with typed variables so the application layer fills slots, not raw string concatenation.
- Prompt registry service: a lightweight API (or a feature of your LLM gateway such as LiteLLM, Portkey, or a home-built service) that serves named prompt versions by tag. Production traffic hits tag: stable; canary traffic hits tag: canary. The application code references prompt names, not prompt text.
- Evaluation pipeline: before promoting a new prompt version, run an automated eval suite: exact-match assertions for deterministic outputs, LLM-as-judge for open-ended quality, latency regression tests. Fail the promotion if quality scores drop below threshold.
- Rollback: because a prompt change is just a tag pointer update, rollback is instant: flip stable to point at the previous version without touching application code or redeploying pods.
Model version management adds another dimension: when the underlying model changes (e.g. upgrading from Llama-3-70B to Llama-3.1-70B), existing prompts may regress. The pattern is to pin prompt versions to model versions in the registry: support-chat@v3 is certified for llama-3-70b, while the canary tag for the new model points at v4, which was re-evaluated against the new model checkpoint. Never silently upgrade the model under a prompt that was not re-evaluated.
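The tag-pointer and model-pinning ideas can be sketched as an in-memory registry. PromptRegistry and PromptVersion are hypothetical names, the storage is a plain dict rather than a service, and the evaluation gate is left out; the point is only that promotion and rollback move a pointer, and that resolving a prompt for an uncertified model fails loudly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptVersion:
    version: str
    template: str
    certified_models: frozenset  # model versions this prompt was evaluated against


class PromptRegistry:
    def __init__(self):
        self._versions = {}  # (name, version) -> PromptVersion
        self._tags = {}      # (name, tag) -> version currently pointed at
        self._history = {}   # (name, tag) -> earlier versions, oldest first

    def register(self, name: str, prompt: PromptVersion) -> None:
        self._versions[(name, prompt.version)] = prompt

    def promote(self, name: str, tag: str, version: str) -> None:
        """Point a tag at a registered version, remembering the old target."""
        if (name, version) not in self._versions:
            raise KeyError(f"{name}@{version} is not registered")
        key = (name, tag)
        if key in self._tags:
            self._history.setdefault(key, []).append(self._tags[key])
        self._tags[key] = version

    def rollback(self, name: str, tag: str) -> str:
        """Move a tag back to its previous target and return that version."""
        history = self._history.get((name, tag))
        if not history:
            raise LookupError(f"no earlier version recorded for {name}:{tag}")
        self._tags[(name, tag)] = history.pop()
        return self._tags[(name, tag)]

    def resolve(self, name: str, tag: str, model: str) -> PromptVersion:
        """Fetch the tagged prompt, refusing models it was not certified for."""
        prompt = self._versions[(name, self._tags[(name, tag)])]
        if model not in prompt.certified_models:
            raise ValueError(f"{name}@{prompt.version} is not certified for {model}")
        return prompt


if __name__ == "__main__":
    reg = PromptRegistry()
    reg.register("support-chat", PromptVersion("v3", "...", frozenset({"llama-3-70b"})))
    reg.register("support-chat", PromptVersion("v4", "...", frozenset({"llama-3.1-70b"})))
    reg.promote("support-chat", "stable", "v3")
    reg.promote("support-chat", "canary", "v4")
    print(reg.resolve("support-chat", "canary", "llama-3.1-70b").version)
```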
Key LLM Observability Metrics
In addition to the standard RED metrics (rate, errors, duration), LLM serving requires a second tier of token-level metrics:
- Tokens per second (TPS): decode throughput; H100 with vLLM on Llama-3-70B delivers ~400–600 tok/s at batch 32. Falling below baseline indicates KV-cache thrashing or thermal throttling.
- Time-to-first-token (TTFT): latency of the prefill phase; dominated by prompt length. P99 TTFT > 3 s is usually unacceptable for interactive use cases.
- KV-cache utilisation: exposed by vLLM's /metrics endpoint as vllm:gpu_cache_usage_perc. Sustained > 90% causes request queuing; reduce max-model-len or add GPU capacity.
- Token cost per request: (prompt_tokens + completion_tokens) × cost_per_token. Alert on p95 exceeding budget; runaway agentic loops show up here first.
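The cost formula and a guard against runaway sessions can be sketched together. The names are illustrative, and the 50,000-token budget and 300-second window are assumed defaults, not recommendations; a gateway would keep one SessionTokenBudget per session.

```python
import time
from collections import deque


def request_cost(prompt_tokens: int, completion_tokens: int, cost_per_token: float) -> float:
    """Token cost per request: (prompt_tokens + completion_tokens) x cost_per_token."""
    return (prompt_tokens + completion_tokens) * cost_per_token


class SessionTokenBudget:
    """Sliding-window token counter for a single session."""

    def __init__(self, max_tokens: int = 50_000, window_seconds: float = 300.0):
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens), oldest first
        self._total = 0

    def record(self, tokens: int, now: float = None) -> bool:
        """Record usage and return True while the session stays within budget."""
        now = time.monotonic() if now is None else now
        self._events.append((now, tokens))
        self._total += tokens
        # Drop usage that has aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            _, expired = self._events.popleft()
            self._total -= expired
        return self._total <= self.max_tokens


if __name__ == "__main__":
    budget = SessionTokenBudget()
    for step in range(8):
        ok = budget.record(8_000)  # e.g. one agent step consuming 8k tokens
        print(f"step {step}: within budget = {ok}")
```

Passing now explicitly makes the class easy to test; in production the default monotonic clock avoids surprises from wall-clock adjustments.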
Enforce max_prompt_tokens at the gateway layer, not only inside the model server. Rate-limit by token count, not just by request count. Alert when any single session exceeds a token budget threshold (e.g. 50k tokens in 5 minutes).
LLM Operations is the highest-leverage area in AI infrastructure right now. The teams that invest in GPU memory budgeting, multi-level caching, and versioned prompt pipelines, rather than treating LLMs as black-box API calls, are the ones delivering reliable, cost-effective AI features at scale.