The Complete Platform & Your Career
You have built every layer of the Arctiq Commerce platform across the previous nine lessons. Now step back and see the entire system as one coherent architecture — end to end, from a user's HTTPS request to the database write, through the observability pipeline, enforced by security policy, protected by SRE practice, and governed by cost controls. Then we address the forward-looking question that matters most at this stage: how do you turn this project-level depth into a senior DevOps or SRE career?
The Assembled Platform — End-to-End Request Flow
Every architectural decision you made traces through a real request. A user on a mobile device in Frankfurt places an order. Here is the complete path, and where each platform layer intervenes:
- DNS & edge: Route 53 latency-based routing resolves api.arctiq.com to the eu-west-1 CloudFront distribution. CloudFront checks the WAF rule group (rate limit, OWASP Top 10, IP reputation list from Lesson 7). The CDN origin is the ALB in eu-west-1; the EU region is now fully active, not standby, after the active-active promotion you completed in Lesson 8.
- Ingress & service mesh: The request hits the NLB, then the Istio ingress gateway. Istio terminates TLS, validates the JWT (via the RequestAuthentication policy), and routes to the orders service in the team-commerce namespace. Istio's Envoy sidecar injects a traceparent header (W3C Trace Context); this is the trace root that propagates through every downstream call.
- Application compute: Karpenter has scheduled the orders pod on a c6g.2xlarge Spot instance in the eu-west-1b AZ. The pod's service account uses IRSA, so there are no static credentials: the AWS SDK reads a projected token and exchanges it with STS for short-lived credentials scoped to the orders-service-role.
- Secrets: At pod startup, the Vault Agent injected the database DSN and the Stripe API key into an in-memory tmpfs volume at /vault/secrets/. The application reads from the file; it has never seen a static secret.
- Data write: The order is written to Aurora PostgreSQL (the eu-west-1 read replica, which was promoted to writer during the DR exercise in Lesson 8). The Aurora Global Database replicates the write to us-east-1 with typical lag of 80–120 ms. An order.placed event is published to the MSK Kafka topic orders-v2. MirrorMaker2 replicates that topic to the US cluster asynchronously.
- Observability: The Envoy sidecar reports span data to the OpenTelemetry Collector DaemonSet. The Collector batches and sends traces to Jaeger, metrics to Prometheus (via remote_write to Thanos), and structured logs to Loki. The entire order flow (ingress latency, database write duration, Kafka produce latency) appears as a single flame graph in Grafana within 15 seconds of the request completing.
- Policy enforcement: Falco is watching the orders pod. If the process attempts a fork/exec outside the allowed list (defined in the custom Falco rule from Lesson 7), an alert fires to the security Slack channel and OPA blocks any subsequent attempt to create an exec session into that pod.
- Cost accounting: The EC2 instance running the pod carries the Kubernetes cluster tags propagated by Kubecost: team=commerce, service=orders, env=prod. The cost for this request (compute, data transfer, ALB LCUs) is automatically attributed to the commerce team's monthly budget in the Kubecost dashboard.
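The traceparent header mentioned in the ingress step has a fixed layout under W3C Trace Context version 00: a version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and a 2-hex-character flags field, joined by hyphens. The sketch below shows one way to mint a root value and derive a child value for a downstream call. The function names are hypothetical and are not part of Istio, Envoy, or OpenTelemetry; in the platform described here the sidecar and SDKs do this work for you.

```python
import re
import secrets

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Create a root traceparent value (hypothetical helper, for illustration)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # bit 0 is the "sampled" flag
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID and flags, but issue a new span ID for a downstream call."""
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        raise ValueError(f"malformed traceparent: {parent!r}")
    trace_id, parent_span_id, flags = match.groups()
    # The specification treats all-zero IDs as invalid.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        raise ValueError("all-zero trace ID or span ID is invalid")
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

Because the trace ID is carried unchanged from hop to hop, every span emitted along the order path can be stitched back into one trace by the backend.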
None of this machinery is visible to the engineers who own the orders service. They write business logic, push to GitHub, and within 7 minutes their code is running in production across two regions, traced, alerted on, and cost-attributed, without opening a single ticket to the platform team. That invisibility is the measure of a mature platform.
The Complete Architecture Diagram
Platform Health Validation — The Smoke Test Suite
Every production deploy should end with a programmatic validation that confirms the platform is functioning end-to-end, not just that the pods are Running. The following k6 script is the canonical smoke test run by ArgoCD's PostSync hook after every GitOps sync wave completes.
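Separately from the k6 script, the pass/fail decision that such a smoke test encodes can be illustrated in a few lines of Python. This is only a sketch of the gate logic, not the script the PostSync hook runs: the function names are hypothetical, and the nearest-rank percentile method is an assumption (k6 may compute percentiles differently). The 800 ms limit is the threshold this lesson uses.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def smoke_gate_passes(latencies_ms: list[float], p99_limit_ms: float = 800.0) -> bool:
    """Return True when the 99th-percentile latency is within the limit."""
    return percentile(latencies_ms, 99) <= p99_limit_ms
```

With 100 samples, a single slow request leaves the 99th percentile untouched, while two slow requests push it over the limit. That is why a p99 gate catches regressions that an average would hide.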
The script's thresholds block is the gate: if p(99) exceeds 800 ms, the ArgoCD sync wave is marked failed and Argo Rollouts triggers automatic rollback.
Key Platform Metrics — What a Staff Engineer Tracks Weekly
The platform team owns five dashboards, each reviewed weekly at the engineering all-hands. These are not vanity metrics: they are leading indicators of platform health that tend to surface problems before they become incidents.
- Deploy frequency and lead time: Target is >20 deploys/day across all 12 teams, median lead time <10 minutes. A lead time spike above 20 minutes usually means a flaky test in CI, not a Kubernetes problem — the dashboard exposes this. Current: 34 deploys/day, median 6.8 min.
- Error budget burn rate: The 99.95% SLO has a quarterly error budget of about 65 minutes (0.05% of the roughly 131,000 minutes in a 91-day quarter). A 6x burn rate, which would exhaust the budget in one-sixth of the quarter, triggers a freeze on risky changes. Track this as a Grafana alert, not a monthly report. Current: 1.2x, healthy.
- Pod eviction rate: Karpenter Spot interruptions cause evictions. An eviction rate above 3% per day signals that the On-Demand fallback capacity ratio needs adjusting, or that a workload is missing a PodDisruptionBudget. Current: 0.8%.
- Secrets rotation lag: Vault dynamic credentials rotate every 15 minutes for the database DSN. If any workload holds a lease older than 30 minutes, it appears in the Vault audit log as a violation. Track as a Prometheus gauge scraped from the Vault metrics endpoint. Current: 0 violations.
- Cost per 1,000 orders: The business metric that connects platform efficiency to revenue. Kubecost computes this by joining K8s cost data with order throughput from the application metrics. Trend: $0.31/1k orders, target <$0.40. Spot usage saves ~$21k/month vs on-demand.
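The error budget and burn rate figures in the list above follow from simple arithmetic, sketched here in Python. The function names are hypothetical, and the 91-day quarter is an assumption. For a 99.95% target over 91 days the budget works out to about 65.5 minutes; for comparison, a 99.9% target over the same window would allow about 131 minutes.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def burn_rate(bad_minutes: float, elapsed_days: float, slo: float) -> float:
    """Observed unavailability divided by what the SLO allows (1.0 means exactly on budget)."""
    allowed_so_far = (1.0 - slo) * elapsed_days * MINUTES_PER_DAY
    return bad_minutes / allowed_so_far


def days_until_exhausted(rate: float, window_days: int) -> float:
    """At a constant burn rate, the full budget lasts window_days / rate."""
    return window_days / rate
```

At a sustained 6x burn rate, `days_until_exhausted(6, 91)` gives roughly 15 days. This is the reasoning behind freezing risky changes at that threshold instead of waiting for a monthly report.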
What the Capstone Proved — and What It Did Not
Being honest about the limits of this capstone is itself a senior-engineering skill. What you have built is a complete, production-grade architecture for a company at the 2–20 million user scale. What you have not built, and what would come next at a real company, includes:
- Multi-tenancy isolation: Arctiq runs 12 teams in a single EKS cluster. At 50+ teams, namespace-level isolation begins to break down (noisy neighbor on the API server, overly wide RBAC, etcd pressure). The next evolution is dedicated clusters per business unit with a fleet management layer (Cluster API or EKS Blueprints at the account level).
- Global distributed tracing at billion-request scale: Jaeger with in-memory storage works at our scale. At 1 billion requests/day, you need tail-based sampling (Tempo or Honeycomb), a columnar trace store (ClickHouse or Parquet on S3), and probabilistic sampling policies that guarantee 100% sampling for error traces and 1% for healthy ones.
- FinOps maturity: Kubecost gives you per-team cost visibility. True FinOps at a large company adds unit economics (cost per API call, cost per active user), anomaly detection on spend (ML-based, not threshold-based), and engineer-facing cost nudges in the PR pipeline ("this change will increase infra cost by $400/month").
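The sampling policy described for billion-request scale (keep every error trace, keep about 1% of healthy ones) can be sketched as a decision made once a trace has been fully buffered. This is a simplified illustration, not how Tempo, Honeycomb, or the OpenTelemetry Collector implement it: the `Span` type and `keep_trace` function are hypothetical, and hashing the trace ID to get a deterministic decision is one common approach assumed here.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    is_error: bool


def keep_trace(spans: list[Span], healthy_rate: float = 0.01) -> bool:
    """Tail-based decision: keep all error traces, keep a fixed fraction of healthy ones."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True  # error traces are always retained
    # Hash the trace ID so that every collector replica reaches the same verdict.
    digest = hashlib.sha256(spans[0].trace_id.encode("ascii")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform over 0 .. 2**64 - 1
    return bucket < int(healthy_rate * 2**64)
```

The key difference from head-based sampling is that the decision is taken after the outcome is known, which is what makes the 100% guarantee for error traces possible. The cost is that spans must be buffered until the trace completes.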
Your Career Path — The Three Trajectories
Completing a capstone like this positions you at the senior engineer level. The natural next steps diverge into three trajectories, and choosing consciously between them matters more than most engineers realize:
- Staff / Principal Platform Engineer: You own a platform used by hundreds of engineers. Your output is multiplied through others — you write the golden-path templates, the architectural patterns, the internal standards. The skills that compound: technical writing, system design at scale, organizational influence without authority, and the discipline to say no to requests that increase operational burden without a commensurate reliability gain.
- SRE / Production Engineering: You own the reliability of large-scale systems — typically 1M+ RPS, complex failure domains, and SLOs that real customers feel. The skills that compound: statistical analysis of reliability data, deep knowledge of kernel-level performance (eBPF, perf flamegraphs, latency histograms), incident command, and the ability to translate reliability risk into business risk that an executive can act on.
- Engineering Manager / Director of Platform: You own the team that builds the platform. Your output is the output of 8–20 engineers. The skills that compound: hiring for judgment not credentials, creating a team culture that treats operational toil as a first-class engineering problem, communicating platform value to non-technical stakeholders, and defining a multi-year platform roadmap that stays relevant as the product changes.
The Senior DevOps Interview — What Is Actually Tested
Technical interviews for senior DevOps and SRE roles at top companies have shifted significantly. The commodity questions ("what is a pod?", "explain blue-green deployments") are screened out in the initial filter. The interviews that matter test four things:
- System design under ambiguity: You will be given a vague prompt like "design the deployment system for a company with 500 microservices." The interviewer is testing whether you ask clarifying questions (deployment frequency? team structure? tolerance for complexity?) before drawing boxes. The capstone taught you to start with requirements, not solutions.
- Incident analysis: You will be given a graph or a log snippet and asked to diagnose a production incident. Typical scenario: latency p99 spiked to 4 seconds at 14:23 but p50 was unchanged — what do you look at first? (Answer: tail latency with stable median suggests a single slow downstream, not a cluster-wide problem — check distributed traces for the slowest 1% of requests, look at GC pause metrics and connection pool saturation.)
- Trade-off reasoning: "Should we use Istio or Linkerd for our service mesh?" There is no right answer — there are trade-offs. Istio's broader feature set costs more CPU/memory and operational complexity. Linkerd's Rust data plane is faster and simpler but has less ecosystem tooling. The interviewer is testing whether you can articulate trade-offs clearly, not whether you have memorized a winner.
- Production failure modes: "What breaks first when your EKS cluster scales from 500 to 5,000 nodes?" The answer involves etcd write throughput, API server request rate limits, a CoreDNS NXDOMAIN flood caused by the ndots:5 resolver setting (the Kubernetes default for pods), and the Kubernetes scheduler's default pod QPS ceiling. These are not things you learn from tutorials; they come from running production systems at scale, or from studying post-mortems from companies that have.
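The ndots:5 problem named in the last bullet is easier to reason about once you see which names a resolver actually tries. The sketch below models the search-list behavior described in resolv.conf(5): a name with fewer dots than ndots is tried against each search domain before it is tried as-is. The function name is hypothetical, and the search domains in the usage note are assumptions based on the team-commerce namespace and the conventional cluster.local cluster domain.

```python
def resolver_candidates(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Return the names a stub resolver would try, in order, for a single lookup."""
    if name.endswith("."):
        return [name]  # already fully qualified: no search-list expansion
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]      # too few dots: walk the search list first
```

With a search list of `["team-commerce.svc.cluster.local", "svc.cluster.local", "cluster.local"]`, a lookup of `api.stripe.com` yields four candidates, and the first three can only return NXDOMAIN. Multiply that by every external call from every pod and the load on CoreDNS grows quickly as the cluster scales. Writing the name with a trailing dot, or lowering ndots in the pod's DNS config, avoids the expansion.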
Closing: What This Course Actually Taught You
This course was never really about Terraform syntax or Kubernetes YAML. Those are implementation details that change with every major version. What the course built, across 50 tutorials and this capstone, is a mental model for how production systems fail and how resilient platforms are designed to fail gracefully. The four mental models that will serve you for the next decade:
- Defense in depth: No single control is enough. Every layer — network, admission control, runtime security, secret rotation, alerting — exists because the layer above it will eventually fail. You do not trust the WAF to block everything, so you have Istio mTLS. You do not trust mTLS alone, so you have OPA. You do not trust OPA alone, so you have Falco. The goal is not zero breaches — it is detecting and containing a breach before it becomes a catastrophe.
- Observability-first design: A system you cannot observe is a system you cannot understand and cannot improve. Every service you deploy should be instrumentable before it is deployed, not instrumented after the first incident. SLOs are not monitoring configuration — they are a contract between the platform and the business.
- Toil is a system design failure: Every time an engineer has to do something manually that could be automated, that is a bug in the platform, not a feature of the role. The greatest leverage a platform team has is to eliminate entire categories of manual work — not to do the manual work faster.
- Requirements drive architecture: Every tool, every service, every layer in the platform exists because of a specific requirement with a number attached to it. If you cannot point to the requirement that justifies a piece of complexity, that piece of complexity should not exist.
The platform you built in this capstone handles the Arctiq Commerce requirements. The judgment you developed building it handles requirements you have not seen yet. That is what a senior DevOps engineering career is built on — not the tools, but the judgment to choose the right tools for the constraints in front of you.