DNS & How Names Resolve

Every time a user types https://api.example.com into their browser or an inter-service call hits a hostname, a small but critical piece of infrastructure springs into action: the Domain Name System (DNS). DNS is the internet's distributed phonebook — it translates human-readable names into the IP addresses that routers and servers actually understand. For a system designer, DNS is not just plumbing; it is a lever for routing traffic, enforcing failover, load-balancing globally, and shaping latency.

The DNS Hierarchy

DNS is a hierarchical, delegated namespace organized as a tree. At the root sits the root zone (the single dot .), managed by IANA, a function of ICANN. Below it are the Top-Level Domains (TLDs): .com, .org, .io, and country codes like .uk. Each TLD registry delegates second-level domains to the authoritative name servers that the domain owner designates, usually run by a DNS hosting provider or by the owner. For example, the owner of example.com runs (or points to) authoritative servers that hold the records for every name under that domain.

There is no single server that knows everything. The genius of DNS is that every level only knows who to ask next — a model called iterative resolution.

The Full Resolution Walk-Through

When a client needs to resolve api.example.com, the following chain fires:

1. Browser / OS cache — the browser checks its own DNS cache, then the OS checks its local cache. If a fresh record exists, the answer is returned immediately with zero network hops.
2. Stub resolver → Recursive resolver — if there is no cache hit, the OS stub resolver queries the recursive resolver configured on the machine (e.g. 8.8.8.8 for Google Public DNS or the ISP's server). The recursive resolver does all the hard work on behalf of the client.
3. Recursive resolver → Root servers — the recursive resolver asks one of the 13 named root servers (a through m, each served by many anycast instances): "Who handles .com?" The root replies with the address of the .com TLD name servers.
4. Recursive resolver → TLD name servers — the resolver asks the .com TLD servers: "Who is authoritative for example.com?" They respond with the authoritative name server (NS) records for example.com.
5. Recursive resolver → Authoritative name server — the resolver queries one of those servers, say ns1.example.com: "What is the IP for api.example.com?" The authoritative server returns the A record (IPv4) or AAAA record (IPv6).
6. Answer returned to client — the recursive resolver caches the result and forwards the IP back to the stub resolver, which hands it to the application. The browser can now open a TCP connection.

In practice, the recursive resolver caches answers at every step. On a warm cache, only step 2 (client → recursive resolver) fires — a round-trip of a few milliseconds. A fully cold lookup through all four tiers adds 50–150 ms of latency on a typical internet connection, which is why caching is so important.

The six-step iterative DNS resolution chain — from client cache check to authoritative name server and back.
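
The referral chain above can be imitated with a few in-memory tables. This is a minimal sketch, not a real resolver: the server labels ("root", "tld-com"), the SERVERS table, and the address 192.0.2.10 are made up for illustration, and no network traffic is involved.

```python
from typing import Dict, List, Tuple

# Toy data: each "server" maps a name suffix to a referral or a final answer.
# Server labels and the IP address are hypothetical.
SERVERS: Dict[str, Dict[str, Tuple[str, str]]] = {
    "root": {"com.": ("referral", "tld-com")},
    "tld-com": {"example.com.": ("referral", "ns1.example.com")},
    "ns1.example.com": {"api.example.com.": ("answer", "192.0.2.10")},
}


def query(server: str, name: str) -> Tuple[str, str]:
    """Return what `server` knows about the longest matching suffix of `name`."""
    zone = SERVERS[server]
    labels = name.split(".")  # "api.example.com." -> ["api", "example", "com", ""]
    for i in range(len(labels) - 1):
        suffix = ".".join(labels[i:])
        if suffix in zone:
            return zone[suffix]
    raise LookupError(f"{server} has no data for {name}")


def resolve(name: str, cache: Dict[str, str]) -> Tuple[str, List[str]]:
    """Walk root -> TLD -> authoritative, returning the IP and the servers asked."""
    if name in cache:
        return cache[name], []  # warm cache: nobody is asked
    server, path = "root", []
    while True:
        path.append(server)
        kind, value = query(server, name)
        if kind == "answer":
            cache[name] = value
            return value, path
        server = value  # follow the referral to the next tier


cache: Dict[str, str] = {}
ip, path = resolve("api.example.com.", cache)
print(ip, path)
print(resolve("api.example.com.", cache))
```

The second call returns an empty path because the answer is already cached, which mirrors the warm-cache case described above.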

Key DNS Record Types

A — maps a hostname to an IPv4 address. The most common record type.
AAAA — maps a hostname to an IPv6 address.
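CNAME — canonical name; aliases one hostname to another. Cannot be used at the zone apex (the bare domain, such as example.com), because a CNAME cannot coexist with the SOA and NS records that must live there.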
NS — name server; delegates a zone to specific servers.
MX — mail exchange; directs email traffic to mail servers with priority ordering.
TXT — arbitrary text; used for SPF, DKIM, domain verification, and service discovery.
SRV — service locator; specifies the target host, port, priority, and weight for a specific protocol/service. Used heavily by Kubernetes and service meshes.
ALIAS / ANAME — a non-standard, provider-specific record that behaves like a CNAME but can be used at the zone apex. AWS Route 53 calls this an Alias record.
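
The CNAME restriction is easy to check mechanically before a zone is published. The sketch below is one possible linter, assuming records arrive as simple (name, type, value) tuples; the function name validate_zone and the sample zone are hypothetical. It flags a CNAME at the apex and, more generally, a CNAME that shares a name with any other record type.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str, str]  # (name, type, value)


def validate_zone(apex: str, records: List[Record]) -> List[str]:
    """Return human-readable problems related to CNAME placement."""
    types_by_name: Dict[str, Set[str]] = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name].add(rtype)

    problems: List[str] = []
    for name, types in sorted(types_by_name.items()):
        if "CNAME" not in types:
            continue
        if name == apex:
            problems.append(f"{name}: CNAME not allowed at the zone apex")
        elif len(types) > 1:
            others = sorted(types - {"CNAME"})
            problems.append(f"{name}: CNAME cannot coexist with {others}")
    return problems


zone = [
    ("example.com.", "NS", "ns1.example.com."),
    ("example.com.", "CNAME", "lb.example.net."),      # invalid: apex
    ("www.example.com.", "CNAME", "lb.example.net."),  # fine
    ("api.example.com.", "A", "192.0.2.10"),           # fine
]
for problem in validate_zone("example.com.", zone):
    print(problem)
```

A provider-specific ALIAS / ANAME record would be the usual way to fix the flagged apex entry.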

TTL and Caching

Every DNS record carries a Time-To-Live (TTL) — the number of seconds a resolver is allowed to cache that answer. TTL is the most important tuning knob system designers control:

Low TTL (30–300 s) — changes propagate quickly. Use this during migrations, blue/green deploys, or before a planned failover. The cost is more DNS queries reaching your authoritative servers.
High TTL (3,600–86,400 s) — reduces recursive lookups and latency. Safe for stable records that rarely change. Managed DNS providers often default to 300 s (Cloudflare's automatic TTL, for example), which is a sensible starting point for most production records.

The TTL propagation trap: if you need to do an emergency failover but your TTL is 86,400 s (24 hours), some resolvers will keep serving the old IP for up to 24 hours after you update the record. The common practice is to lower the TTL ahead of any planned change, wait at least one full cycle of the old TTL so every cache has picked up the low value, make the change, then raise the TTL again once the new record is stable.
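
The timing of that procedure is simple arithmetic. Here is a small sketch, assuming resolvers honor TTLs exactly (some do not); the function name plan_ttl_change and the dates are illustrative only.

```python
from datetime import datetime, timedelta
from typing import Dict


def plan_ttl_change(cutover: datetime, old_ttl_s: int, low_ttl_s: int) -> Dict[str, datetime]:
    """Compute the key moments of a planned DNS change.

    lower_ttl_by:   latest time to publish the low TTL so that every cache
                    holding the old, long-lived record has expired by cutover.
    old_ip_gone_by: time after which well-behaved caches no longer hold
                    the pre-cutover answer.
    """
    return {
        "lower_ttl_by": cutover - timedelta(seconds=old_ttl_s),
        "cutover": cutover,
        "old_ip_gone_by": cutover + timedelta(seconds=low_ttl_s),
    }


plan = plan_ttl_change(datetime(2024, 1, 10, 12, 0), old_ttl_s=86_400, low_ttl_s=60)
for step, when in plan.items():
    print(f"{step:>15}: {when.isoformat()}")
```

With a 24-hour TTL, the low value has to be published a full day before the cutover; after the cutover, the old address should drain within about a minute.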

DNS Caching Layers

Caching happens at multiple levels, each with its own TTL horizon:

Browser cache — Chrome and Firefox each keep their own DNS cache with entries lasting roughly 60 s; Safari varies. Entirely outside your control.
OS resolver cache — nscd, systemd-resolved, or macOS's mDNSResponder maintains a per-machine cache. Controlled by the OS.
Recursive resolver cache — the ISP or public resolver (Google, Cloudflare) caches answers for all their users. Records with a 300 s TTL get cached across millions of clients here.
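Authoritative secondaries — a secondary authoritative server holds a full copy of the zone obtained by zone transfer (AXFR/IXFR) from the primary. Strictly speaking this is replication rather than caching: freshness is governed by the SOA refresh and expire timers, not by record TTLs.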

Negative caching (NXDOMAIN): resolvers also cache negative answers such as "this name does not exist." The negative TTL is the lower of the SOA record's own TTL and its MINIMUM field (RFC 2308). If you deploy a new subdomain and it does not appear, a resolver may have cached an earlier NXDOMAIN and will keep returning it until that negative TTL expires, for example up to 5 minutes when the value is 300 s.
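
Positive and negative caching can be sketched with one small class. This is an illustration under simplifying assumptions, not how any particular resolver is implemented: the DnsCache name is hypothetical, entries are keyed by name only (real caches key by name and record type), and the clock is injectable so expiry can be demonstrated without waiting.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class DnsCache:
    """Tiny TTL cache that stores both answers and NXDOMAIN results."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # name -> (ip or None for NXDOMAIN, expiry timestamp)
        self._entries: Dict[str, Tuple[Optional[str], float]] = {}

    def put(self, name: str, ip: str, ttl_s: int) -> None:
        self._entries[name] = (ip, self._clock() + ttl_s)

    def put_negative(self, name: str, soa_ttl_s: int, soa_minimum_s: int) -> None:
        # RFC 2308: negative TTL is the lower of the SOA TTL and SOA MINIMUM.
        self._entries[name] = (None, self._clock() + min(soa_ttl_s, soa_minimum_s))

    def get(self, name: str) -> Tuple[str, Optional[str]]:
        entry = self._entries.get(name)
        if entry is None:
            return ("miss", None)
        value, expires = entry
        if self._clock() >= expires:
            del self._entries[name]  # expired: behave as if never cached
            return ("miss", None)
        return ("hit", value) if value is not None else ("nxdomain", None)


now = [0.0]  # fake clock, advanced by hand
cache = DnsCache(clock=lambda: now[0])
cache.put("api.example.com.", "192.0.2.10", ttl_s=300)
cache.put_negative("new.example.com.", soa_ttl_s=3600, soa_minimum_s=300)

now[0] = 299.0
print(cache.get("api.example.com."), cache.get("new.example.com."))
now[0] = 300.0
print(cache.get("api.example.com."), cache.get("new.example.com."))
```

Until the negative entry expires, a freshly created new.example.com. stays invisible to clients of this cache, which is the behavior the paragraph above warns about.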

DNS in System Design Decisions

DNS is not just a lookup tool — it is an active traffic-management layer:

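GeoDNS / Latency-based routing — authoritative servers can return different IPs based on the resolver's location (or the client's subnet, when the resolver forwards it via EDNS Client Subnet). AWS Route 53's latency-based and geolocation routing policies work this way. Anycast, which Cloudflare's network relies on, reaches a similar goal at the network layer instead: the same IP is announced from many locations and BGP carries each user to a nearby one.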
Weighted routing — return IP-A 90 % of the time and IP-B 10 % to do a gradual canary release at the DNS level.
Health-check failover — Route 53, Cloudflare, and NS1 all support active health checks. If the primary endpoint fails, the authoritative server automatically stops returning that IP — DNS-level failover without touching code.
Service discovery in microservices — Kubernetes uses an internal DNS server (CoreDNS) to resolve service names like payments-service.default.svc.cluster.local to virtual IPs. Each pod gets DNS-based service discovery for free.
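
Weighted routing and health-check failover both come down to how the authoritative side picks an answer for each query. The sketch below is a simplified model of that selection, not any provider's algorithm; pick_answer, the weights, and the addresses are hypothetical.

```python
import random
from typing import Dict, Optional


def pick_answer(
    weights: Dict[str, int],
    healthy: Dict[str, bool],
    rng: random.Random,
) -> Optional[str]:
    """Pick one IP: drop unhealthy endpoints, then choose by weight."""
    candidates = [ip for ip, w in weights.items() if w > 0 and healthy.get(ip, False)]
    if not candidates:
        return None  # nothing healthy to return
    return rng.choices(candidates, weights=[weights[ip] for ip in candidates], k=1)[0]


rng = random.Random(0)  # seeded only to make the demo repeatable
weights = {"192.0.2.1": 90, "192.0.2.2": 10}  # stable release vs canary

all_up = {"192.0.2.1": True, "192.0.2.2": True}
picks = [pick_answer(weights, all_up, rng) for _ in range(10_000)]
print("canary share:", picks.count("192.0.2.2") / len(picks))

primary_down = {"192.0.2.1": False, "192.0.2.2": True}
print("during failover:", pick_answer(weights, primary_down, rng))
```

Because recursive resolvers cache each answer for its TTL, the split that real users see is only approximately the configured ratio, and failover takes effect only as cached answers expire.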

GeoDNS: the same hostname returns different IP addresses depending on where the recursive resolver is located, routing users to the nearest origin.
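
A GeoDNS decision can be approximated as a lookup from the resolver's network to a regional endpoint. The following sketch uses Python's standard ipaddress module; the GEO_TABLE contents, the region comments, and all addresses are made-up documentation ranges, and real services rely on much larger geolocation or latency databases.

```python
import ipaddress

# Hypothetical mapping: resolver network -> regional endpoint to return.
GEO_TABLE = [
    (ipaddress.ip_network("198.51.100.0/24"), "192.0.2.10"),  # resolvers near region A
    (ipaddress.ip_network("203.0.113.0/24"), "192.0.2.20"),   # resolvers near region B
]
DEFAULT_ANSWER = "192.0.2.30"  # fallback when the resolver is not in the table


def geo_answer(resolver_ip: str) -> str:
    """Return the endpoint for the first network containing the resolver."""
    addr = ipaddress.ip_address(resolver_ip)
    for network, answer in GEO_TABLE:
        if addr in network:
            return answer
    return DEFAULT_ANSWER


for resolver in ("198.51.100.7", "203.0.113.200", "192.0.2.99"):
    print(resolver, "->", geo_answer(resolver))
```

The decision is based on where the recursive resolver sits, so a user on a distant public resolver may be steered to a region that is near the resolver rather than near the user.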

Common Pitfalls for System Designers

Hardcoding IPs — services that bypass DNS and hardcode IP addresses lose all the failover, routing, and load-balancing benefits DNS provides. Always resolve by hostname.
Ignoring TTL before migrations — as covered above, always lower TTL well in advance of planned IP changes.
Single point of failure in NS records — always configure at least two authoritative name servers in different networks. The DNS specification (RFC 1034) expects at least two per zone; some managed providers assign four.
Not accounting for DNS latency in SLA calculations — a cold DNS lookup adds 50–150 ms to the first connection. Connection pools and keep-alives eliminate repeat lookups, but first-request latency matters for SLAs.
CNAME chains — a CNAME pointing to another CNAME pointing to another can cost an extra lookup per hop. Some resolvers also cap the chain length (the limit is implementation-specific, often on the order of 8 hops). Keep chains short.
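
The cost of CNAME chains is easy to see when the chain is followed by hand. This is a toy sketch over an in-memory table, assuming each name has exactly one record; follow_cnames, the MAX_CNAME_HOPS value, and the sample names are illustrative, and real resolvers apply their own limits.

```python
from typing import Dict, List, Tuple

MAX_CNAME_HOPS = 8  # illustrative cap; actual limits vary by resolver


def follow_cnames(name: str, records: Dict[str, Tuple[str, str]]) -> Tuple[str, List[str]]:
    """Follow CNAMEs from `name` until an A record is found.

    Returns the IP and the list of names visited. Raises RuntimeError when
    the chain is longer than MAX_CNAME_HOPS, which also catches loops.
    """
    chain = [name]
    for _ in range(MAX_CNAME_HOPS + 1):
        rtype, value = records[chain[-1]]
        if rtype == "A":
            return value, chain
        chain.append(value)  # CNAME: one more lookup needed
    raise RuntimeError(f"CNAME chain from {name} exceeds {MAX_CNAME_HOPS} hops")


records = {
    "www.example.com.": ("CNAME", "cdn.example.net."),
    "cdn.example.net.": ("CNAME", "edge.example.org."),
    "edge.example.org.": ("A", "192.0.2.10"),
}
ip, chain = follow_cnames("www.example.com.", records)
print(ip, "after", len(chain) - 1, "CNAME hops:", " -> ".join(chain))
```

Each hop that crosses into a different zone may require its own trip through the resolution chain on a cold cache, which is why short chains matter most for first-request latency.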

Pre-warm DNS in your health-check strategy: when spinning up a new region during auto-scaling or a failover event, resolve the DNS records from within that region before routing real traffic there. This helps ensure the local OS and container caches are warm, so the first real request does not pay a cold-lookup penalty on top of a cold application startup. Cached answers expire with their TTL, so run the warm-up shortly before traffic shifts.
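
A warm-up step of this kind can be a few lines in a startup or readiness script. The sketch below resolves each hostname once through the system resolver using socket.getaddrinfo and records how long it took; the prewarm function, the port 443 default, and the hostnames are assumptions for illustration. Whether this actually warms a cache depends on the host having a caching resolver in the path.

```python
import socket
import time
from typing import Callable, Dict, Iterable, Optional


def _system_resolve(host: str) -> object:
    # Goes through the OS stub resolver, the same path most applications use.
    return socket.getaddrinfo(host, 443)


def prewarm(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = _system_resolve,
) -> Dict[str, Optional[float]]:
    """Resolve each hostname once; return seconds taken, or None on failure."""
    timings: Dict[str, Optional[float]] = {}
    for host in hostnames:
        start = time.perf_counter()
        try:
            resolve(host)
            timings[host] = time.perf_counter() - start
        except OSError:  # socket.gaierror is a subclass of OSError
            timings[host] = None
    return timings


if __name__ == "__main__":
    # Hypothetical dependencies of the service being started.
    for host, seconds in prewarm(["api.example.com", "db.example.com"]).items():
        status = "failed" if seconds is None else f"{seconds * 1000:.1f} ms"
        print(f"{host}: {status}")
```

A failed lookup at this stage is also a useful early signal that the new region's resolver configuration is wrong, before any user traffic depends on it.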

Summary

DNS resolution is a six-step delegated lookup that normally completes within a few milliseconds on a warm cache but can add 50–150 ms on a fully cold path. TTL is the primary tuning parameter: lower it before changes, raise it for stability. At scale, DNS becomes a first-class traffic-engineering tool: GeoDNS, weighted routing, and health-check failover all operate at this layer. Understanding the caching hierarchy and propagation delays is essential whenever you design a migration, a multi-region deployment, or a zero-downtime release strategy.