DNS & How Names Resolve
Every time a user types https://api.example.com into their browser or an inter-service call hits a hostname, a small but critical piece of infrastructure springs into action: the Domain Name System (DNS). DNS is the internet's distributed phonebook — it translates human-readable names into the IP addresses that routers and servers actually understand. For a system designer, DNS is not just plumbing; it is a lever for routing traffic, enforcing failover, load-balancing globally, and shaping latency.
The DNS Hierarchy
DNS is a hierarchical, delegated namespace organized as a tree. At the root sits the root zone (the single dot .), maintained by ICANN. Below it are the Top-Level Domains (TLDs) — .com, .org, .io, country codes like .uk. Each TLD delegates authority for second-level domains to authoritative name servers operated by registrars or domain owners. For example, the owner of example.com runs (or points to) authoritative servers that know the real IP for every subdomain under that domain.
There is no single server that knows everything. The genius of DNS is that every level only knows who to ask next — a model called iterative resolution.
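To make the tree concrete, the following minimal sketch lists the zones a resolver passes through for a given name, from the root down. The helper name `delegation_chain` is hypothetical and only illustrates the label structure; it does not contact any server.

```python
def delegation_chain(hostname: str) -> list[str]:
    """Return the zones visited from the root down to the full name."""
    labels = [label for label in hostname.rstrip(".").split(".") if label]
    chain = ["."]  # the root zone
    # Build suffixes right to left: com. -> example.com. -> api.example.com.
    for i in range(len(labels) - 1, -1, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain


print(delegation_chain("api.example.com"))
# ['.', 'com.', 'example.com.', 'api.example.com.']
```

Each entry is a point where authority may be delegated, which is why no single server needs the full picture.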
The Full Resolution Walk-Through
When a client needs to resolve api.example.com, the following chain fires:
- Browser / OS cache: the OS checks its local DNS cache (and the browser's own cache). If a fresh record exists, the answer is returned immediately, with zero network hops.
- Stub resolver → Recursive resolver: if there is no cache hit, the OS stub resolver queries the recursive resolver configured on the machine (e.g. 8.8.8.8 for Google Public DNS or the ISP's server). The recursive resolver does all the hard work on behalf of the client.
- Recursive resolver → Root servers: the recursive resolver asks one of the 13 logical root server clusters: "Who handles .com?" The root replies with the address of the .com TLD name servers.
- Recursive resolver → TLD name servers: the resolver asks the .com TLD servers: "Who is authoritative for example.com?" They respond with the authoritative name server (NS) records for example.com.
- Recursive resolver → Authoritative name server: the resolver queries ns1.example.com: "What is the IP for api.example.com?" The authoritative server returns the A record (IPv4) or AAAA record (IPv6).
- Answer returned to client: the recursive resolver caches the result and forwards the IP back to the stub resolver, which hands it to the application. The browser can now open a TCP connection.
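The referral steps above can be sketched with an in-memory stand-in for the root, TLD, and authoritative servers. Everything here is an assumption made for illustration: the `SERVERS` table, the server labels, and the address 192.0.2.10 (a documentation-only range) are hypothetical, and no real DNS traffic is sent.

```python
# Hypothetical delegation data: each simulated server either answers
# directly or refers the resolver to the server for a more specific zone.
SERVERS = {
    "root": {"referrals": {"com.": "com-tld"}, "answers": {}},
    "com-tld": {"referrals": {"example.com.": "ns1.example.com"}, "answers": {}},
    "ns1.example.com": {
        "referrals": {},
        "answers": {"api.example.com.": "192.0.2.10"},
    },
}


def resolve_iteratively(name, start="root", max_referrals=10):
    """Follow referrals from the root until some server returns an answer.

    Returns (ip, path), where path lists the servers that were queried.
    """
    fqdn = name.rstrip(".") + "."
    server, path = start, []
    for _ in range(max_referrals):
        path.append(server)
        zone_data = SERVERS[server]
        if fqdn in zone_data["answers"]:
            return zone_data["answers"][fqdn], path
        # Choose the referral whose zone is the longest suffix of the name.
        matches = [
            zone for zone in zone_data["referrals"]
            if fqdn == zone or fqdn.endswith("." + zone)
        ]
        if not matches:
            raise LookupError(f"NXDOMAIN: {fqdn}")
        server = zone_data["referrals"][max(matches, key=len)]
    raise LookupError("too many referrals")


ip, path = resolve_iteratively("api.example.com")
print(ip, path)
```

The returned path shows three queries (root, TLD, authoritative), matching the three upstream hops a recursive resolver makes on a cold cache.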
In practice, the recursive resolver caches answers at every step. On a warm cache, only step 2 (client → recursive resolver) fires — a round-trip of a few milliseconds. A fully cold lookup through all four tiers adds 50–150 ms of latency on a typical internet connection, which is why caching is so important.
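A rough way to see why the cache matters is to weight the warm and cold latencies by the cache hit ratio. In the sketch below, the 95 % hit ratio, the 3 ms warm figure, and the 100 ms cold figure are assumptions chosen from within the ranges quoted above, not measurements.

```python
def expected_lookup_ms(hit_ratio: float, warm_ms: float, cold_ms: float) -> float:
    """Average lookup latency for a given cache hit ratio (0.0 to 1.0)."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * warm_ms + (1.0 - hit_ratio) * cold_ms


# Assumed inputs: 95 % of lookups hit the cache, 3 ms warm, 100 ms cold.
average = expected_lookup_ms(0.95, warm_ms=3, cold_ms=100)
print(f"{average:.2f} ms")
```

Under these assumptions the average is roughly 7.85 ms, and most of it comes from the 5 % of lookups that miss the cache.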
Key DNS Record Types
- A — maps a hostname to an IPv4 address. The most common record type.
- AAAA — maps a hostname to an IPv6 address.
- CNAME — canonical name; aliases one hostname to another. Cannot be used at the zone apex (root domain).
- NS — name server; delegates a zone to specific servers.
- MX — mail exchange; directs email traffic to mail servers with priority ordering.
- TXT — arbitrary text; used for SPF, DKIM, domain verification, and service discovery.
- SRV — service locator; specifies host, port, and priority for a specific protocol/service. Used heavily by Kubernetes and service meshes.
- ALIAS / ANAME — a non-standard, provider-specific record that behaves like a CNAME but can be used at the zone apex. AWS Route 53 calls this an Alias record.
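The CNAME restriction can be checked mechanically. The sketch below uses a hypothetical `Record` type and `validate_zone` helper (neither belongs to any real DNS library) to flag a CNAME at the zone apex. It also flags a CNAME that shares its name with other record types, which is the underlying rule in RFC 1034 and the reason the apex, which always carries SOA and NS records, cannot hold one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    name: str    # owner name, e.g. "api.example.com."
    rtype: str   # "A", "AAAA", "CNAME", "NS", ...
    value: str
    ttl: int = 300


def validate_zone(apex: str, records: list[Record]) -> list[str]:
    """Return a list of human-readable problems found in the records."""
    problems = []
    for rec in records:
        if rec.rtype == "CNAME" and rec.name == apex:
            problems.append(f"CNAME not allowed at zone apex {apex}")
    # A name that owns a CNAME must not own any other record type.
    cname_names = {rec.name for rec in records if rec.rtype == "CNAME"}
    for rec in records:
        if rec.rtype != "CNAME" and rec.name in cname_names:
            problems.append(f"{rec.name} has both CNAME and {rec.rtype}")
    return problems


zone = [
    Record("example.com.", "CNAME", "lb.example.net."),   # invalid: apex
    Record("www.example.com.", "CNAME", "example.com."),  # fine
    Record("api.example.com.", "A", "192.0.2.10"),        # fine
]
for problem in validate_zone("example.com.", zone):
    print(problem)
```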
TTL and Caching
Every DNS record carries a Time-To-Live (TTL) — the number of seconds a resolver is allowed to cache that answer. TTL is the most important tuning knob system designers control:
- Low TTL (30–300 s): changes propagate quickly. Use this during migrations, blue/green deploys, or before a planned failover. The cost is more DNS queries reaching your authoritative servers.
- High TTL (3,600–86,400 s): reduces recursive lookups and latency. Safe for stable records that rarely change. Cloudflare and AWS Route 53 recommend 300 s as a sensible default for most production records.
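How a cache honors TTL can be sketched in a few lines. The `TTLCache` class below is a hypothetical illustration, not the implementation of any real resolver; the injectable `clock` parameter is an assumption added so the expiry behavior can be exercised without waiting.

```python
import time


class TTLCache:
    """Minimal name -> value cache that drops entries once their TTL elapses."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl):
        """Store a value that stays valid for ttl seconds from now."""
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        """Return the cached value, or None if it is missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: force a fresh lookup
            return None
        return value


# Usage with a fake clock so the expiry is visible immediately.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("api.example.com", "192.0.2.10", ttl=300)
now[0] = 299.0
print(cache.get("api.example.com"))  # still cached
now[0] = 300.0
print(cache.get("api.example.com"))  # None: the TTL has elapsed
```

With a 300 s TTL, a changed record can keep being served from such a cache for up to five minutes, which is the propagation delay the TTL choice controls.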
DNS Caching Layers
Caching happens at multiple levels, each with its own TTL horizon:
- Browser cache: Chrome caches DNS for 60 s; Firefox for 60 s; Safari varies. Entirely outside your control.
- OS resolver cache: nscd, systemd-resolved, or macOS's mDNSResponder maintains a per-machine cache. Controlled by the OS.
- Recursive resolver cache: the ISP or public resolver (Google, Cloudflare) caches answers for all their users. Records with a 300 s TTL get cached across millions of clients here.
- Authoritative server copy: when an authoritative server acts as a secondary (formerly called a slave), it holds a full copy of the zone obtained by zone transfer from the primary and refreshes it on the schedule set in the zone's SOA record, rather than caching individual answers.
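Negative answers are cached too. When a name does not exist, resolvers cache the NXDOMAIN response for the lesser of the zone's SOA record TTL and its MINIMUM field (RFC 2308). If you deploy a new subdomain and it does not appear, a resolver may have cached the NXDOMAIN from an earlier lookup, for up to 5 minutes when that value is 300 s.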
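The negative-caching rule from RFC 2308 can be sketched as a one-line calculation: the NXDOMAIN lifetime is the smaller of the SOA record's own TTL and its MINIMUM field. The function name and the optional `resolver_cap` parameter are assumptions for illustration; real resolvers may apply their own upper bound, and its value varies by implementation.

```python
from typing import Optional


def negative_cache_ttl(soa_ttl: int, soa_minimum: int,
                       resolver_cap: Optional[int] = None) -> int:
    """Seconds an NXDOMAIN answer may be cached, per the RFC 2308 rule."""
    ttl = min(soa_ttl, soa_minimum)
    if resolver_cap is not None:
        # Hypothetical resolver-side limit on negative answers.
        ttl = min(ttl, resolver_cap)
    return ttl


# Assumed zone settings: SOA TTL of one hour, MINIMUM of 300 s.
print(negative_cache_ttl(soa_ttl=3600, soa_minimum=300))  # 300
```

With these assumed values, a name queried just before it was created can stay invisible to that resolver's users for five minutes.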
DNS in System Design Decisions
DNS is not just a lookup tool — it is an active traffic-management layer:
- GeoDNS / Latency-based routing: authoritative servers can return different IPs based on the resolver's location. AWS Route 53's latency-based routing works this way, directing users to a nearby region automatically. Cloudflare's Anycast network reaches a similar result by a different mechanism: the same IP is announced from many locations, and network routing delivers each user to the closest one.
- Weighted routing: return IP-A 90 % of the time and IP-B 10 % to do a gradual canary release at the DNS level.
- Health-check failover: Route 53, Cloudflare, and NS1 all support active health checks. If the primary endpoint fails, the authoritative server automatically stops returning that IP, giving DNS-level failover without touching code.
- Service discovery in microservices: Kubernetes uses an internal DNS server (CoreDNS) to resolve service names like payments-service.default.svc.cluster.local to virtual IPs. Each pod gets DNS-based service discovery for free.
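Weighted routing and health-check failover can be combined in a short sketch of what an authoritative server might do when choosing an answer. The function `pick_endpoint`, the weights, and the addresses are hypothetical; managed DNS services expose this behavior through their own configuration rather than through code like this.

```python
import random


def pick_endpoint(weights: dict[str, int], healthy: set[str], rng=random) -> str:
    """Pick one IP at random, proportionally to weight, among healthy ones."""
    candidates = {ip: w for ip, w in weights.items() if ip in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy endpoints")
    ips = list(candidates)
    return rng.choices(ips, weights=[candidates[ip] for ip in ips], k=1)[0]


# Assumed canary split: 90 % to the stable pool, 10 % to the canary.
weights = {"192.0.2.10": 90, "192.0.2.20": 10}

# Both endpoints healthy: answers follow the 90/10 split on average.
print(pick_endpoint(weights, healthy={"192.0.2.10", "192.0.2.20"}))

# Primary fails its health check: only the remaining endpoint is returned.
print(pick_endpoint(weights, healthy={"192.0.2.20"}))
```

Because resolvers cache each answer for its TTL, the split observed by end users only approximates the configured weights, and failover takes effect as cached answers expire.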
Common Pitfalls for System Designers
- Hardcoding IPs: services that bypass DNS and hardcode IP addresses lose all the failover, routing, and load-balancing benefits DNS provides. Always resolve by hostname.
- Ignoring TTL before migrations: as covered above, lower the TTL before a planned IP change. Do it at least one full old-TTL period in advance, because resolvers that already cached the record keep it until the old TTL expires.
- Single point of failure in NS records: always configure at least two authoritative name servers in different networks. RFC 1034 requires at least two per zone; some managed providers, such as AWS Route 53, assign four.
- Not accounting for DNS latency in SLA calculations: a cold DNS lookup adds 50–150 ms to the first connection. Connection pools and keep-alives eliminate repeat lookups, but first-request latency matters for SLAs.
- CNAME chains: a CNAME pointing to another CNAME pointing to another can cost an extra lookup per hop. Some resolvers impose a limit (typically 8 hops). Keep chains short.
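The CNAME-chain pitfall can be illustrated with a small follower that counts hops, stops at a limit, and detects loops. The function name, the record tables, and the default of 8 hops (taken from the figure quoted above, not from a standard) are assumptions for this sketch.

```python
def follow_cnames(name, cnames, addresses, max_hops=8):
    """Follow CNAME aliases until an address is found.

    Returns (ip, hops). Raises LookupError on loops, overly long
    chains, or a chain that ends at a name with no address.
    """
    hops = 0
    seen = {name}
    while name in cnames:
        if hops >= max_hops:
            raise LookupError("CNAME chain too long")
        name = cnames[name]
        if name in seen:
            raise LookupError("CNAME loop detected")
        seen.add(name)
        hops += 1
    ip = addresses.get(name)
    if ip is None:
        raise LookupError(f"no address for {name}")
    return ip, hops


# Hypothetical records: www -> app -> load balancer.
cnames = {
    "www.example.com.": "app.example.com.",
    "app.example.com.": "lb.example.net.",
}
addresses = {"lb.example.net.": "192.0.2.20"}
print(follow_cnames("www.example.com.", cnames, addresses))
# ('192.0.2.20', 2)
```

Each hop reported here is a potential extra lookup on a cold cache, so pointing `www` directly at the final target would remove one.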
Summary
DNS resolution is a six-step delegated lookup that normally completes in a few milliseconds on a warm cache but can add 50–150 ms on a fully cold path. TTL is the primary tuning parameter: lower it before changes, raise it for stability. At scale, DNS becomes a first-class traffic-engineering tool: GeoDNS, weighted routing, and health-check failover all operate at this layer. Understanding the caching hierarchy and propagation delays is essential whenever you design a migration, a multi-region deployment, or a zero-downtime release strategy.