Reliability, Availability & Resilience

Timeouts, Retries & Backoff

Every distributed system makes remote calls — to a database, a downstream microservice, a third-party API. Any of those calls can hang, fail, or return slowly. Without deliberate guardrails, a single slow dependency cascades into a full-system outage. Timeouts, retries, and exponential backoff are the three foundational tools that keep a remote call from becoming a crisis.

Why Timeouts Are Non-Negotiable

A timeout is a deadline: if the remote peer has not responded within N milliseconds, abort the call and free the thread, connection, and memory that were waiting. Without timeouts:

Threads pile up waiting for a response that may never come.
Connection pools exhaust, and new requests queue or fail immediately.
A single slow dependency drags every service that calls it into the same slowdown — the classic cascading failure.

There are two distinct timeout values to set for every remote call:

Connect timeout: how long to wait for the TCP handshake to complete (typically 1–5 s). A handshake that takes longer than this usually means the host is unreachable or too overloaded to accept connections, so the request would not have been processed in time anyway.
Read (request) timeout: how long to wait for the full response after the connection is open. This must reflect the expected worst-case processing time of the operation, not just network latency.
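
A minimal Python sketch of the two timeouts, using only the standard library's socket module. The helper name fetch_with_timeouts, the raw-bytes payload, and the default values are illustrative assumptions; a real client would normally set the equivalent options on its HTTP or database library.

```python
import socket


def fetch_with_timeouts(host, port, payload, connect_timeout=3.0, read_timeout=2.0):
    """Send payload and return the first chunk of the reply.

    Hypothetical helper for illustration. Raises socket.timeout (an alias of
    TimeoutError on Python 3.10+) if either limit is exceeded.
    """
    # Connect timeout: bounds the TCP handshake only.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # Read timeout: applies to each blocking send/recv on the socket,
        # not to the response as a whole.
        sock.settimeout(read_timeout)
        sock.sendall(payload)
        return sock.recv(4096)
    finally:
        # Always release the connection, even when the call times out.
        sock.close()
```

Note that a per-operation socket timeout does not bound the total time of a response that trickles in slowly; an overall deadline has to be enforced separately.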

Rule of thumb: Set your read timeout to roughly 2–3× the p99 latency of the downstream service under normal load — tight enough to fail fast, but loose enough not to trigger false timeouts during legitimate spikes.
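
The rule of thumb can be turned into a small calculation. The sketch below assumes you already collect latency samples for the dependency; the function name suggest_read_timeout_ms and the 2.5 multiplier are illustrative choices within the 2–3× range, not a standard.

```python
import statistics


def suggest_read_timeout_ms(latencies_ms, multiplier=2.5):
    """Return multiplier x p99 of the observed latencies, in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    p99 = statistics.quantiles(latencies_ms, n=100)[-1]
    return multiplier * p99
```

For a dependency whose p99 is around 120 ms, this suggests a read timeout of roughly 300 ms. The samples should come from normal load, since measuring during an incident would inflate the result.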

Reference points often cited from large production systems span a wide range: a 5-second deadline for Google's internal RPC framework, a 2-second connect / 10-second read timeout for DynamoDB calls through the AWS SDK, and 80 seconds for Stripe payment API calls (because some operations are truly slow). Treat these as illustrative, since defaults differ between SDK languages and versions. There is no universal number: each dependency needs its own tuned value.

Retries: When and How

A timeout tells you the call failed. A retry tries again. But naive retries cause serious problems:

Retry storms: when a downstream service is slow or returning errors, hundreds of clients all retry simultaneously, multiplying load exactly when the service can least handle it.
Non-idempotent operations: retrying a payment or order-creation without idempotency keys causes duplicate charges or orders.

Retries make sense only for transient failures — network blips, momentary overload (HTTP 429 or 503), brief unavailability. They are counterproductive for:

Client errors such as 400 Bad Request or 404 Not Found — no amount of retrying will change a malformed request.
Persistent server errors that indicate the downstream is fundamentally broken.

Best practice: Limit retries to 2–3 attempts maximum in most cases. Retry only on known-safe error classes (network errors, HTTP 429, HTTP 503). For non-idempotent writes, use a client-generated idempotency key (a UUID sent with every attempt) so the server can detect and deduplicate retries.
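
The retry rules above can be sketched as a small loop. This is a simplified illustration: send is a hypothetical zero-argument callable that returns an HTTP status code or raises on a network failure, and the wait between attempts is deliberately left out.

```python
RETRYABLE_STATUS = {429, 503}  # momentary overload / brief unavailability


def is_retryable(status):
    """True only for the transient HTTP statuses named in the text."""
    return status in RETRYABLE_STATUS


def call_with_retries(send, max_attempts=3):
    """Call send() up to max_attempts times and return the last status seen.

    send is a hypothetical callable returning an HTTP status code, or raising
    ConnectionError / TimeoutError on a network failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        final_attempt = attempt == max_attempts
        try:
            status = send()
        except (ConnectionError, TimeoutError):
            if final_attempt:
                raise  # out of attempts: surface the network error
            continue
        if is_retryable(status) and not final_attempt:
            continue  # transient failure: try again
        # Success, a non-retryable error such as 400/404, or the last attempt.
        return status
```

A 404 returns after a single call, while a 503 is attempted at most three times before the caller sees it.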

Exponential Backoff with Jitter

Exponential backoff means that the wait time between retries grows exponentially: attempt 1 waits 1 s, attempt 2 waits 2 s, attempt 3 waits 4 s, and so on. This gives the downstream service progressively more time to recover before being hit again.

But pure exponential backoff has a hidden trap: if thousands of clients all hit the same error at the same time, they all back off by the same amounts and then all retry simultaneously — a thundering herd. The fix is jitter: add a random component to each backoff interval so retries spread out across time.
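
The thundering-herd effect is easy to see in a small simulation. The sketch below is a toy model with hypothetical names (retry_times, peak_load): every client fails at time zero, and we count how many retries land in the busiest 100 ms slot with and without a random component.

```python
import random
from collections import Counter


def retry_times(num_clients, retries, base_s, jitter, rng):
    """Times (seconds after the shared failure) at which retries are sent."""
    times = []
    for _ in range(num_clients):
        t = 0.0
        for attempt in range(retries):
            delay = base_s * 2 ** attempt
            if jitter:
                # Random component: wait anywhere between 0 and the full delay.
                delay = rng.uniform(0, delay)
            t += delay
            times.append(t)
    return times


def peak_load(times, bucket_s=0.1):
    """Largest number of retries that fall into a single time slot."""
    counts = Counter(int(t / bucket_s) for t in times)
    return max(counts.values())


rng = random.Random(42)
plain = peak_load(retry_times(1000, 3, 1.0, jitter=False, rng=rng))
spread = peak_load(retry_times(1000, 3, 1.0, jitter=True, rng=rng))
print(f"peak retries per 100 ms slot: no jitter={plain}, with jitter={spread}")
```

Without jitter, all 1000 clients retry in the same slot every round; with jitter the peak is a fraction of that.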

A common formula (AWS recommends this):

wait = random(0, min(cap, base * 2^attempt))

// Example values:
//   base = 100ms,  cap = 30s,  attempt = 0, 1, 2, 3 ...
// attempt 0: random(0, min(30000, 100))        =  0–100 ms
// attempt 1: random(0, min(30000, 200))        =  0–200 ms
// attempt 2: random(0, min(30000, 400))        =  0–400 ms
// attempt 3: random(0, min(30000, 800))        =  0–800 ms
// attempt 6: random(0, min(30000, 6400))       =  0–6.4 s
// attempt 9: random(0, min(30000, 51200→cap))  =  0–30 s  (capped)

The cap prevents the wait from growing without bound. Drawing the wait uniformly between 0 and the capped exponential value spreads retries across the whole window instead of aligning them. This pattern is called full jitter. There are variants such as equal jitter, which keeps half of the exponential value fixed and randomizes only the other half (equivalent to multiplying by random(0.5, 1.0)), and decorrelated jitter, but full jitter is the most widely used.

Two failed attempts with growing backoff + jitter windows, followed by a successful third attempt.
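
A minimal Python sketch of two jitter variants as they are usually defined: full jitter draws the wait uniformly between zero and the capped exponential value, while equal jitter keeps half of that value fixed and randomizes the rest. The function names and the default base and cap (100 ms and 30 s, in seconds) are illustrative.

```python
import random


def full_jitter_delay(attempt, base_s=0.1, cap_s=30.0, rng=random):
    """Seconds to wait after the given zero-based attempt (full jitter)."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return rng.uniform(0, ceiling)


def equal_jitter_delay(attempt, base_s=0.1, cap_s=30.0, rng=random):
    """Variant that keeps half the delay fixed and randomizes the other half."""
    ceiling = min(cap_s, base_s * 2 ** attempt)
    return ceiling / 2 + rng.uniform(0, ceiling / 2)
```

Passing a seeded random.Random instance as rng makes the delays reproducible in tests. Full jitter can produce waits close to zero; equal jitter guarantees a minimum wait at the cost of a narrower spread.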

Total Timeout Budgets and Deadline Propagation

When Service A calls Service B, which calls Service C, each hop has its own timeout. Suppose A gives B 500 ms, B spends 200 ms on its own processing before calling C, and B gives C 500 ms. C then has only 300 ms before A gives up, but its timeout is still 500 ms, so C can keep working for up to 200 ms after A has already returned an error to the user. This wastes resources and causes orphaned work.

The solution is deadline propagation (also called a request budget): the original caller sets a wall-clock deadline and passes it through every hop as request metadata (e.g., the grpc-timeout header in gRPC, which carries the time remaining rather than an absolute timestamp, or a custom X-Request-Deadline header in HTTP). Each downstream service checks whether any budget remains before starting work, and aborts immediately if the deadline has already passed.

Deadline propagation: a single wall-clock deadline is passed through every service hop, shrinking the remaining budget at each stage.
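
Deadline propagation over HTTP can be sketched in a few lines. This assumes the custom X-Request-Deadline header carries an absolute Unix timestamp in seconds; that encoding and the helper names are assumptions made for illustration, not a standard.

```python
import time

DEADLINE_HEADER = "X-Request-Deadline"  # custom header, not a standard


class DeadlineExceeded(Exception):
    """The caller's deadline passed before this hop could start work."""


def start_request(budget_s, now=None):
    """Edge service: turn a time budget into an absolute deadline header."""
    now = time.time() if now is None else now
    return {DEADLINE_HEADER: f"{now + budget_s:.3f}"}


def remaining_budget(headers, now=None):
    """Seconds left before the propagated deadline; raises if none remain."""
    # Comparing wall-clock timestamps across hosts assumes their clocks are
    # reasonably well synchronized.
    now = time.time() if now is None else now
    remaining = float(headers[DEADLINE_HEADER]) - now
    if remaining <= 0:
        raise DeadlineExceeded(f"deadline passed {-remaining:.3f}s ago")
    return remaining


def timeout_for_next_hop(headers, local_timeout_s, now=None):
    """Never give a downstream call more time than the caller has left."""
    return min(local_timeout_s, remaining_budget(headers, now))
```

With a 500 ms budget and 200 ms already spent, timeout_for_next_hop returns about 0.3 s even when the locally configured timeout is 0.5 s, so the downstream call stops when the original caller does.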

Idempotency: The Safety Net for Retries

Retrying a GET is safe as long as the endpoint honors HTTP semantics: a read changes no state, so repeating it does no harm. Retrying a POST /payments without safeguards charges the customer twice. Idempotency is the property that performing an operation multiple times has the same effect as performing it once.

The standard pattern: the client generates a UUID per logical operation (not per attempt) and sends it as a header or body field (Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000). The server stores the result keyed by that UUID for a short window (typically 24 hours). On any retry with the same key, the server returns the stored result without re-executing the operation.
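
The pattern above can be sketched with an in-memory store. PaymentServer, create_payment, and new_idempotency_key are hypothetical names; a production server would need durable storage and an atomic check-and-store so that two concurrent attempts cannot both execute.

```python
import time
import uuid


class PaymentServer:
    """Hypothetical in-memory server that deduplicates by idempotency key."""

    def __init__(self, ttl_s=24 * 60 * 60, clock=time.time):
        self._results = {}  # key -> (stored_at, result)
        self._ttl_s = ttl_s
        self._clock = clock
        self.charges = []  # side effects that were actually executed

    def create_payment(self, idempotency_key, amount_cents):
        now = self._clock()
        cached = self._results.get(idempotency_key)
        if cached is not None and now - cached[0] < self._ttl_s:
            # Retry of a known operation: replay the stored result.
            return cached[1]
        self.charges.append(amount_cents)  # the real side effect
        result = {"payment_id": len(self.charges), "amount_cents": amount_cents}
        self._results[idempotency_key] = (now, result)
        return result


def new_idempotency_key():
    """One key per logical operation, reused across all of its attempts."""
    return str(uuid.uuid4())
```

A client calls new_idempotency_key() once before the first attempt and sends the same value on every retry; generating a fresh key per attempt would defeat the deduplication.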

Pitfall — retrying non-idempotent calls without keys: A payment or order microservice that does not implement idempotency keys will silently duplicate operations on retry. This is one of the most expensive bugs in production systems. Always implement idempotency keys for any state-changing remote call that may be retried.

Putting It Together: A Resilient Call Pattern

A production-grade remote call combines the mechanisms covered in this lesson:

Set a connect timeout (1–5 s) and a read timeout based on the downstream p99.
On timeout or retryable error: wait using exponential backoff + full jitter.
Limit retries to 2–3 attempts, then give up and return an error to the caller.
Propagate the deadline in every downstream call so no hop wastes resources after the user-facing timeout has already fired.
Use idempotency keys on all non-safe, non-idempotent operations.
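
The steps above can be combined into one helper. This is a sketch under stated assumptions: send is a hypothetical callable that takes an idempotency key and a timeout in seconds and returns an HTTP status code (or raises on a network failure), and the remaining budget is passed down as that timeout rather than as a header.

```python
import random
import time
import uuid

RETRYABLE_STATUS = {429, 503}


class DeadlineExceeded(Exception):
    """The overall request budget was used up."""


class RetriesExhausted(Exception):
    """Every attempt ended in a retryable failure."""


def resilient_call(send, budget_s, read_timeout_s=2.0, max_attempts=3,
                   base_s=0.1, cap_s=30.0, sleep=time.sleep,
                   clock=time.monotonic, rng=random):
    """Call send(idempotency_key, timeout_s) with timeouts, retries, backoff."""
    deadline = clock() + budget_s
    key = str(uuid.uuid4())  # one idempotency key shared by every attempt
    for attempt in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            raise DeadlineExceeded(f"budget spent before attempt {attempt + 1}")
        status = None
        try:
            # Never wait longer than what is left of the overall budget.
            status = send(key, min(read_timeout_s, remaining))
        except (ConnectionError, TimeoutError):
            pass  # network failure: treated as retryable
        if status is not None and status not in RETRYABLE_STATUS:
            return status  # success or a non-retryable error
        if attempt == max_attempts - 1:
            break
        # Exponential backoff with full jitter, clipped to the remaining budget.
        delay = rng.uniform(0, min(cap_s, base_s * 2 ** attempt))
        sleep(min(delay, max(0.0, deadline - clock())))
    raise RetriesExhausted(f"gave up after {max_attempts} attempts")
```

The sleep, clock, and rng parameters exist so the helper can be tested without real waiting; in production code the defaults would be used.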

These five steps prevent the most common classes of distributed-system failures at the call level. The next lesson layers on circuit breakers and bulkheads — which operate at a higher level, deciding whether to attempt a call at all based on recent failure history.