Fault Tolerance

Fault tolerance is the ability of a system to continue operating correctly in the presence of failures of some of its components. A fault-tolerant system detects failures, isolates them, and recovers without losing correctness or availability.

Terminology

Fault: the underlying cause (a hardware defect, a software bug, a dropped packet).

Error: the manifestation of a fault in the system’s state.

Failure: when a system component deviates from its specified behavior, resulting in incorrect output or unavailability.

Fault tolerance: the system provides correct service despite faults.

Resilience: the system’s ability to adapt to and recover from faults quickly.

Types of Faults

Hardware faults: disk crashes, RAM bit flips, CPU failures, power outages, NIC failures. Rate: a typical disk has a 1-5% annual failure rate; in a 10,000-disk cluster, 27-137 disks fail per day.

Software faults: bugs that crash a process, memory leaks, infinite loops, off-by-one errors, incorrect assumptions about dependencies.

Human errors: misconfiguration is the leading cause of outages. Incorrect firewall rules, wrong database credentials, accidentally deleting data.

Network faults: packet loss, latency spikes, partitions (one part of the network cannot communicate with another), asymmetric partitions (A can reach B but B cannot reach A).

Failure Detectors

See System Models for formal treatment.

Heartbeats: process $A$ sends periodic pings to $B$; if $B$ does not reply within a timeout, $A$ suspects $B$ has failed.

Phi Accrual Failure Detector (Cassandra, Akka): instead of a binary yes/no, computes a suspicion level $\phi$ based on observed heartbeat intervals. The application sets a threshold for when to treat the process as failed.

Adaptive timeouts: measure round-trip times; set timeout as a multiple (e.g., $2\times$) of the 99th-percentile RTT.

Redundancy

The fundamental technique for fault tolerance: have multiple copies so that failing one does not cause system failure.

Hardware redundancy: RAID for disks; redundant power supplies; ECC memory; dual NICs.

Information redundancy: error-correcting codes (ECC, LDPC, Reed-Solomon) for detecting and correcting data corruption.

Time redundancy: retry operations; transactions with rollback on failure.

Process redundancy: run multiple instances; if one fails, another takes over.

Checkpointing and Recovery

Checkpoint: save the state of a process or system to stable storage at regular intervals. On failure, restart from the last checkpoint rather than from the beginning.

Recovery log: record every state change. On recovery, replay the log from the last checkpoint.

Message logging: record all messages received. On recovery, replay messages to restore state after the checkpoint.

Circuit Breaker Pattern

Prevent a failing service from being repeatedly called when it is clearly unavailable.

States:

Closed (normal): requests pass through; failure rate monitored.
Open: failure rate exceeded threshold; requests fail immediately without attempting the call.
Half-open: after a timeout, allow a limited number of test requests; if they succeed, close; if they fail, open again.

Benefits: fail fast; prevent cascading failures; allow failing services time to recover.

Implementations: Hystrix (Netflix, now maintenance mode), Resilience4j, Envoy (circuit breaking in service mesh).

Bulkhead Pattern

Isolate components so that a failure in one does not consume all resources and cascade.

Analogy: ship bulkheads divide the hull into watertight compartments; one leak does not sink the ship.

Implementation: use separate thread pools, connection pools, or process groups for different operations. If one pool is exhausted, others are unaffected.

Timeouts and Retries

Timeouts: every remote call must have a timeout. Without it, a slow server can cause the client to block indefinitely, consuming threads or connections.

Retries: on transient failure (network glitch, server overloaded), retry the request. Only safe for idempotent operations.

Exponential backoff with jitter: do not retry immediately. Wait $1 \text{ s}, 2 \text{ s}, 4 \text{ s}, \ldots, \text{max}$. Add random jitter to avoid the thundering herd problem (all clients retry at the same time).

delay = min(cap, base * 2^attempt) * random(0.5, 1.0)

Chaos Engineering

Deliberately inject failures into a production system to verify fault tolerance.

Netflix Chaos Monkey: randomly terminates EC2 instances. Forces engineers to design services that tolerate instance failures.

Chaos Kong: terminates an entire AWS region.

Principles: define steady state; hypothesize it holds under failure; run experiments; verify or disprove.

Tools: Chaos Monkey, Gremlin, AWS Fault Injection Service, Chaos Mesh (Kubernetes).

Distributed Coordination Services

Fault-tolerant coordination primitives needed by distributed applications.

Apache Zookeeper: strongly consistent (linearizable) key-value store. Provides: distributed locks, leader election, service discovery, configuration management. Uses Zab (Zookeeper Atomic Broadcast), a Paxos variant.

etcd: consensus-based key-value store using Raft. Used by Kubernetes for cluster state. Provides: distributed locks, leader election, watch (notification on key changes).

Use cases: naming services, configuration distribution, distributed locks, group membership.

Recovery Strategies

Fail-stop: stop the process on fault; let another replica take over. Simple; requires external management.

Restart in place: the failed process restarts from a saved checkpoint on the same node. Works if the fault was transient (e.g., OOM killed due to temporary spike).

Failover: promote a replica to primary. Requires replication lag to be minimal (synchronous replication or fast catchup).

Active-active: multiple nodes handle requests simultaneously; a failure of one is immediately absorbed by the others. Highest availability; requires conflict handling for writes.