Distributed Systems
Definition
A distributed system is a collection of independent computers that appears to its users as a single coherent system. The computers communicate by message passing over a network.
Goals
- Resource sharing: Access remote hardware/software (files, CPUs, printers)
- Transparency: Hide distribution from users (location, migration, replication)
- Openness: Standard interfaces allowing heterogeneous systems to interoperate
- Scalability: Handle growth in users, data, and geography
- Fault tolerance: Continue operating despite component failures
Challenges
No shared clock: Nodes have independent clocks that drift. No global “now.”
No shared memory: All communication is via messages. Messages may be delayed, reordered, or lost.
Partial failures: Some nodes can fail while others continue. Detecting failures is itself unreliable.
Network partitions: Network can split, cutting communication between groups of nodes.
Fallacies of Distributed Computing (Peter Deutsch)
- The network is reliable
- Latency is zero
- Bandwidth is infinite
- The network is secure
- Topology doesn’t change
- There is one administrator
- Transport cost is zero
- The network is homogeneous
Clocks and Time
Physical Clocks
Hardware clocks drift. NTP (Network Time Protocol) synchronizes clocks over the network. Typical accuracy: 1-10ms over the internet, sub-millisecond on LAN.
Cristian’s algorithm: Client asks server for time, adjusts for round-trip delay.
Clock synchronization cannot achieve perfect accuracy; messages take variable time.
Logical Clocks
Lamport timestamps: Capture happens-before relationships.
- Each process maintains a counter
- Increment on each event
- Send counter with message
- On receive:
counter = max(local, received) + 1
If A → B (A happens before B), then L(A) < L(B).
But L(A) < L(B) does NOT mean A → B (only a necessary condition, not sufficient).
Vector clocks: Each process maintains a vector of counters (one per process).
- Precisely capture causal relationships
V(A) < V(B)iff A causally precedes B- Can detect concurrent events (neither happened before the other)
CAP Theorem (Brewer’s Theorem)
A distributed system can guarantee at most two of:
- Consistency: Every read sees the most recent write
- Availability: Every request receives a response (not an error)
- Partition tolerance: System continues operating despite network partitions
Since partitions are unavoidable in real networks, systems must choose CP or AP.
CP: Consistent but may reject requests during partition (etcd, HBase, ZooKeeper) AP: Always available but may return stale data during partition (Cassandra, DynamoDB, CouchDB)
PACELC (extension): Also considers the latency-consistency tradeoff even without partitions.
Consistency Models
From strongest to weakest:
| Model | Description |
|---|---|
| Linearizability | Operations appear atomic, as if on one machine |
| Sequential consistency | All processes see same order of operations |
| Causal consistency | Causally related operations seen in order |
| Eventual consistency | All replicas converge to the same value eventually |
Eventual consistency: If no new updates, all replicas eventually return the same value. Allows for high availability and performance. Used by DNS, Amazon S3, Cassandra.
Consensus
Agreement among distributed nodes on a single value despite failures.
Requirements:
- Agreement: All correct nodes decide the same value
- Validity: The decided value was proposed by some node
- Termination: All correct nodes eventually decide
FLP Impossibility: In an asynchronous system with even one crash-failure, consensus is impossible. (Fischer, Lynch, Paterson 1985)
Practical solutions relax the asynchrony assumption or use randomization.
Paxos
Classic consensus algorithm. Two phases:
Phase 1 (Prepare/Promise):
- Proposer sends Prepare(n) to majority
- Acceptors promise not to accept proposals < n; return highest accepted value
Phase 2 (Accept/Accepted):
- Proposer sends Accept(n, value) to majority
- Acceptors accept if n ≥ promised
If a majority of acceptors accept → value is chosen.
Paxos is correct but complex to understand and implement.
Raft
Consensus designed for understandability. Used in etcd, CockroachDB, TiKV.
Leader election: One leader per term. Leader handles all writes.
Log replication:
- Client sends command to leader
- Leader appends to log, sends AppendEntries to followers
- Once majority acknowledge, leader commits, applies to state machine
- Leader notifies followers, they commit
Properties:
- Leader has all committed entries
- Log matching: if two logs have same index and term, they are identical up to that point
- State machine safety: if a server applies an entry, no other server applies different entry for same index
ZooKeeper (ZAB protocol)
Used for distributed coordination (service discovery, configuration, leader election). Similar to Paxos.
Distributed Transactions
Two-Phase Commit (2PC)
Phase 1 (Prepare):
- Coordinator sends Prepare to all participants
- Each participant logs to durable storage, votes Yes or No
Phase 2 (Commit or Abort):
- If all voted Yes → Coordinator sends Commit
- If any voted No → Coordinator sends Abort
Problem: If coordinator fails after Prepare but before Commit → participants are blocked (hold locks) until coordinator recovers. 2PC is blocking.
Three-Phase Commit (3PC)
Adds a pre-commit phase to allow recovery without coordinator. Non-blocking but assumes no network partitions.
Saga Pattern
For microservices: break a long transaction into a sequence of local transactions, each with a compensating transaction for rollback. Eventually consistent.
Replication
Maintain copies of data on multiple nodes for fault tolerance and read scaling.
Primary-Backup (Single Leader)
One primary handles writes, backups replicate asynchronously or synchronously.
- Simple, strong consistency possible
- Single point of bottleneck for writes
- Examples: MySQL replication, PostgreSQL streaming replication
Multi-Leader
Multiple nodes accept writes. Conflict resolution needed.
- Better write availability and latency
- Complex conflict resolution
- Examples: CouchDB, DynamoDB global tables
Leaderless (Quorum)
Writes go to W nodes, reads from R nodes. Consistent if R + W > N (quorum).
- No single point of failure
- Tunable consistency vs. availability
- Examples: Cassandra (N=3, W=2, R=2), Amazon Dynamo
Distributed File Systems
NFS: Network File System. Simple, stateless (v2/v3). Client caches, no consistency guarantees.
HDFS (Hadoop): Write-once, large files, streaming reads. NameNode (metadata), DataNodes (blocks).
GFS/Colossus (Google): Append-heavy workloads, large files, commodity hardware.
Ceph: POSIX-compliant, no single point of failure. CRUSH algorithm for data placement.
Remote Procedure Call (RPC)
Call a procedure on a remote machine as if it were local.
- Client calls stub
- Stub marshals arguments, sends message
- Server stub unmarshals, calls procedure
- Result marshaled and sent back
- Client stub returns result
Exactly-once semantics is hard; network failures mean you may not know if the call executed. Design for idempotency (at-least-once is safer).
Frameworks: gRPC (Google, uses Protocol Buffers), Thrift (Meta), JSON-RPC.
MapReduce
Programming model for large-scale data processing.
Map phase: Apply a function to each input record → produce (key, value) pairs Shuffle: Group all values by key Reduce phase: Aggregate values for each key
- Automatic parallelization, fault tolerance (re-execute failed tasks)
- Handles stragglers with speculative execution
- Replaced largely by Spark (in-memory) and streaming systems (Kafka, Flink)