Group Communication

This note summarizes group communication protocols, focusing on the design, trade-offs, and mechanics of the ISIS Group Communication System developed at Cornell University in the 1980s.

Introduction & Motivation: The ISIS System

The ISIS Group Communication System provides reliable, ordered message delivery services for distributed applications. Historically, ISIS has been used in critical, real-world systems such as NASDAQ and the Boeing 777 software control system.

The Replicated State Machine (RSM) Dilemma

To build a Replicated State Machine (RSM) that does not diverge, a system needs a single total ordering on all events (the gold standard of strict consistency). However, enforcing a total order is:

  • Expensive & Poorly Scalable: Ordering everything requires coordination across all nodes.
  • Speed-of-Light Bound: The latency of coordination is ultimately limited by the physical speed of light over distances.

The Solution: Relaxed Consistency

Instead of forcing every part of a distributed system to be a strict RSM:

  1. Identify the small subsets of the system that absolutely require strict RSM semantics.
  2. Allow the rest of the system to operate under weaker/relaxed ordering semantics to maximize performance by reducing coordination and message serialization.

The Communication & Semantics Spectrum

Group communication protocols trade off ordering guarantees against reliability and performance costs.

"\\begin{document}\n\\begin{tikzpicture}[>=stealth, scale=1.4]\n% Axis line drawing\n\\draw[->, thick] (0,0) -- (6.5,0);\n\\draw[->, thick] (0,0) -- (0,6.5);\n\n% Axis labels inside the chart\n\\node[above left] at (6.5, 0.15) {\\small \\underline{Reliability (Atomicity)}};\n\\node[below right] at (0.15, 6.3) {\\small \\underline{Ordering Guarantees}};\n\n% X-axis Ticks\n\\draw (1, 0.1) -- (1, -0.1) node[below=4pt] {\\small Best Effort};\n\\draw (3.5, 0.1) -- (3.5, -0.1) node[below=4pt] {\\small Reliable Channel};\n\\draw (5.5, 0.1) -- (5.5, -0.1) node[below=4pt] {\\small Atomic / Agreement};\n\n% Y-axis Ticks\n\\draw (0.1, 1) -- (-0.1, 1) node[left=5pt] {\\small None};\n\\draw (0.1, 2) -- (-0.1, 2) node[left=5pt] {\\small FIFO};\n\\draw (0.1, 3) -- (-0.1, 3) node[left=5pt] {\\small Causal};\n\\draw (0.1, 4) -- (-0.1, 4) node[left=5pt] {\\small Causal Total};\n\\draw (0.1, 5) -- (-0.1, 5) node[left=5pt] {\\small Total Order};\n\\draw (0.1, 6) -- (-0.1, 6) node[left=5pt] {\\small Total Wall Clock};\n\n% Draw nodes (transparent background to allow text to adapt to light/dark themes natively)\n\\node[draw, rectangle, rounded corners, inner sep=4pt] (udp) at (1, 1) {\\small\\textbf{UDP}};\n\\node[draw, rectangle, rounded corners, inner sep=4pt] (tcp) at (3.5, 2) {\\small\\textbf{TCP}};\n\\node[draw, rectangle, rounded corners, inner sep=4pt] (cbcast) at (5, 3) {\\small\\textbf{CBCAST}};\n\\node[draw, rectangle, rounded corners, inner sep=4pt] (abcast) at (5.5, 5) {\\small\\textbf{ABCAST}};\n\n% Projection lines\n\\draw[dashed, gray!60] (udp) -- (1, 0);\n\\draw[dashed, gray!60] (udp) -- (0, 1);\n\n\\draw[dashed, gray!60] (tcp) -- (3.5, 0);\n\\draw[dashed, gray!60] (tcp) -- (0, 2);\n\n\\draw[dashed, gray!60] (cbcast) -- (5, 0);\n\\draw[dashed, gray!60] (cbcast) -- (0, 3);\n\n\\draw[dashed, gray!60] (abcast) -- (5.5, 0);\n\\draw[dashed, gray!60] (abcast) -- (0, 5);\n\\end{tikzpicture}\n\\end{document}"Reliability(Atomicity)OrderingGuaranteesBestE®ortReliableChannelAtomic/AgreementNoneFIFOCausalCausalTotalTotalOrderTotalWallClockUDPTCPCBCASTABCAST
source code
  • UDP: Best effort, unreliable transport.
  • TCP: Guarantees FIFO delivery per channel. Cannot guarantee 100% reliability if endpoints fail (due to the impossibility of acknowledging ACKs infinitely).
  • CBCAST: Relaxed causal ordering where messages respect the happens-before relationship, allowing high concurrency.
  • ABCAST: Strict total ordering where every node delivers messages in the exact same order, which can clog queues during coordination.

Types of Ordering Semantics

  • Total Wall Clock Ordering: All messages are ordered strictly by the physical time they were sent. (Extremely difficult to achieve precisely in distributed systems due to clock drift).

  • Sequential Consistency (Total Order): All nodes deliver all messages in the same sequence, but this sequence does not necessarily correspond to physical wall-clock time.

  • Causal Ordering: Respects the “happens-before” () relationship. If message causally influenced1 the sending of , then must be delivered before at all destinations. Unrelated (concurrent) messages do not have a defined order, which allows for parallel processing.

  • Causal Total Ordering: Combines both causal and total ordering guarantees. This can be achieved by layering ABCAST on top of CBCAST. Uses Vector Timestamps to track causality.

CBCAST (Causal Broadcast)

CBCAST guarantees that messages are delivered respecting causal relationships.

Mechanics & Rules

Every node maintains a local message buffer (FIFO queue). To satisfy causal consistency:

  1. FIFO Sender Order: Messages sent by a specific node must be delivered in the order they were sent.
    • Implementation: A node maintains a FIFO queue of outgoing messages. When connecting to another node via TCP, it sends the entire queue from the beginning. If the connection drops, it restarts transmission from the beginning of the queue, discarding duplicates at the receiver (idempotency).
  2. Transitive Causality: If a node receives a message and subsequently sends message , then any node receiving must deliver before .
    • Implementation: When a node receives a message, it appends it to its local queue. Thus, when it talks to another node, it transmits not only its own messages but all messages it has “heard” so far. The “so far” is explicitly the “happens-before” relation.
Example Buffer Queue containing locally generated and transitively heard messages:
+----+----+----+----+----+----+----+
| B1 | A4 | C1 | A3 | A2 | A1 | ...| <-- Front of Queue (sent first)
+----+----+----+----+----+----+----+

Optimization via Vector Timestamps (VTS)

To avoid sending the entire history and to detect duplicate messages:

  • Each queue state/message is labeled with a Vector Timestamp (VTS) summarizing the highest sequence number seen from each node.
  • If node sends a message with VTS entry , the receiver checks its own VTS. If the receiver’s VTS for , it discards the message as a duplicate (because it only updates its own VTS if it had seen the before). If it is exactly , it accepts it. If it is , it buffers the message and waits for the missing causal history to arrive.

Trace Example 1

Suppose we have three nodes , , and , all starting with VTS initialized to [0, 0, 0].

Initial VTS State

A: [0, 0, 0]
B: [0, 0, 0]
C: [0, 0, 0]

Event 1: Transmission

Node broadcasts message , incrementing its own index in its VTS. The message is sent with metadata .

A: [1, 0, 0]
B: [0, 0, 0]
C: [0, 0, 0]

Event 2: Causal Dependency

Node receives and delivers (updating its VTS to [1, 0, 0]), and subsequently broadcasts message , incrementing its VTS. The message is sent with metadata . Because delivered before sending , a causal dependency is established: .

A: [1, 0, 0]
B: [1, 1, 0]
C: [0, 0, 0]

Event 3: Out-of-Order Arrival & Buffering

Due to network latency, Node receives (with VTS [1, 1, 0]) first, before has arrived. Node checks the causal delivery rules:

  • (expected next message from , holds).
  • (failed: , since has not yet delivered ). Since the causal constraint is not satisfied, Node buffers .
A: [1, 0, 0]
B: [1, 1, 0]
C: [0, 0, 0]  |  Buffered: [b_1 (VTS: [1, 1, 0])]

Event 4: Delivery and Resolution

Node finally receives (with VTS [1, 0, 0]).

  1. Node delivers immediately, since its causal dependency is met: and . Node ‘s VTS updates to [1, 0, 0].
C: [1, 0, 0]  |  Buffered: [b_1 (VTS: [1, 1, 0])]
  1. Node checks its buffered messages. Since is now [1, 0, 0], the causal constraint for is satisfied (). delivers , and updates its VTS.
A: [1, 0, 0]
B: [1, 1, 0]
C: [1, 1, 0]  |  Buffered: []

Trace Example 2: Concurrent Messages

When messages are concurrent (), CBCAST does not enforce any ordering between them, allowing different nodes to deliver them in different orders without blocking.

Suppose we have three nodes , , and , all starting with VTS [0, 0, 0].

A: [0, 0, 0]
B: [0, 0, 0]
C: [0, 0, 0]

Event 1: Concurrent Transmission

Node and Node broadcast messages and at the same time.

  • is sent with .
  • is sent with .
A: [1, 0, 0]
B: [0, 1, 0]
C: [0, 0, 0]

Event 2: Deliveries at Node C

Node receives first, then .

  1. Node receives and delivers immediately (VTS becomes [1, 0, 0]).
  2. Node receives and delivers immediately, since is satisfied (VTS of becomes [1, 1, 0]). Delivery Order at C: .
C: [1, 1, 0]

Event 3: Delivery at Node B

Node (having delivered its own message locally) receives . Since

Node delivers immediately (VTS becomes [1, 1, 0]). Delivery Order at B: .

B: [1, 1, 0]

Event 4: Delivery at Node A

Node (having delivered its own message locally) receives . Since

Node delivers immediately (VTS becomes [1, 1, 0]). Delivery Order at A: .

A: [1, 1, 0]

Summary: Because and are causally independent, CBCAST delivers them without any coordination, resulting in Node delivering while Nodes and deliver .

Atomicity & Garbage Collection

  • Atomic Propagation: If a message reaches at least one alive node before the sender dies, it will eventually propagate to all other alive nodes through transitive gossip, ensuring atomicity.
  • Garbage Collection (REM_DST): To prevent the queue from growing indefinitely, each message tracks a remaining destination set (REM_DST).
    • As nodes acknowledge receipt, they are removed from REM_DST.
    • Once REM_DST is empty, the message is garbage collected from the queue.
    • Optimization: Senders can include metadata indicating which nodes have already received the message to prevent redundant transmissions (similar to Lazy Release Consistency / TreadMarks diff propagation using lock acquisition).

ABCAST (Atomic Broadcast)

ABCAST guarantees total ordering of messages across all group members. Unlike CBCAST, nodes cannot immediately process messages; they must wait for the group to agree on a final sequence number.

The Challenge: Queue Clogging

Every node maintains a local queue. Since messages must be delivered in the exact same total order, if the message at the front of the queue is waiting for order agreement (i.e., is not yet marked deliverable), the entire queue is clogged. No messages behind it can be delivered to the application, even if their ordering is already decided.

Protocol Steps

  1. Initiation: The sender broadcasts a message to all nodes.
  2. Temporary Sequence: Each node places the message in its local queue, marks it as undeliverable (temporary), and assigns it a temporary sequence number (usually the local counter). It sends this number as an ACK proposal back to the sender.
  3. Agreement: The sender collects all proposals, selects the maximum sequence number, and broadcasts this final sequence number.
  4. Commit & Sort: Each node updates the message’s sequence number to the final maximum, marks it as deliverable, and re-sorts its queue based on the final sequence numbers (breaking ties using node IDs).
  5. Delivery: A message is delivered to the application only when it is at the front of the queue and marked deliverable.

Trace Example 3

Suppose node sends message and node sends message concurrently. Node is a receiver.

Initial Queue State (Proposed Sequences)

Messages are placed in the queues with temporary proposed sequences. Sequence number tuples are in the format seq.node_id (e.g., 2a represents sequence number 2 proposed by node A).

[Queue End]                                          [Queue Front]
    v                                                      v
A:  b(2a) [Undeliverable]    |   a(1a) [Undeliverable]
B:  a(2b) [Undeliverable]    |   b(1b) [Undeliverable]
C:  b(2c) [Undeliverable]    |   a(1c) [Undeliverable]

Step 1: Proposal Collection

  • For message : Proposals collected by are 1.A (from ), 2.B (from ), and 1.C (from ). Max sequence = 2.B.
  • For message : Proposals collected by are 2.A (from ), 1.B (from ), and 2.C (from ). Max sequence = 2.C.

Step 2: Final Sequence Broadcast & Re-sorting

The senders broadcast the chosen final sequence numbers (2.B for message , 2.C for message ). Nodes update the sequence numbers, mark them as deliverable, and re-sort their queues. Since 2.B < 2.C (by alphabetical tie-breaker), sorts before .

[Queue End]                                    [Queue Front]
    v                                                v
A:  a(2b) [Deliverable]     |   b(2a) [Undeliverable]  <-- a is Clogged by b
B:  a(2b) [Deliverable]     |   b(1b) [Undeliverable]  <-- a is Clogged by b
C:  b(2c) [Undeliverable]   |   a(2b) [Deliverable]    <-- a is deliverable

Step 3: Delivery and Resolution

  • Node C: Message (final sequence 2.B) is at the front of the queue and marked deliverable. Node delivers immediately.
  • Nodes A and B: Message is deliverable but is blocked because the messages at the front of their queues (b(2a) and b(1b)) are still undeliverable. The queues are clogged.
  • Resolution: In the fullness of time, node decides the final sequence number for (2.C) and broadcasts it. Nodes A and B update to 2.C (deliverable) and re-sort their queues:
A:  b(2c) [Deliverable]  |  a(2b) [Deliverable]    <-- a can be delivered
B:  b(2c) [Deliverable]  |  a(2b) [Deliverable]    <-- a can be delivered
C:                       |  b(2c) [Undeliverable]  <-- waiting for b's decision

(Once is updated to 2.C on all nodes, both and are delivered sequentially across all nodes).

Footnotes

  1. Message causally influences if the sender of had already sent or received (directly or transitively) before generating (i.e., via Lamport’s happens-before relation).