The SRM (Scalable Reliable Multicast) paper argues that, unlike the case for reliable unicast, applications' requirements for reliable multicast delivery vary widely. Hence:
One cannot make a single reliable multicast delivery scheme that simultaneously meets the functionality, scalability, and efficiency requirements of all applications.
Accordingly, they propose a model in which a generic skeleton is “fleshed out with application specific details.” The application provides a means to talk about how much data has been sent and received (“application data units”, unlike the connection-state oriented acknowledgments in TCP), a means for allocating bandwidth among the members of the group, and a means for individual nodes to decide how to allocate their local outgoing bandwidth. Following this model, their multicast framework provides only best-effort packet delivery, with possible duplication and reordering of packets — they believe that applications can build stronger reliable delivery and ordering properties on top of their framework, as needed. Best-effort multicast is actually implemented using IP multicast.
Unicast vs. Multicast
The authors point out two differences between reliable delivery for unicast vs. multicast that I thought were interesting:
- In any reliable delivery protocol, one party must take responsibility for detecting lost data and retransmitting it. In unicast, either the sender or receiver can play this role equally well (TCP uses the sender, other protocols like NETBLT use the receiver). In multicast, sender-side delivery state is problematic: the sender must track the set of active recipients and the current state of each recipient, which is expensive, and difficult to do as the multicast group changes. In some sense, the whole point of multicast is to relieve the sender of that responsibility. Hence, they argue that receiver-side delivery state is better for multicast: group membership isn’t relevant, and the burden of keeping per-receiver state is avoided.
- A reliable delivery protocol also needs a vocabulary to talk about how much data has been sent or received. Typical unicast protocols use a vocabulary based on communication state (typically bytes or packets/segments). This is not ideal for multicast, because a newly-joining recipient doesn’t share that communication state, and hence can’t interpret byte-oriented or packet-oriented messages. SRM instead argues for talking in terms of “application data units.”
The example multicast application in the paper is wb, a shared distributed whiteboard. This has the advantage that protocol commands (e.g. draw shape X at location Y) are mostly idempotent (if associated with a timestamp), so the underlying multicast protocol doesn’t need to enforce a total order on deliveries.
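To make the idempotence point concrete, here is a minimal sketch (the class and method names are mine, not from the paper) of why timestamped drawing operations tolerate best-effort delivery: keying each operation lets duplicates collapse to no-ops, and sorting by timestamp at render time removes the need for ordered delivery.

```python
# Hypothetical sketch of wb-style idempotent operations. Duplicated or
# reordered deliveries from best-effort multicast are harmless because
# each operation is keyed and carries its own timestamp.

class Whiteboard:
    def __init__(self):
        self.ops = {}  # (source, op_id) -> (timestamp, command)

    def apply(self, source, op_id, timestamp, command):
        # Applying the same operation twice is a no-op: idempotent.
        self.ops.setdefault((source, op_id), (timestamp, command))

    def render(self):
        # A consistent picture falls out of sorting by timestamp, so the
        # protocol never needs to enforce a total order on deliveries.
        return [cmd for ts, cmd in sorted(self.ops.values())]

wb = Whiteboard()
wb.apply("A", 1, 10, "draw circle at (3,4)")
wb.apply("B", 1, 5, "draw line (0,0)-(9,9)")
wb.apply("A", 1, 10, "draw circle at (3,4)")  # duplicate, ignored
assert wb.render() == ["draw line (0,0)-(9,9)", "draw circle at (3,4)"]
```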
In wb, new operations are multicast to the entire group. When a receiver detects a loss (by detecting gaps in sequence numbers), it starts a “repair request” timer, whose value is determined by the distance of the receiver from the original data source. When the timer expires, the recipient multicasts a “repair request” to the entire group, asking for the missing data. If a recipient sees another repair request for the same data before its timer expires, it suppresses its own repair request. Retransmission of missing data is handled similarly: when nodes receive repair requests for data they have seen, they start a response timer based on their distance from the repair request source. When the timer expires, they multicast the requested data to the entire group, and any other nodes with a pending response timer for the same data cancel their timers. The paper also proposes a bandwidth allocation scheme to divide the available bandwidth among new data operations and repair data.
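The request side of this scheme can be sketched as follows (a simplified, hypothetical model — class and method names are mine): each receiver that detects a gap schedules a request after a delay seeded by its distance from the source, and cancels that timer if it hears another node’s request for the same data first.

```python
# Hypothetical sketch of SRM-style repair-request suppression. Delays
# are seeded purely by distance here (the simple "chain" case); real
# SRM also randomizes the timers.

class Receiver:
    def __init__(self, name, distance_to_source):
        self.name = name
        self.distance = distance_to_source
        self.pending = {}  # missing seqno -> scheduled firing time

    def detect_loss(self, seqno, now):
        # Closer receivers tend to fire first.
        self.pending[seqno] = now + self.distance

    def hear_request(self, seqno):
        # Another receiver asked already: suppress our duplicate request.
        self.pending.pop(seqno, None)

# Two receivers miss packet 7; the nearer one's timer fires first, so
# it multicasts the request, and the farther one suppresses its own.
near, far = Receiver("near", 1.0), Receiver("far", 3.0)
near.detect_loss(7, now=0.0)
far.detect_loss(7, now=0.0)
far.hear_request(7)   # far hears near's request before its timer fires
assert 7 not in far.pending
```

Repair responses work symmetrically: nodes holding the data schedule a response timer seeded by their distance from the requester, and suppress it on hearing another node’s retransmission.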
Tuning the Request/Response Repair Timers
It is important that the repair request and repair response timers at different nodes be de-synchronized, to avoid redundant messages. The paper observes that different topologies require different methods for achieving de-synchronization: in a simple “chain” topology, seeding the timer with network distance is sufficient; in a “star” topology, all nodes are the same distance from the data source, so randomization must be used; a tree topology requires a combination of both. Rather than requiring the timers to be tuned for each individual network, they propose an adaptive algorithm that uses prior request/response behavior to tune the timers automatically.
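A rough sketch of the combined idea: draw the request timer uniformly from an interval scaled by distance (covering both the chain and star cases), and adjust the interval’s constants based on observed behavior. The interval form follows the paper’s timer scheme; the adaptation rules below are illustrative simplifications, not the paper’s exact algorithm.

```python
import random

# Hypothetical sketch of adaptive request timers. The timer is drawn
# uniformly from [c1*d, (c1+c2)*d], where d is the estimated distance
# to the source: the c1*d term de-synchronizes chains, the random c2*d
# term de-synchronizes stars. The adapt() update rules are my own
# illustrative stand-ins for the paper's adaptive algorithm.

class AdaptiveRequestTimer:
    def __init__(self, c1=2.0, c2=2.0):
        self.c1, self.c2 = c1, c2

    def draw(self, distance):
        lo = self.c1 * distance
        return random.uniform(lo, lo + self.c2 * distance)

    def adapt(self, duplicate_requests, delay_factor):
        if duplicate_requests > 1:
            # Too many redundant requests last round: spread timers out.
            self.c2 += 0.5
        elif delay_factor > 2.0:
            # Recovery was slow with no duplicates: fire sooner.
            self.c1 = max(0.5, self.c1 - 0.1)
            self.c2 = max(0.5, self.c2 - 0.5)
```

Each node runs this loop independently, so the constants drift toward values that suit whatever topology the node actually sits in.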