Tag Archives: multicast

“Scalable Reliable Multicast”

The SRM (Scalable Reliable Multicast) argues that, unlike reliable unicast protocols, different applications vary widely in their requirements for reliable multicast delivery. Hence:

One cannot make a single reliable multicast delivery scheme that simultaneously meets the functionality, scalability, and efficiency requirements of all applications.

Hence, they propose a model in which a generic skeleton is “fleshed out with application specific details.” The application provides a means to talk about how much data has been sent and received (“application data units”, unlike the connection-state oriented acknowledgments in TCP), a means for allocating bandwidth among the members of the group, and a means for individual nodes to decide how to allocate their local outgoing bandwidth. Following this model, their multicast framework only provides best-effort packet delivery, with possible duplication and reordering of packets — they believe that applications can build stronger reliable delivery and ordering properties on top of their framework, as needed. Best-effort multicast is actually implemented using IP multicast.

Unicast vs. Multicast

The authors point out two differences between reliable delivery for unicast vs. multicast that I thought were interesting:

  • In any reliable delivery protocol, one party must take responsibility for detecting lost data and retransmitting it. In unicast, either the sender or receiver can play this role equally well (TCP uses the sender, other protocols like NETBLT use the receiver). In multicast, sender-side delivery state is problematic: the sender must track the set of active recipients and the current state of each recipient, which is expensive, and difficult to do as the multicast group changes. In some sense, the whole point of multicast is to relieve the sender of that responsibility. Hence, they argue that receiver-side delivery state is better for multicast: group membership isn’t relevant, and the burden of keeping per-receiver state is avoided.
  • A reliable delivery protocol also needs a vocabulary to talk about how much data has been sent or received. Typical unicast protocols use a vocabulary based on communication state (typically either bytes or packets (segments)). This is not ideal for multicast, because a newly-joining recipient doesn’t share that communication state, and hence can’t interpret byte-oriented or packet-oriented messages. SRM instead argues for talking in terms of “application data units.”

wb Framework

The example multicast application in the paper is wb, a shared distributed whiteboard. This has the advantage that protocol commands (e.g. draw shape X at location Y) are mostly idempotent (if associated with a timestamp), so the underlying multicast protocol doesn’t need to enforce a total order on deliveries.

In wb, new operations are multicast to the entire group. When a receiver detects a loss (by detecting gaps in sequence numbers), it starts a “repair request” timer. The timer value is determined by the distance of the receiver from the original data source. When the timer expires, the recipient multicasts a “repair request” to the entire group, asking for the missing data. If a recipient sees another repair request for the same data before its timer expires, it suppresses its own repair request. Retransmission of missing data is handled similarly: when nodes receive repair requests for data they have seen, they start a response timer based on their distance from the repair request source. When the timer expires, they multicast the requested data to the entire group. Any other nodes that have a response timer for the requested data suppress their own timers. They also propose a bandwidth allocation scheme to divide the available bandwidth among new data operations and repair data.

Tuning the Request/Response Repair Timers

It is important that the repair request and repair response timers at different nodes be de-synchronized, to avoid redundant messages. The paper observes that certain topologies require methods for achieving de-synchronization: in a simple “chain” topology, seeding the timer with network distance is sufficient. For a “star” topology, all nodes are the same distance from the data source, so randomization must be used. A combination of these techniques must be used for a tree topology. Rather than requiring the timers be tuned for each individual network, they instead propose an adaptive algorithm that uses prior request/response behavior to tune the timers automatically.

Leave a comment

Filed under Paper Summaries

“Scalable Application Layer Multicast”

Multicast is clearly an efficient technique for applications with one-to-many communication patterns, but the deployment of in-network multicast has been slow. Therefore, there have been a number of proposals for implementing multicast at the application level, as an overlay over the physical network. This is not as efficient as true in-network multicast (because the same message may be sent over the same link multiple times), but is much more flexible and easier to deploy.

Scalable Application Layer Multicast” is one such proposal for application-layer multicast via an overlay network; their proposed protocol is called NICE. Their focus is on applications that require low-latency delivery of relatively low-bandwidth data streams to a large set of recipients, although they argue that their techniques are also applicable to high-volume streams with some minor changes.

Network Topology

In NICE, nodes are arranged into layers; a single node can appear in more than one layer. Every node belongs to the lowest layer, L0. The nodes in a layer are arranged (automatically) into clusters of nodes that are “near” to one another (in terms of network distance/latency). Each cluster has a leader, which is the node in the “center” of the cluster (NICE tries to make the leader the node that has the smallest maximal latency to the other nodes in the cluster). Layer L1 consists of all the cluster leaders from L0; the clustering algorithm is applied to L1 in turn, yielding another set of cluster leaders which form L2, and so forth. The height of the tree is determined by the number of nodes and a constant k: each cluster contains between k and 3k-1 nodes.

To multicast a data message, a node forwards the message to every cluster peer in every layer in which the node belongs, except that a node never forwards a message back to the message’s previous hop.


To join the multicast group, a node begins by contacting a designated node called the Rendezvous Point (RP). The RP is typically the root of the NICE tree. The joining node walks down the tree from the root, choosing the child node that is closest to it (lowest latency).

Cluster leaders periodically check whether the cluster size constraint (k <= size <= 3k-1) has been violated; if so, they initiate a cluster merge or split, as appropriate. Splitting a cluster into two clusters is done by trying to minimize the maximum of the radii of the resulting clusters.

All the nodes in a cluster periodically sends heartbeats to each of its cluster peers. This is used to detect node failures, and to update pair-wise latency information for nodes. If a cluster leader fails or deliberately leaves the NICE group, a new leader is chosen by the same heuristic (minimize maximum latency from new center to any cluster peer). A new leader may also be chosen if the pair-wise latencies in the cluster drift sufficiently far to make selecting a new leader justified. Also, each member of every layer i periodically probes its latency to the nodes in layer i+1 — if the node is closer to another i+1 layer node, it moves to the corresponding cluster in layer i.


I didn’t see that the authors provided any grounds for choosing an appropriate k value (essentially the tree fan-in), which seems like it would be an important parameter.

1 Comment

Filed under Paper Summaries

“Internet Indirection Infrastructure”

Internet Indirection Infrastructure” (i3) proposes a communication abstraction in which applications send packets to identifiers, rather than network addresses. Receivers register triggers that express their interest in packets sent to a particular identifier— the i3 infrastructure takes care of the routing. i3 is designed to be a general-purpose substrate, which can be used to build communication patterns including multicast, anycast, load balanced server selection, and host mobility.

Basic Model

To register interest in data, a receiver registers a trigger for the pair (id, addr) — this specifies that packets sent to identifier id should be routed to IP address addr. Multiple receivers for a single ID can be registered, in which case the packet is delivered to all matching trigger addresses. i3 generalizes this basic model in a few ways:

  • Inexact matching. Identifiers are divided into two pieces: an “exact match prefix” of k bits, and a suffix that is used to do longest-prefix matches. That is, a trigger matches an identifier if the first k bits match exactly, and if no other trigger has a longer prefix match against the rest of the identifier.
  • Identifier stacks. Rather than sending a packet to a single identifier, packets can instead be sent to a stack of identifiers. Similarly, triggers are generalized to mean “When a match for id is found, push a, b, … onto the stack.” i3 looks for a trigger that matches the first element of the stack; if no such trigger is found, the first element is popped from the stack. This continues until the stack is empty or a matching trigger is found; if a match is found, the trigger’s identifier stack is pushed onto the packet’s stack, and routing continues anew. i3 assumes that this process will eventually result in matching a trigger against an identifier holding an IP address. When that occurs, the packet’s data and the rest of the identifier stack is sent to the address.
  • Private triggers. This is not an i3-level concept, but a design pattern that i3-using applications often follow. A “public” trigger is a trigger on a well-known identifier, and is used to initiate a session with a server; “private” triggers are used to communicate between pairs of end hosts. A pair of private trigger identifiers might be exchanged via a service’s public trigger identifier, and then subsequent communication will proceed via the private triggers.


i3 is implemented on top of a distributed hash table (DHT); the authors use the Chord DHT in the paper. The DHT provides a service to find the node associated with a key; for the purposes of i3, this allows a packet sender to find the node that holds the triggers for the first identifier in the packet’s identifier stack (the DHT lookup is only done on the “exact match prefix” of the identifier). The target i3 node then finds the matching trigger, if any, by doing a longest-prefix match against trigger identifiers. Once a match is found, the packet is routed to the receiver address or addresses via IP, or another DHT lookup occurs (if the trigger contains an ID stack, not an address).

Senders cache the IP address of the i3 node responsible for each identifier. Hence, in the common case, routing an i3 packet requires two physical network traversals: one to get from the sender to the i3 node, and then another from the i3 node to the receiver.

Example Usage

i3 provides direct support for simple multicast; anycast is supported using inexact identifier matching, and appropriate choices for the inexact suffix of the identifier. Host mobility can be supported by simply updating a receiver’s trigger. Multicast trees can be implemented by constructing routing trees in i3, using the identifier stack feature (triggers that “invoke” triggers can essentially be used to construct an arbitrary graph, and to do recursive queries over graphs).


In the paper’s experiments, the latency to route the first packet to an identifier is about 4 times that of raw IP routing, because a Chord lookup must be done (after applying some standard Chord tricks to try to pick the “closest” Chord node among the alternatives as the ring is traversed). To reduce the latency required to route subsequent packets, i3 employs two techniques:

  • Caching the address of the i3 node that owns a given identifier, as discussed above.
  • A receiver generates k random trigger identifiers, and then chooses the identifier that it can reach via IP with the lowest latency.

By having each receiver generate 15-30 random triggers with this technique, the authors are able to reduce the i3 routing latency for subsequent packets send to an identifier to between 2 and 1.5 times more than raw IP.


Overall, I liked this paper: it is an elegant idea that is explained well. A cynical view of this paper is that it isn’t all that innovative or “deep”: it is just a thin logic layer on top of a DHT. i3 is basically a generalization of the name-to-value lookup service provided by the DHT.

Both performance and security are serious issues with this design, and the paper didn’t really convince me that their approach addresses these problems.

I would have liked to see a performance comparison between IP-level multicast and anycast with implementations built on i3.

It would be interesting to compare i3 with group communication systems (GCS) that provide totally-ordered broadcast messages. What features could you add to i3 to make it easier to write reliable distributed systems?

Leave a comment

Filed under Paper Summaries