Tag Archives: overlay network

“Scalable Application Layer Multicast”

Multicast is clearly an efficient technique for applications with one-to-many communication patterns, but the deployment of in-network multicast has been slow. Therefore, there have been a number of proposals for implementing multicast at the application level, as an overlay over the physical network. This is not as efficient as true in-network multicast (because the same message may be sent over the same link multiple times), but is much more flexible and easier to deploy.

“Scalable Application Layer Multicast” is one such proposal for application-layer multicast via an overlay network; their proposed protocol is called NICE. Their focus is on applications that require low-latency delivery of relatively low-bandwidth data streams to a large set of recipients, although they argue that their techniques are also applicable to high-volume streams with some minor changes.

Network Topology

In NICE, nodes are arranged into layers; a single node can appear in more than one layer. Every node belongs to the lowest layer, L0. The nodes in a layer are arranged (automatically) into clusters of nodes that are “near” to one another (in terms of network distance/latency). Each cluster has a leader, which is the node in the “center” of the cluster (NICE tries to make the leader the node that has the smallest maximal latency to the other nodes in the cluster). Layer L1 consists of all the cluster leaders from L0; the clustering algorithm is applied to L1 in turn, yielding another set of cluster leaders which form L2, and so forth. The height of the tree is determined by the number of nodes and a constant k: each cluster contains between k and 3k-1 nodes.

To multicast a data message, a node forwards the message to every cluster peer in every layer in which the node belongs, except that a node never forwards a message back to the message’s previous hop.
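The forwarding rule above can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's code; `forward_targets`, the layer-to-cluster mapping, and the node names are all my own.

```python
# Toy sketch of NICE-style data forwarding: a node forwards a message to all
# of its cluster peers in every layer it belongs to, but never back to the
# message's previous hop. Names here are illustrative, not from the paper.

def forward_targets(node, clusters_by_layer, prev_hop):
    """clusters_by_layer maps a layer number to the cluster (a set of nodes)
    that `node` belongs to in that layer."""
    targets = set()
    for cluster in clusters_by_layer.values():
        for peer in cluster:
            if peer != node and peer != prev_hop:
                targets.add(peer)
    return targets
```

For example, a node that belongs to a layer-0 cluster and (as a leader) a layer-1 cluster forwards to all peers in both clusters, minus the sender.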


To join the multicast group, a node begins by contacting a designated node called the Rendezvous Point (RP). The RP is typically the root of the NICE tree. The joining node walks down the tree from the root, choosing the child node that is closest to it (lowest latency).

Cluster leaders periodically check whether the cluster size constraint (k <= size <= 3k-1) has been violated; if so, they initiate a cluster merge or split, as appropriate. Splitting a cluster into two clusters is done by trying to minimize the maximum of the radii of the resulting clusters.
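A brute-force sketch of that split criterion: enumerate the roughly even two-way partitions and keep the one minimizing the larger of the two resulting radii. This is my own illustration of the stated objective, not the paper's algorithm; `latency` is an assumed symmetric latency function, and exhaustive search is only plausible because clusters are small (at most 3k-1 members).

```python
from itertools import combinations

def radius(cluster, latency):
    # Radius under the best choice of center: the minimum over candidate
    # centers of the maximal latency from that center to any other member.
    return min(max(latency(c, m) for m in cluster if m != c)
               for c in cluster)

def split_cluster(cluster, latency):
    """Split `cluster` into two halves, minimizing the maximum radius."""
    members = list(cluster)
    half = len(members) // 2
    best = None
    for subset in combinations(members, half):
        a = set(subset)
        b = set(members) - a
        cost = max(radius(a, latency), radius(b, latency))
        if best is None or cost < best[0]:
            best = (cost, a, b)
    return best[1], best[2]
```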

Every node in a cluster periodically sends heartbeats to each of its cluster peers. These heartbeats are used to detect node failures, and to update pairwise latency information for nodes. If a cluster leader fails or deliberately leaves the NICE group, a new leader is chosen by the same heuristic (minimize the maximum latency from the new center to any cluster peer). A new leader may also be chosen if the pairwise latencies in the cluster drift far enough to justify selecting a new center. Also, each member of every layer i periodically probes its latency to the nodes in layer i+1; if it is closer to a different layer-i+1 node, it moves to the corresponding cluster in layer i.


I didn’t see the authors provide any grounds for choosing an appropriate value of k (essentially the tree fan-in), which seems like an important parameter.


Filed under Paper Summaries

“Internet Indirection Infrastructure”

“Internet Indirection Infrastructure” (i3) proposes a communication abstraction in which applications send packets to identifiers, rather than to network addresses. Receivers register triggers that express their interest in packets sent to a particular identifier; the i3 infrastructure takes care of the routing. i3 is designed to be a general-purpose substrate that can be used to build communication patterns including multicast, anycast, load-balanced server selection, and host mobility.

Basic Model

To register interest in data, a receiver registers a trigger for the pair (id, addr) — this specifies that packets sent to identifier id should be routed to IP address addr. Multiple receivers for a single ID can be registered, in which case the packet is delivered to all matching trigger addresses. i3 generalizes this basic model in a few ways:

  • Inexact matching. Identifiers are divided into two pieces: an “exact match prefix” of k bits, and a suffix that is used to do longest-prefix matches. That is, a trigger matches an identifier if the first k bits match exactly, and if no other trigger has a longer prefix match against the rest of the identifier.
  • Identifier stacks. Rather than sending a packet to a single identifier, packets can instead be sent to a stack of identifiers. Similarly, triggers are generalized to mean “when a match for id is found, push a, b, … onto the stack.” i3 looks for a trigger that matches the first element of the stack; if no such trigger is found, the first element is popped from the stack. This continues until the stack is empty or a matching trigger is found; if a match is found, the trigger’s identifier stack is pushed onto the packet’s stack, and routing continues anew. i3 assumes that this process will eventually match a trigger that holds an IP address. When that occurs, the packet’s data and the rest of the identifier stack are sent to that address.
  • Private triggers. This is not an i3-level concept, but a design pattern that i3-using applications often follow. A “public” trigger is a trigger on a well-known identifier, and is used to initiate a session with a server; “private” triggers are used to communicate between pairs of end hosts. A pair of private trigger identifiers might be exchanged via a service’s public trigger identifier, and then subsequent communication will proceed via the private triggers.
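The identifier-stack processing loop can be sketched as follows. This is a toy model of my own: inexact matching is simplified to exact dictionary lookup to keep it short, and the trigger table, addresses, and identifiers are illustrative.

```python
# Toy sketch of i3-style routing with identifier stacks. A trigger maps an
# identifier either to an IP address (a plain string here) or to an
# identifier stack to push in its place.

def route(packet_stack, triggers):
    """Process the packet's identifier stack until a trigger holding an
    address matches, or the stack empties. Returns (address, rest_of_stack)
    on delivery, or None if the packet is dropped."""
    stack = list(packet_stack)
    while stack:
        head = stack[0]
        if head in triggers:
            target = triggers[head]
            if isinstance(target, str):        # trigger holds an IP address
                return target, stack[1:]       # deliver data + rest of stack
            stack = list(target) + stack[1:]   # push the trigger's ID stack
        else:
            stack.pop(0)                       # no match: pop and continue
    return None
```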


i3 is implemented on top of a distributed hash table (DHT); the authors use the Chord DHT in the paper. The DHT provides a service to find the node associated with a key; for the purposes of i3, this allows a packet sender to find the node that holds the triggers for the first identifier in the packet’s identifier stack (the DHT lookup is only done on the “exact match prefix” of the identifier). The target i3 node then finds the matching trigger, if any, by doing a longest-prefix match against trigger identifiers. Once a match is found, the packet is routed to the receiver address or addresses via IP, or another DHT lookup occurs (if the trigger contains an ID stack, not an address).

Senders cache the IP address of the i3 node responsible for each identifier. Hence, in the common case, routing an i3 packet requires two physical network traversals: one to get from the sender to the i3 node, and then another from the i3 node to the receiver.

Example Usage

i3 provides direct support for simple multicast; anycast is supported using inexact identifier matching, and appropriate choices for the inexact suffix of the identifier. Host mobility can be supported by simply updating a receiver’s trigger. Multicast trees can be implemented by constructing routing trees in i3, using the identifier stack feature (triggers that “invoke” triggers can essentially be used to construct an arbitrary graph, and to do recursive queries over graphs).


In the paper’s experiments, the latency to route the first packet to an identifier is about 4 times that of raw IP routing, because a Chord lookup must be done (after applying some standard Chord tricks to try to pick the “closest” Chord node among the alternatives as the ring is traversed). To reduce the latency required to route subsequent packets, i3 employs two techniques:

  • Caching the address of the i3 node that owns a given identifier, as discussed above.
  • A receiver generates k random trigger identifiers, and then chooses the identifier that it can reach via IP with the lowest latency.

By having each receiver generate 15-30 random triggers with this technique, the authors are able to reduce the i3 routing latency for subsequent packets sent to an identifier to between 1.5 and 2 times that of raw IP.
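The second technique amounts to a simple sampling heuristic, sketched below. The function name and the `latency_to_owner` measurement callback are my own illustrative assumptions; in a real deployment the latter would involve probing the i3 node that owns each candidate identifier.

```python
import random

# Hypothetical sketch of the trigger-placement heuristic: a receiver samples
# k random trigger identifiers, measures the IP latency to the i3 node that
# would own each one, and registers the lowest-latency identifier.

def choose_trigger(k, latency_to_owner, id_bits=256, rng=random):
    candidates = [rng.getrandbits(id_bits) for _ in range(k)]
    return min(candidates, key=latency_to_owner)
```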


Overall, I liked this paper: it is an elegant idea that is explained well. A cynical view of this paper is that it isn’t all that innovative or “deep”: it is just a thin logic layer on top of a DHT. i3 is basically a generalization of the name-to-value lookup service provided by the DHT.

Both performance and security are serious issues with this design, and the paper didn’t really convince me that their approach addresses these problems.

I would have liked to see a performance comparison between IP-level multicast and anycast with implementations built on i3.

It would be interesting to compare i3 with group communication systems (GCS) that provide totally-ordered broadcast messages. What features could you add to i3 to make it easier to write reliable distributed systems?


Filed under Paper Summaries

“The Development of the Domain Name System”

“The Development of the Domain Name System” describes the basic design of DNS, the motivations and design principles behind it, and the lessons learned by the designers of the system.

DNS Design

DNS is a distributed, hierarchical database that maps domain names to values. The values associated with a domain name are a collection of resource records (RRs), each with a well-known type (e.g. host IP address). Each RR also has a “class,” which specifies its protocol family (e.g. Internet). Given the dominance of IP, I’d expect classes to go largely unused in the modern DNS system.

DNS names are arranged into a variable-depth tree. The name of each node is given by concatenating the labels of the path from a node to the root, and separating the labels with periods. The tree is divided into “zones,” which are each controlled by a different organization. Each zone is a contiguous region of the tree, although a zone is typically a simple subtree. Administrative authority flows from the root of the tree to the leaves: the root zone controls the set of top-level domains (e.g. COM, NET), which in turn control the contents of their subtree. This allows individual organizations to administer portions of the tree autonomously.

DNS is composed of name servers, which host DNS data, and resolvers, which query the system to resolve domain names. Resolvers can either be individual client machines (e.g. part of libc), or a separate organization-wide resolver service (that might be combined with the organization’s name server). Resolvers can cache DNS lookups, to reduce resolution traffic (particularly on the servers that host the higher levels of the tree). Each DNS record has a “TTL” value that specifies how long it can be cached for. DNS resolvers typically use UDP, rather than TCP, which avoids the need for a TCP handshake before an address can be resolved.
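The TTL-driven caching described above can be modeled in a few lines. This is a minimal sketch of my own, not a real resolver: the `TTLCache` name is invented, and a real resolver would recurse to name servers on a miss rather than return `None`.

```python
import time

# Minimal sketch of resolver-side TTL caching: each cached record expires
# `ttl` seconds after it was stored, after which it must be re-fetched.

class TTLCache:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}                 # name -> (value, expiry_time)

    def put(self, name, value, ttl):
        self.entries[name] = (value, self.clock() + ttl)

    def get(self, name):
        entry = self.entries.get(name)
        if entry is None:
            return None                   # miss: a real resolver would recurse
        value, expiry = entry
        if self.clock() >= expiry:
            del self.entries[name]        # expired: drop and treat as a miss
            return None
        return value
```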


Modern DNS is insecure, arcane, and overly complex. Why? In practice, it seems that the focus on “leanness” described in the paper hasn’t produced a simple and minimalistic system.

I wonder if the focus on extensibility in the DNS design is justified. In practice, DNS is a system for mapping host names to IP addresses, with some secondary functionality like maintaining various (often-inaccurate) administrative information about host names and the MX feature. While extensibility may have been important when the intended use of the system was unclear, it doesn’t seem to be a big win in the modern Internet — a simpler and more constrained design might be a better fit for modern DNS usage. Foregoing extensibility and constraining the core DNS functionality to be a key-value lookup service might have helped to simplify the system.

The lack of any consideration of security is notable, given DNS’s shabby security record.

The paper explains that DNS could have represented multiple resource records of the same type with a single multi-valued record. Their argument for using multiple records is:

The space efficiency of the single RR with multiple values was attractive, but the multiple RR option cut down the maximum RR size. This appeared to promise simpler dynamic update protocols, and also seemed suited to use in a limited-size datagram environment.

I think this is a backwards way to design systems, because it confuses physical layout decisions (e.g. space efficiency of storage) with logical data format decisions (e.g. whether to use multiple “rows” of the same type, or a single row with an array value).

The paper notes that in 1983, root name servers typically processed one query per second (although clients still observed poor response times, 500 milliseconds to 5 seconds in some cases). I couldn’t find much data on the query rates of modern root servers, but the website for the K root server has a graph of query rates which suggests the average load is 15,000-20,000 queries per second. Multiplied by 13 root servers, that is a significant query load (although note that the 13 “root servers” are hosted by far more than 13 physical machines, using techniques like anycast).

Related Reading

Paul Vixie’s article “What DNS Is Not” decries “innovators” who “misuse” the DNS system. He particularly dislikes the use of DNS with IP-based geolocation to implement CDNs, for instance.


Filed under Paper Summaries

“Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications”

This paper describes the design of Chord, one of the earliest and most popular academic DHT systems (according to one measure, the Chord paper is the most-cited CS paper from 2001, along with a raft of other DHT papers). Chord’s design focuses on simplicity and on offering provable performance and correctness properties. Chord provides a single service: given a key, it returns the address of the node storing the key. Other practical concerns, like replication, caching, naming, and cryptographic authentication, should be provided by an application layer that runs on top of Chord.

Basic Protocol

In Chord, nodes and keys are arranged in a circle (the “identifier circle”). Each node maintains the address of its successor on the circle; maintaining the correctness of these successor pointers is the critical correctness property of the protocol. Each key k is owned by the first node whose identifier (e.g. the hash of its IP address) follows k on the identifier circle. Therefore, keys can be located by simple sequential search; the search terminates when it reaches the node that ought to own the key.

To improve search performance, Chord also maintains a finger table. The finger table is essentially a skip list: if there are N nodes in the DHT, each finger table has O(log N) entries, holding the address of a node 1/2, 1/4, 1/8, … of the way “forward” from the current node in the identifier circle. This essentially allows the search procedure to perform a binary search of the identifier circle, to quickly approach the “neighborhood” of the target key. Note that this means that each node maintains more information about its local neighborhood than about distant nodes, which makes sense.
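A toy Python sketch may make the lookup concrete. This is my own simplification, not the paper's pseudocode: the ring lives in one process, the identifier space size `M` is tiny, and the `Node` class and function names are invented.

```python
M = 6                                    # toy identifier space: 0 .. 2**M - 1

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b               # the interval wraps past zero

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = None            # next node clockwise on the circle
        self.fingers = []                # entry i: successor of (id + 2**i)

    def closest_preceding(self, key):
        # Scan fingers from farthest to nearest for one that precedes key.
        for f in reversed(self.fingers):
            if f.id != key and in_interval(f.id, self.id, key):
                return f
        return self

def lookup(start, key):
    """Return the node that owns `key`: the first node at or after it."""
    node = start
    while not in_interval(key, node.id, node.successor.id):
        nxt = node.closest_preceding(key)
        node = node.successor if nxt is node else nxt
    return node.successor
```

Each finger jump at least halves the remaining ring distance to the key, which is where the O(log N) hop bound comes from.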

Joining A Chord Group

To join a Chord group, a joining node begins by computing its predecessor and finger tables. It then joins the group by notifying its successor node, which includes the new node as its predecessor. Each node runs a “stabilization” protocol to maintain the correctness of the successor pointers: each node n periodically asks whether the predecessor of its successor node is n; if not, a new node has joined the group, so n sets its successor to the new intermediate node, and notifies that node to make n the node’s predecessor. Once the successor pointers are updated, the new node has joined the Chord group, and can be accessed by subsequent lookup queries. In the background, keys that are now owned by n will be transferred from n’s successor, and finger tables will be updated lazily: updating fingers promptly is not actually important for good performance, because fingers are used only to quickly reach the right “neighborhood.”
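The stabilization round can be sketched as follows. This is a toy, self-contained model of my own (in-memory objects rather than RPCs between hosts; the `Node` fields and function names are invented), but it follows the periodic ask-and-notify structure described above.

```python
# Toy sketch of Chord stabilization: each node periodically checks whether a
# new node has slipped in between it and its successor, and notifies its
# successor of its own existence.

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

def between(x, a, b):
    """True if x lies strictly inside the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b                # wraps (or a == b: whole circle)

def stabilize(n):
    # Ask our successor for its predecessor; if a node has joined between
    # us and our successor, adopt it, then notify whoever is now next.
    x = n.successor.predecessor
    if x is not None and between(x.id, n.id, n.successor.id):
        n.successor = x
    notify(n.successor, n)

def notify(s, n):
    # s learns that n believes it may be s's predecessor.
    if s.predecessor is None or between(n.id, s.predecessor.id, s.id):
        s.predecessor = n
```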

Handling Node Failure

If a node fails, the data on the node may be lost. This is not Chord’s concern, because it delegates responsibility for replication to the application. Instead, Chord must ensure that the Chord ring is not “broken”: lookup queries can continue, despite the fact that the predecessor of the failed node has an invalid successor node.

Chord achieves that by having each node maintain a list of successors, rather than a single one. If a node notices that its successor has failed, it simply tries the next node in the list. The paper argues that since the failure probabilities of elements of the successor list can be considered to be independent, using a sufficiently large successor list provides arbitrary protection against node failures. The stabilization protocol above is extended to apply to the entire list of successors, not just the first one.
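The failover rule itself is simple, as this small sketch shows; `next_hop` and the heartbeat-style `is_alive` predicate are my own illustrative names, not part of the Chord protocol's vocabulary.

```python
# Toy sketch of successor-list failover: forward to the first live entry in
# the successor list, skipping nodes that have been detected as failed.

def next_hop(successor_list, is_alive):
    for s in successor_list:
        if is_alive(s):
            return s
    raise RuntimeError("all successors failed; ring may be partitioned")
```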


Overall, this is a great paper, and it deserves its reputation as such. It also shows the benefit of combining good systems work with a top-flight theoretician (Karger).

Despite the paper’s formal tone, I was confused by their talk of “correctness”: they never actually define what their correctness criteria are. Their notion of “correctness” is apparently that the Chord ring remains live and can continue answering queries, even if the query answers are incorrect or undetected ring partitions occur. For example, they observe that when a join occurs and before stabilization finishes,

the nodes in the affected region [may] have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, and the lookup may fail. The higher-layer software using Chord will notice that the desired data was not found, and has the option of retrying the lookup after a pause.

That is quite disconcerting, and would be an impediment to using Chord-like systems as a backend for a reliable storage service. Importantly, many applications would like to distinguish between “lookup failed because of stabilization problems” and “lookup failed because the key was definitely not in the Chord ring.” As presented, the protocol does not appear to allow this distinction.

Another troubling aspect of the paper is the author’s attitude toward network partitions, which they regard as a “pathological case.” I thought this was surprising, because (short-lived) network partitions are not that rare. In fact, one of the explicit benefits of loose-consistency systems like DHTs is their ability to tolerate network partitions gracefully. They sketch a few ideas on how to detect and heal partitions, but don’t address this problem in any real depth.

The paper doesn’t talk about an obvious optimization: making “distance” in the Chord ring correlate with distance within the underlying physical network, to reduce communication costs. This is explored in depth by subsequent work.


Filed under Paper Summaries

“Looking Up Data In P2P Systems”

“Looking Up Data In P2P Systems” is a CACM article that provides a brief, high-level overview of research on DHT systems circa 2003. Its focus is on Chord, Pastry, and similar systems that implement a DHT via an overlay network and are primarily intended for WAN environments.

The paper focuses on the “lookup problem”: searching for a data item in a distributed system without using centralized servers or a hierarchical server structure. The paper’s argument is that requiring centralized or privileged nodes inhibits scalability, because it loads certain servers more than others. To avoid the need for centralized resources, P2P lookup systems typically rely on mapping lookup keys into a numeric ID space, typically via hashing. The lookup protocol then uses the local “neighbor” information cached at each node to successively navigate toward nodes that hold keys in the ID space that are “closer” to the lookup key, using a distance metric defined by each protocol. An important real-world optimization is to make the P2P system’s notion of “distance” correspond roughly to network distance, to reduce the number of physical network hops that must be traversed to perform a lookup. P2P lookup systems differ in the amount of state they keep about their neighbors, how the space of nodes is arranged (1-D ring, 2-D space, etc.), the distance metric used, whether the system attempts to protect against malicious nodes joining the group, and so forth.


P2P systems clearly have a role to play, but the hype around this topic has died down considerably since 2003. Modern DHTs are typically employed inside a handful of datacenters, and hence use different algorithms than the WAN-oriented protocols discussed by this paper (e.g. Dynamo is a well-known DHT for datacenter use designed by Amazon).

The problem with treating all the nodes in a P2P system as true “peers” is that it assumes that each node has approximately equal connectivity and resources. This is unlikely to be true, especially in a WAN environment. Hence the use of “SuperNodes” in Kazaa, and other practical systems: to achieve good performance, a P2P system should take advantage of nodes with better connectivity and resources. This amounts to constructing a hierarchical arrangement on-the-fly, but there’s nothing inherently “unscalable” about that. It is interesting that this line of P2P work explores the extreme “decentralized / symmetric” end of the spectrum, but I think most practical systems will incorporate some degree of asymmetry.

Related Reading

“The Surprising Persistence of Peer-To-Peer” discusses the author’s view that P2P research has remained “surprisingly” relevant and active in recent years.


Filed under Paper Summaries