This paper describes the design of Chord, one of the earliest and most popular academic DHT systems (by one measure, the Chord paper is the most-cited CS paper from 2001, with a raft of other DHT papers also ranking highly). Chord’s design focuses on simplicity and on offering provable performance and correctness properties. Chord provides a single service: given a key, it returns the address of the node storing the key. Other practical issues, like replication, caching, naming, and cryptographic authentication, are left to an application layer that runs on top of Chord.
In Chord, nodes and keys are arranged in a circle (the “identifier circle”). Each node maintains the address of its successor; keeping these successor pointers correct is the critical correctness property of the protocol. Each key k is owned by the first node whose identifier (e.g., a hash of its IP address) is equal to or follows k in the identifier circle. Therefore, keys can be located by simple sequential search; the search terminates when it reaches the node that ought to own the key.
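To make the ownership rule concrete, here is a minimal sketch of the sequential search. The node identifiers, the `successor` dictionary, and the function names are my own illustrative assumptions, not code from the paper; a real deployment would send a message to each node rather than read a shared dictionary.

```python
def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    # The interval wraps past zero; a == b means the whole circle.
    return x > a or x <= b


def find_owner_linear(start, key, successor):
    """Follow successor pointers from `start` until the owner of `key` is found.

    `successor` maps each node id to the id of the next node on the circle
    (a hypothetical stand-in for asking each node over the network).
    Returns (owner id, number of hops taken).
    """
    node, hops = start, 0
    while not in_half_open(key, node, successor[node]):
        node = successor[node]
        hops += 1
    return successor[node], hops


# Ten example nodes on a 6-bit identifier circle (ids 0..63).
ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
successor = {n: ring[(i + 1) % len(ring)] for i, n in enumerate(ring)}

print(find_owner_linear(8, 54, successor))  # (56, 7)
print(find_owner_linear(8, 60, successor))  # wraps around to node 1
```

The hop count grows linearly with the number of nodes between the starting node and the key, which is the cost the finger table is meant to avoid.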
To improve search performance, Chord also maintains a finger table. The finger table is essentially a skip list: if there are N nodes in the DHT, each finger table has O(log N) entries, holding the address of a node 1/2, 1/4, 1/8, … of the way “forward” from the current node in the identifier circle. This allows the search procedure to perform something like a binary search of the identifier circle, quickly approaching the “neighborhood” of the target key. Note that each node therefore maintains more information about its local neighborhood than about distant nodes, which makes sense: the nearby entries are the ones needed for the final, precise steps of a lookup.
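The finger-table lookup can be sketched as follows. This is a single-process simulation under my own assumptions (a 6-bit identifier space, a globally known sorted `RING`, and hypothetical helper names); in the real protocol each step is a remote call and no node sees the whole ring.

```python
M = 6                                              # identifier bits (assumed)
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]      # sorted example node ids


def in_open(x, a, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


def successor_of(ident, ring=RING):
    """First node whose id is equal to or follows `ident` on the circle."""
    for n in ring:
        if n >= ident:
            return n
    return ring[0]


def finger_table(n, m=M, ring=RING):
    """Entry i points at the successor of n + 2**i (mod 2**m)."""
    return [successor_of((n + 2 ** i) % 2 ** m, ring) for i in range(m)]


def closest_preceding(n, key, fingers):
    """Farthest finger of n that still lies strictly before `key`."""
    for f in reversed(fingers[n]):
        if in_open(f, n, key):
            return f
    return n


def find_successor(n, key, fingers):
    """Jump along fingers until `key` falls between a node and its successor."""
    hops = 0
    while not in_half_open(key, n, fingers[n][0]):  # fingers[n][0] is n's successor
        n = closest_preceding(n, key, fingers)
        hops += 1
    return fingers[n][0], hops


fingers = {n: finger_table(n) for n in RING}
print(finger_table(8))                  # [14, 14, 14, 21, 32, 42]
print(find_successor(8, 54, fingers))   # (56, 2)
```

On this ring the lookup for key 54 goes 8, 42, 51 and then returns node 56, compared with seven hops when only successor pointers are followed.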
Joining A Chord Group
To join a Chord group, a joining node begins by computing its predecessor and finger table. It then joins the group by notifying its successor node, which records the new node as its predecessor. Each node runs a “stabilization” protocol to maintain the correctness of the successor pointers: each node n periodically asks whether the predecessor of its successor node is n; if not, a new node has joined the group, so n sets its successor to the new intermediate node, and notifies that node to make n its predecessor. Once the successor pointers are updated, the new node has joined the Chord group, and can be reached by subsequent lookup queries. In the background, keys that are now owned by the new node will be transferred from the new node’s successor, and finger tables will be updated lazily: updating fingers promptly is not actually important for good performance, because fingers are used only to quickly reach the right “neighborhood.”
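Here is a small in-memory sketch of join plus stabilization. It is a simplification under my own assumptions: nodes are Python objects that call each other directly, lookups use successor pointers only, and key transfer and finger updates are not modeled. The method names loosely mirror the paper's pseudocode, but the code itself is illustrative.

```python
def in_open(x, a, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b


def in_half_open(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b


class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self      # a lone node is its own successor
        self.predecessor = None

    def find_successor(self, key):
        node = self
        while not in_half_open(key, node.id, node.successor.id):
            node = node.successor
        return node.successor

    def join(self, known):
        """Join through any existing node; only the successor is learned here."""
        self.predecessor = None
        self.successor = known.find_successor(self.id)

    def stabilize(self):
        """Adopt a node that slipped in between us and our successor, then notify."""
        x = self.successor.predecessor
        if x is not None and in_open(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        """`candidate` believes it may be our predecessor."""
        if self.predecessor is None or in_open(
            candidate.id, self.predecessor.id, self.id
        ):
            self.predecessor = candidate


def run_rounds(nodes, rounds=3):
    """Stand-in for the periodic timer that drives stabilization."""
    for _ in range(rounds):
        for node in nodes:
            node.stabilize()


a, b = Node(10), Node(40)
b.join(a)
run_rounds([a, b])
c = Node(25)                # joins between a and b
c.join(a)
run_rounds([a, b, c])
print([(n.id, n.successor.id) for n in (a, c, b)])  # [(10, 25), (25, 40), (40, 10)]
```

Note that immediately after `c.join(a)` node 10 still points at node 40; only a later stabilization round repairs that pointer, and a lookup issued in that window can return the wrong owner.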
Handling Node Failure
If a node fails, the data on the node may be lost. This is not Chord’s concern, because it delegates responsibility for replication to the application. Instead, Chord must ensure that the Chord ring is not “broken”: lookup queries can continue, despite the fact that the predecessor of the failed node has an invalid successor node.
Chord achieves that by having each node maintain a list of successors, rather than a single one. If a node notices that its successor has failed, it simply tries the next node in the list. The paper argues that since the failure probabilities of elements of the successor list can be considered to be independent, using a sufficiently large successor list provides arbitrary protection against node failures. The stabilization protocol above is extended to apply to the entire list of successors, not just the first one.
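The failover rule and the independence argument can be sketched in a few lines. The function names, the `is_alive` callback, and the example failure probability are hypothetical; the calculation simply assumes that each successor fails independently with the same probability p, so a list of r successors is entirely dead with probability p**r.

```python
def first_live_successor(successor_list, is_alive):
    """Return the first reachable entry of the successor list, or None.

    `is_alive` is a hypothetical liveness probe (e.g. an RPC with a timeout).
    """
    for node in successor_list:
        if is_alive(node):
            return node
    return None  # every listed successor failed: the ring is broken at this node


def all_fail_probability(p, r):
    """Probability that all r successors are down, assuming independent failures."""
    return p ** r


failed = {14, 21}
print(first_live_successor([14, 21, 32], lambda n: n not in failed))  # 32
print(all_fail_probability(0.5, 10))                                   # 0.0009765625
```

Even with an unrealistically high per-node failure probability of one half, ten successors leave roughly a one-in-a-thousand chance of a break at any given node, which is why a modest list length goes a long way when failures really are independent.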
Overall, this is a great paper, and it deserves its reputation as such. It also shows the benefit of combining good systems work with a top-flight theoretician (Karger).
Despite the paper’s formal tone, I was confused by their talk of “correctness”: they never actually define what their correctness criteria are. Their notion of “correctness” is apparently that the Chord ring remains live and can continue answering queries, even if the query answers are incorrect or undetected ring partitions occur. For example, they observe that when a join occurs and before stabilization finishes,
the nodes in the affected region [may] have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, and the lookup may fail. The higher-layer software using Chord will notice that the desired data was not found, and has the option of retrying the lookup after a pause.
That is quite disconcerting, and would be an impediment to using Chord-like systems as a backend for a reliable storage service. Importantly, many applications would like to distinguish between “lookup failed because of stabilization problems” and “lookup failed because the key was definitely not in the Chord ring.” As presented, the protocol does not appear to allow this distinction.
Another troubling aspect of the paper is the authors’ attitude toward network partitions, which they regard as a “pathological case.” I found this surprising, because (short-lived) network partitions are not that rare. In fact, one of the explicit benefits of loose-consistency systems like DHTs is their ability to tolerate network partitions gracefully. The authors sketch a few ideas on how to detect and heal partitions, but don’t address the problem in any real depth.
The paper doesn’t talk about an obvious optimization: making “distance” in the Chord ring correlate with distance within the underlying physical network, to reduce communication costs. This is explored in depth by subsequent work.