Tag Archives: routing

“ExOr: Opportunistic Multi-Hop Routing for Wireless Networks”

Wired networks are traditionally viewed as a unicast medium: packets are sent from A to B. Because the underlying medium can have a degree of sharing (e.g. multiple hosts connected to the same Ethernet hub), steps must be taken to avoid interference between concurrent senders (e.g. CSMA).

In a wireless network, the medium is shared: any hosts in range of a radio transmission can receive it. If the goal is to use a traditional unicast routing protocol over a wireless link, this makes interference between senders a more challenging problem (e.g. as discussed in the MACAW paper). However, rather than viewing broadcast as an inconvenience we need to work around, the broadcast nature of wireless can also be leveraged to design new protocols that would be infeasible for a wired network.

A nice example of a broadcast-oriented protocol for wireless is “ExOr: Opportunistic Multi-Hop Routing for Wireless Networks”. In a traditional routing design, a sender chooses the next hop that should receive a packet, and then unicasts the packet to that destination. In ExOr, a sender broadcasts a batch of packets to a group of nodes simultaneously. The set of batch recipients coordinate to try to ensure that each packet in the batch is forwarded onward. The recipient of a packet is only chosen after the packet is sent, which allows ExOr to “opportunistically” take advantage of links that have high loss rates. ExOr assumes that node reception probabilities are mostly independent and mostly decrease with distance — both of which are probably pretty reasonable assumptions.


The source collects a batch of packets destined for the same host, and then chooses a forwarding list for the batch. The forwarding list is a list of nodes, sorted by the expected cost of delivering packets from that node to the eventual destination node. The cost metric is similar to ETX (expected number of transmissions required to send the packet to the destination via unicast, including retransmissions); unlike ETX, it only considers the forward delivery probability. While it would be possible to include all possible recipient nodes in the forwarding list, this would increase the coordination cost among the forwarders, so ExOr only includes “likely” recipients in the list (estimated 10%+ chance of receiving a broadcast packet).

Each packet contains a batch map, which holds the sender’s estimate of the highest priority (according to the cost metric) node to have received each packet in the batch. When a node receives a packet, it uses the packet’s batch map to update its local batch map. This means that batch map information propagates through the nodes, carrying information about packet reception from high priority nodes to lower priority nodes.
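As a sketch, the batch-map update described above amounts to a per-packet “max” over node priorities. The data structure and function names here are mine, not the paper’s:

```python
def merge_batch_maps(local, received, priority):
    """Update a node's local batch map from the map carried in a received packet.

    Both maps take a packet index to the node believed to have received that
    packet; priority[n] is a node's rank in the forwarding list (lower rank =
    closer to the destination, i.e. higher priority).
    """
    for pkt, node in received.items():
        current = local.get(pkt)
        if current is None or priority[node] < priority[current]:
            local[pkt] = node  # a higher-priority node has this packet
    return local
```

This captures how reception information only ever moves “up” toward higher-priority holders as batch maps propagate.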

After the source broadcasts a batch, each member of the forwarding list broadcasts, ordered by descending priority (ETX value). Each node broadcasts the packets it received, along with its updated batch map. Nodes coordinate to schedule their transmissions to try to avoid interference (e.g. by estimating each node’s expected transmission time from its ETX value and batch map contents). The protocol continues cycling through the nodes in priority order. At the end of each cycle, the ultimate destination broadcasts its batch map 10 times; at the beginning of each cycle, the source resends packets that weren’t received by any node (by observing batch map contents). The ExOr scheme stops when 90% of the packets in a batch have been transferred, and uses a traditional routing policy to deliver the remainder of the batch.


Overall, ExOr is a really neat idea. That said, it is at best a special-purpose solution, because of the high latency it incurs for batching and multiple transmission rounds. Similarly, the need for a split TCP proxy is pretty ugly. A complete solution to wireless routing would perhaps adaptively switch between latency-oriented and bandwidth-oriented routing techniques, depending on the nature of the traffic.

It seems arbitrary to me that ExOr works on a batch-by-batch basis—that almost seems like establishing a new TCP connection for each window’s worth of data. The amount of useful work done in each transmission round decreases, until the protocol imposes an arbitrary cutoff at 90% and switches to a traditional routing protocol. Instead, wouldn’t it be more sensible for ExOr to allow new packets to be inserted into the batch as the destination confirms the delivery of packets? This would essentially emulate the “conservation of packets” principle from the Van Jacobson paper.


Filed under Paper Summaries


One of the challenges faced in networking research and education is the difficulty of accurately recreating a network environment. Production routers are implemented as hardware devices, but designing and fabricating a new hardware design for research or teaching is very expensive. Modular software routers like Click enable research and education on routing in software, but don’t allow experimentation with hardware and physical-layer issues.

NetFPGA aims to rectify this, by providing an easily programmable hardware platform for network devices (NICs, switches, routers, etc.). Users interact with NetFPGA remotely, uploading programs that configure FPGAs on the device appropriately, and then observing the results. The initial version of NetFPGA didn’t have a CPU: instead, software to control the device runs on another computer, and accesses hardware registers remotely (using special Ethernet packets that are decoded by the NetFPGA board). The second version of NetFPGA has an optional hardware CPU.


This seems like a pretty incontrovertible “good idea.” One potential question is how the performance of an FPGA-based router compares to that of realistic production hardware — I’d suspect that NetFPGA-routers occupy a middle ground between software-only routers (easy to develop, poor performance) and production routers (difficult to program and expensive to fabricate, but good performance).


Filed under Paper Summaries

“A Policy-Aware Switching Layer for Data Centers”

Many data center networks are configured as a single Layer 2 network, with the switches configured into a spanning tree to avoid routing loops (see the SEATTLE paper for some more background on how this works). However, in practice, simply finding a Layer 2 path to the appropriate destination host is not enough: data centers also include various “middleboxes” (e.g. firewalls, load balancers, and SSL offloaders) that intercept and sometimes modify Ethernet frames in transit. To ensure that traffic is routed to middleboxes as appropriate, network administrators typically modify Layer 2 path selection. This has a number of problems:

  • Correctness: How can the administrator be assured that frames always traverse the appropriate sequence of middleboxes, even in the face of network churn and switch failures?
  • Flexibility: Changing the middlebox routing policy should be easy to do.
  • Efficiency: Frames should not be routed to middleboxes that aren’t interested in those frames.

To address these problems, “A Policy-Aware Switching Layer for Data Centers” proposes that Layer 2 routing policies be specified in a declarative language at a centralized administrative location. These policies are then compiled into a set of rules that are installed onto “policy switches” (pswitches). Pswitches make Layer 2 routing decisions by consulting a policy table, rather than the “destination MAC => switch port” mapping table used in traditional Layer 2 switches. New policies can be deployed by simply modifying the centralized configuration, which then disseminates them to the appropriate pswitches.

Policies take the form “At location X, send frames like Y to Z”, where Z is a sequence of middlebox types. Writing policies in terms of middlebox types, rather than talking about individual middlebox instances, makes it easier to reason about the correctness and how to handle failures without violating the policy. Policies are compiled into sets of rules, which take the form “For frames that arrive from hop X like Y, send to hop Z” — this should admit an efficient implementation.
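A pswitch’s rule lookup might look like the following sketch. The rule triples follow the “for frames that arrive from hop X like Y, send to hop Z” form above; the predicate representation is my own assumption:

```python
def next_hop(rules, prev_hop, frame):
    """Look up the forwarding decision for a frame.

    Each rule is a (prev_hop, predicate, next_hop) triple, mirroring the
    "for frames that arrive from hop X like Y, send to hop Z" rule form.
    """
    for rule_prev, matches, nxt in rules:
        if rule_prev == prev_hop and matches(frame):
            return nxt
    return None  # no policy applies; fall back to normal Layer 2 forwarding
```

For example, a rule sending inbound web traffic through a firewall might be `("ingress", lambda f: f.get("tcp_dport") == 80, "firewall")`.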

To avoid sending frames to middleboxes that are not interested in them, middleboxes are removed from the physical network data path—instead, pswitches explicitly forward frames to middleboxes as required. The authors argue that because data center interconnects are relatively low latency, this doesn’t have a major performance impact.


Making this idea work in practical networks requires addressing a lot of details. For example:

  • To support load balancers, the right-hand side of a policy can be a list of middlebox types. The pswitch uses a hash partitioning scheme that is careful to send all the packets belonging to either direction of a flow to the same destination.
  • To allow this work to be deployed without modifying the entire network, policy-based routing is implemented by encapsulating frames: to implement policy-based routing, a frame is encapsulated in another frame with a destination MAC of the next hop (middlebox or server) required by the policy. Frames are decapsulated before being delivered to middleboxes.
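The flow-affinity property from the first bullet can be had by hashing the unordered endpoint pair, so both directions of a flow land in the same bucket. This is an illustrative scheme; the paper’s actual partitioning differs in detail:

```python
import hashlib

def flow_bucket(addr_a, addr_b, n_instances):
    """Pick a middlebox instance for a flow such that both directions map to
    the same instance, by hashing the sorted (order-independent) address pair.
    """
    lo, hi = sorted([addr_a, addr_b])
    digest = hashlib.sha1(f"{lo}|{hi}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_instances
```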


I like this paper. There are only two obvious concerns I can see: the deployment difficulty of any attempt to redesign how switches operate, and the performance cost of (1) rule table lookups rather than normal Layer 2 forwarding and (2) the additional latency of traversing middleboxes that are off the network data path. Compared with a traditional software-based router, their software-based pswitch implementation achieves 82% throughput and adds an additional 16% latency. Middleboxes deployed with this architecture suffer much more overhead: 40% throughput and twice the latency. It’s hard to say how the performance of a hardware-based pswitch implementation would compare with regular hardware-based routers.

The authors argue that the additional hop required to traverse a middlebox is acceptable, because data center networks are typically low latency. This seems backward to me: data center applications might actually want to take advantage of low latency links (e.g. see RAMCloud), and won’t like the network infrastructure adding additional latency overhead without a compelling reason.

The idea of taking middleboxes off the data path seems orthogonal to the paper’s main contribution (which is the idea of a high-level specification of Layer 2 routing invariants). It might be nice to separate these two ideas more distinctly.


Filed under Paper Summaries

“Internet Indirection Infrastructure”

“Internet Indirection Infrastructure” (i3) proposes a communication abstraction in which applications send packets to identifiers, rather than network addresses. Receivers register triggers that express their interest in packets sent to a particular identifier — the i3 infrastructure takes care of the routing. i3 is designed to be a general-purpose substrate, which can be used to build communication patterns including multicast, anycast, load balanced server selection, and host mobility.

Basic Model

To register interest in data, a receiver registers a trigger for the pair (id, addr) — this specifies that packets sent to identifier id should be routed to IP address addr. Multiple receivers for a single ID can be registered, in which case the packet is delivered to all matching trigger addresses. i3 generalizes this basic model in a few ways:

  • Inexact matching. Identifiers are divided into two pieces: an “exact match prefix” of k bits, and a suffix that is used to do longest-prefix matches. That is, a trigger matches an identifier if the first k bits match exactly, and if no other trigger has a longer prefix match against the rest of the identifier.
  • Identifier stacks. Rather than sending a packet to a single identifier, packets can instead be sent to a stack of identifiers. Similarly, triggers are generalized to mean “When a match for id is found, push a, b, … onto the stack.” i3 looks for a trigger that matches the first element of the stack; if no such trigger is found, the first element is popped from the stack. This continues until the stack is empty or a matching trigger is found; if a match is found, the trigger’s identifier stack is pushed onto the packet’s stack, and routing continues anew. i3 assumes that this process will eventually result in matching a trigger against an identifier holding an IP address. When that occurs, the packet’s data and the rest of the identifier stack are sent to the address.
  • Private triggers. This is not an i3-level concept, but a design pattern that i3-using applications often follow. A “public” trigger is a trigger on a well-known identifier, and is used to initiate a session with a server; “private” triggers are used to communicate between pairs of end hosts. A pair of private trigger identifiers might be exchanged via a service’s public trigger identifier, and then subsequent communication will proceed via the private triggers.
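The inexact-matching rule from the first bullet can be sketched as follows. Identifiers are written as '0'/'1' strings and k is the exact-match prefix length; tie-breaking and identifier width are my assumptions, not the paper’s spec:

```python
def common_prefix_len(a, b):
    """Length of the shared prefix of two bit strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match_trigger(trigger_ids, ident, k):
    """Among triggers whose first k bits equal ident's first k bits, return
    the one whose remaining bits share the longest prefix with ident's
    remaining bits; None if no trigger passes the exact-match stage."""
    candidates = [t for t in trigger_ids if t[:k] == ident[:k]]
    if not candidates:
        return None
    return max(candidates, key=lambda t: common_prefix_len(t[k:], ident[k:]))
```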


i3 is implemented on top of a distributed hash table (DHT); the authors use the Chord DHT in the paper. The DHT provides a service to find the node associated with a key; for the purposes of i3, this allows a packet sender to find the node that holds the triggers for the first identifier in the packet’s identifier stack (the DHT lookup is only done on the “exact match prefix” of the identifier). The target i3 node then finds the matching trigger, if any, by doing a longest-prefix match against trigger identifiers. Once a match is found, the packet is routed to the receiver address or addresses via IP, or another DHT lookup occurs (if the trigger contains an ID stack, not an address).

Senders cache the IP address of the i3 node responsible for each identifier. Hence, in the common case, routing an i3 packet requires two physical network traversals: one to get from the sender to the i3 node, and then another from the i3 node to the receiver.

Example Usage

i3 provides direct support for simple multicast; anycast is supported using inexact identifier matching, and appropriate choices for the inexact suffix of the identifier. Host mobility can be supported by simply updating a receiver’s trigger. Multicast trees can be implemented by constructing routing trees in i3, using the identifier stack feature (triggers that “invoke” triggers can essentially be used to construct an arbitrary graph, and to do recursive queries over graphs).


In the paper’s experiments, the latency to route the first packet to an identifier is about 4 times that of raw IP routing, because a Chord lookup must be done (after applying some standard Chord tricks to try to pick the “closest” Chord node among the alternatives as the ring is traversed). To reduce the latency required to route subsequent packets, i3 employs two techniques:

  • Caching the address of the i3 node that owns a given identifier, as discussed above.
  • A receiver generates k random trigger identifiers, and then chooses the identifier that it can reach via IP with the lowest latency.

By having each receiver generate 15-30 random triggers with this technique, the authors are able to reduce the i3 routing latency for subsequent packets sent to an identifier to between 1.5 and 2 times that of raw IP.


Overall, I liked this paper: it is an elegant idea that is explained well. A cynical view of this paper is that it isn’t all that innovative or “deep”: it is just a thin logic layer on top of a DHT. i3 is basically a generalization of the name-to-value lookup service provided by the DHT.

Both performance and security are serious issues with this design, and the paper didn’t really convince me that their approach addresses these problems.

I would have liked to see a performance comparison between IP-level multicast and anycast with implementations built on i3.

It would be interesting to compare i3 with group communication systems (GCS) that provide totally-ordered broadcast messages. What features could you add to i3 to make it easier to write reliable distributed systems?


Filed under Paper Summaries

“Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications”

This paper describes the design of Chord, one of the earliest and most popular academic DHT systems (according to one measure, the Chord paper is the most-cited CS paper from 2001, along with a raft of other DHT papers). Chord’s design focuses on simplicity and offering provable performance and correctness properties. Chord provides a single service: given a key, it returns the address of the node storing the key. Other practical issues, like replication, caching, naming, and cryptographic authentication, should be provided by an application level that runs on top of Chord.

Basic Protocol

In Chord, nodes and keys are arranged in a circle (the “identifier circle”). Each node maintains the address of its successor in the circle; maintaining the correctness of these successor pointers is the critical correctness property of the protocol. Each key k is owned by the first node whose identifier (e.g. hash of IP address) follows k in the identifier circle. Therefore, keys can be located by a simple sequential search; the search terminates when it reaches the node that ought to own the key.

To improve search performance, Chord also maintains a finger table. The finger table is essentially a skip list: if there are N nodes in the DHT, each finger table has O(log N) entries, holding the address of a node 1/2, 1/4, 1/8, … of the way “forward” from the current node in the identifier circle. This essentially allows the search procedure to perform a binary search of the identifier circle, to quickly approach the “neighborhood” of the target key. Note that this means that each node maintains more information about its local neighborhood than about distant nodes, which makes sense.
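The lookup procedure can be sketched in a toy identifier space. The Node fields and ring-interval helpers are my own names, not Chord’s API, and the finger table is filled in naively rather than at the id + 2^i positions:

```python
def in_half_open(x, a, b):
    """True if x is in the ring interval (a, b], wrapping around zero."""
    if a < b:
        return a < x <= b
    return x > a or x <= b

def in_open(x, a, b):
    """True if x is in the open ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.fingers = []  # nodes roughly at id + 2**i around the ring

    def find_successor(self, key):
        """Walk toward the key, halving the remaining distance via fingers."""
        node = self
        while not in_half_open(key, node.id, node.successor.id):
            node = node.closest_preceding(key)
        return node.successor

    def closest_preceding(self, key):
        # Scan the finger table from the farthest entry backward.
        for f in reversed(self.fingers):
            if in_open(f.id, self.id, key):
                return f
        return self.successor
```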

Joining A Chord Group

To join a Chord group, a joining node begins by computing its predecessor and finger tables. It then joins the group by notifying its successor node, which includes the new node as its predecessor. Each node runs a “stabilization” protocol to maintain the correctness of the successor pointers: each node n periodically asks whether the predecessor of its successor node is n; if not, a new node has joined the group, so n sets its successor to the new intermediate node, and notifies that node to make n the node’s predecessor. Once the successor pointers are updated, the new node has joined the Chord group, and can be accessed by subsequent lookup queries. In the background, keys that are now owned by n will be transferred from n’s successor, and finger tables will be updated lazily: updating fingers promptly is not actually important for good performance, because fingers are used only to quickly reach the right “neighborhood.”
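A self-contained sketch of one stabilization round as described above (the Node fields and helper names are mine):

```python
def in_open(x, a, b):
    """True if x is in the open ring interval (a, b), wrapping around zero."""
    if a < b:
        return a < x < b
    return x > a or x < b

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

def stabilize(n):
    """One stabilization round run periodically by node n."""
    x = n.successor.predecessor
    if x is not None and in_open(x.id, n.id, n.successor.id):
        n.successor = x  # a new node slotted in between n and its successor
    # "notify": the successor adopts n as predecessor if n is closer to it
    s = n.successor
    if s.predecessor is None or in_open(n.id, s.predecessor.id, s.id):
        s.predecessor = n
```

Running this at every node eventually repairs the successor/predecessor pointers after a join: a new node first points at its successor, then the surrounding nodes discover it through their stabilization rounds.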

Handling Node Failure

If a node fails, the data on the node may be lost. This is not Chord’s concern, because it delegates responsibility for replication to the application. Instead, Chord must ensure that the Chord ring is not “broken”: lookup queries can continue, despite the fact that the predecessor of the failed node has an invalid successor node.

Chord achieves that by having each node maintain a list of successors, rather than a single one. If a node notices that its successor has failed, it simply tries the next node in the list. The paper argues that since the failure probabilities of elements of the successor list can be considered to be independent, using a sufficiently large successor list provides arbitrary protection against node failures. The stabilization protocol above is extended to apply to the entire list of successors, not just the first one.
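The failover rule is simple enough to sketch directly; `is_alive` stands in for whatever liveness check (timeouts, pings) an implementation would use:

```python
def first_live_successor(successor_list, is_alive):
    """Return the first reachable node in the successor list, skipping
    failed nodes, per the failover rule described above."""
    for node in successor_list:
        if is_alive(node):
            return node
    raise RuntimeError("all successors failed; the ring is broken")
```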


Overall, this is a great paper, and it deserves its reputation as such. It also shows the benefit of combining good systems work with a top-flight theoretician (Karger).

Despite the paper’s formal tone, I was confused by their talk of “correctness”: they never actually define what their correctness criteria are. Their notion of “correctness” is apparently that the Chord ring remains live and can continue answering queries, even if the query answers are incorrect or undetected ring partitions occur. For example, they observe that when a join occurs and before stabilization finishes,

the nodes in the affected region [may] have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, and the lookup may fail. The higher-layer software using Chord will notice that the desired data was not found, and has the option of retrying the lookup after a pause.

That is quite disconcerting, and would be an impediment to using Chord-like systems as a backend for a reliable storage service. Importantly, many applications would like to distinguish between “lookup failed because of stabilization problems” and “lookup failed because the key was definitely not in the Chord ring.” As presented, the protocol does not appear to allow this distinction.

Another troubling aspect of the paper is the author’s attitude toward network partitions, which they regard as a “pathological case.” I thought this was surprising, because (short-lived) network partitions are not that rare. In fact, one of the explicit benefits of loose-consistency systems like DHTs is their ability to tolerate network partitions gracefully. They sketch a few ideas on how to detect and heal partitions, but don’t address this problem in any real depth.

The paper doesn’t talk about an obvious optimization: making “distance” in the Chord ring correlate with distance within the underlying physical network, to reduce communication costs. This is explored in depth by subsequent work.


Filed under Paper Summaries

“A High-Throughput Path Metric for Multi-Hop Wireless Routing”

“A High-Throughput Path Metric for Multi-Hop Wireless Routing” presents the expected transmission count (ETX) metric for finding high-throughput paths on multi-hop wireless networks. By considering link loss rates, asymmetrical loss ratios between the directions of a link, and interference, the paper argues that ETX substantially outperforms routing protocols based on finding the shortest path (fewest hops) in an ad-hoc wireless network.


There are several shortcomings with using minimum hop count for routing in ad-hoc wireless networks:

  • Shortest-path routing protocols essentially assume that either a link works (probe packets are delivered successfully), or it doesn’t. While wireless networks can be roughly classified in this manner, wireless links have a much richer space of performance and loss characteristics than “it works” or “it doesn’t”.
  • Minimizing the hop count typically means maximizing the physical distance traveled by each hop, which leads to selecting links with weak signals and high loss rates. As the related RoofNet paper discusses, it is often preferable to utilize a path of short, high-quality links over a single long-distance, low-quality link.
  • In a dense ad-hoc network, there may be many different routes with the same number of hops. Deciding between routes on the basis of hop count is therefore not sufficient in any case.

There are problems with two naive alternatives to min-hop-count:

  • An improved routing metric could simply choose the path by maximizing the product of the per-link delivery ratios. This is inadequate, however, because it would result in preferring a two-hop route with perfect delivery to a one-hop route with a low loss rate — the latter route is actually better in many cases.
  • Choosing links on the basis of end-to-end delay is problematic, because of oscillation: once a route is utilized for traffic, its delay will often increase. This will result in the routing metric choosing another path, whose latency will then increase, causing it to switch back to the first path, and so on.

ETX Metric Design

The ETX metric for a link is the expected number of data transmissions required to send a packet of a certain size over the link. The ETX metric for a path is the sum of the per-link ETX metrics. ETX calculates per-link delivery ratios for both directions of a link (transmit and receive), to take into account the asymmetrical nature of wireless communication. Delivery ratios are calculated by periodic probes of a fixed size (sent once per second by default via broadcast); each node calculates per-link delivery ratios based on the most recent 10 seconds of activity.
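Concretely, if df is the forward delivery ratio (the data packet arrives) and dr the reverse ratio (the ACK comes back), each transmission attempt succeeds with probability df × dr, so the expected number of attempts is 1/(df × dr). A minimal sketch (function names are mine):

```python
def link_etx(df, dr):
    """Expected transmissions on a link: each attempt succeeds only if the
    data packet is delivered (probability df) AND its ACK returns
    (probability dr), so the expectation is 1 / (df * dr)."""
    return 1.0 / (df * dr)

def path_etx(links):
    """The ETX of a path is the sum of its per-link ETX values; links is a
    sequence of (df, dr) pairs."""
    return sum(link_etx(df, dr) for df, dr in links)
```

Note how this encodes the trade-off discussed earlier: a single hop with 50% forward delivery has ETX 2, the same as two perfect hops.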

ETX assumes that radios have a fixed transmit speed, and that congestion is not a factor. Neither assumption is very realistic: variable speed radios are now pervasive, and congestion might well be a factor in ad-hoc networks with sizable user bases. ETX also assumes that delivery rates are the same regardless of packet size, which is probably inaccurate (e.g. ACKs are small and take less time to transfer).


I was surprised that the paper spent so much time comparing ETX with min-hop-count metrics: it seems quite straightforward that min-hop-count would fare pretty poorly in an ad-hoc wireless network. In that sense, performance comparisons between ETX and min-hop-count are also not that informative: of course ETX will beat min-hop-count, because the latter is simply a terrible choice for this environment. It seems surprising that min-hop-count is truly the best example of prior work they could measure against — why not compare with the naive “maximize product of per-link delivery ratio” scheme, for example?

I didn’t get that much out of this paper: the RoofNet paper already summarizes an improved version of the ETX metric, and because the evaluation compares ETX against such a pessimal alternative (min-hop-count), the measurements themselves aren’t very informative either.


Filed under Paper Summaries

“Congestion Control for High Bandwidth-Delay Product Networks”

Most of the papers we’ve read on improving congestion behavior have an “incremental” flavor: they begin with traditional TCP and router behavior, and then suggest a conservative change that addresses some particular problem (e.g. slow start, better RTO estimation, fast retransmit, fair queueing, etc.). In contrast, this paper takes a “clean slate” approach:

Our initial objective is to step back and rethink Internet congestion control without caring about backward compatibility or deployment. If we were to build a new congestion control architecture from scratch, what might it look like?

Efficiency and fairness

In TCP, efficiency and fairness are coupled, because TCP uses a single procedure (“additive increase, multiplicative decrease”) to try to achieve both goals. The paper argues that these goals should really be treated separately: efficiency describes how close the link’s aggregate throughput approaches the link capacity, whereas fairness concerns how the current link throughput is distributed among the flows passing through a link.

Therefore, XCP separates these two goals: an efficiency controller is responsible for deciding whether to increase or decrease the link’s aggregate throughput, and a fairness controller is tasked with translating that aggregate feedback into per-flow feedback in order to achieve or maintain fairness. Separating these two components makes it easier to analyze their behavior (e.g. the efficiency controller doesn’t need to account for the number of flows), and allows new component designs to be considered in isolation.

The paper proposes using a multiplicative-increase, multiplicative-decrease (MIMD) scheme for the efficiency controller, and a traditional additive-increase, multiplicative-decrease (AIMD) scheme for the fairness controller. The paper argues that using MIMD for efficiency is important, because the AIMD policy traditionally used by TCP takes a long time to reach full capacity on links with a high bandwidth-delay product.
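A toy illustration of that ramp-up gap: count the RTTs for a congestion window to reach link capacity under additive vs. multiplicative growth (the constants here are arbitrary, chosen only to make the gap visible):

```python
def rtts_to_capacity(start, capacity, grow):
    """RTTs until the window first reaches capacity under a growth rule."""
    window, rtts = start, 0
    while window < capacity:
        window = grow(window)
        rtts += 1
    return rtts

def additive(window):        # AIMD-style probing: one more packet per RTT
    return window + 1

def multiplicative(window):  # MIMD-style probing: double the window per RTT
    return window * 2
```

Growing from 1 packet to a 1024-packet window takes 1023 RTTs additively but only 10 RTTs multiplicatively, which is the paper’s point about high bandwidth-delay paths.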

Explicit congestion signals

XCP moves away from packet drops as an implicit signal of network conditions, for a few pretty convincing reasons:

  • Congestion is not the only source of loss, so clients cannot distinguish between losses due to errors and losses due to congestion.
  • Detecting packet loss requires waiting for the RTO timeout, duplicate ACKs, or similar — the extra delay incurred makes clients respond more slowly to congestion signals, which makes the network more unstable. This also means clients must be more conservative when probing for network capacity, because the penalty for “overshooting” is more severe.
  • Packet loss is a binary signal; an explicit congestion notification can carry more information about the state of the network.

That makes a lot of sense, but it seems quite obvious and not really novel: ECN for IP has been around for a long time, and DECNet did something very similar years before this paper was written. I would guess the slow deployment of ECN is mostly related to pragmatism (e.g. support for ECN in IP stacks and routers), which developing a completely new transport protocol would only exacerbate.


In XCP, senders attach a “congestion header” to each outgoing packet. This contains the sender’s current congestion window size, RTT estimate, and the sender’s desired window size increase (“H_feedback”). H_feedback is modified by XCP routers, and the updated value is returned in the ACK for a sent packet. The feedback value instructs senders to either increase or decrease their congestion windows.

In an XCP router, the efficiency controller first tries to adjust the aggregate link throughput to approach full utilization. The output of the efficiency controller is then fed to the fairness controller, which translates the global increase/decrease signal into feedback for individual senders, in such a way that fairness is increased or maintained.
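For concreteness, the paper’s efficiency controller computes its aggregate feedback per control interval as φ = α·d·S − β·Q, where d is the mean RTT, S the spare bandwidth (capacity minus input traffic rate), Q the persistent queue size, and α = 0.4, β = 0.226 are its stability-derived constants. A sketch (the function shape and names are mine):

```python
ALPHA, BETA = 0.4, 0.226  # stability-derived constants from the paper

def aggregate_feedback(mean_rtt, capacity, input_rate, persistent_queue):
    """Aggregate window change (phi = alpha*d*S - beta*Q) that the efficiency
    controller hands to the fairness controller for per-flow apportioning."""
    spare = capacity - input_rate  # S: positive when the link is underused
    return ALPHA * mean_rtt * spare - BETA * persistent_queue
```

Positive feedback tells senders collectively to grow their windows; negative feedback (an overdriven link or a standing queue) tells them to shrink.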


The feasibility of separating fairness and efficiency hinges on whether these two goals really are independent: in practice, can the efficiency controller be replaced without any consideration of the fairness controller, and vice versa? I was initially skeptical that such a clean separation would be possible, but I think the paper’s scheme is pretty compelling.

Part of the paper’s motivation is the idea that networks with high bandwidth-delay products will become increasingly common in the future. If we ignore special cases like satellite links, then high-bandwidth networks are a certainty, but is it true that latency improvements won’t match bandwidth? I suppose latency is fundamentally limited by the speed of light in a way that bandwidth is not. “Latency lags bandwidth” is an interesting paper from Dave Patterson that examines this trend in general, and argues that latency will continue to lag bandwidth in the future.

Given the enormous administrative cost of deploying a new transport protocol, one wonders if the goals identified by this paper can’t be reached more easily by incremental changes to TCP, rather than a clean-slate redesign. XCP just doesn’t seem fundamentally different enough to warrant throwing out TCP.


Filed under Paper Summaries