Tag Archives: routing

“ExOr: Opportunistic Multi-Hop Routing for Wireless Networks”

Wired networks are traditionally viewed as a unicast medium: packets are sent from A to B. Because the underlying medium can have a degree of sharing (e.g. multiple hosts connected to the same Ethernet hub), steps must be taken to avoid interference between concurrent senders (e.g. CSMA).

In a wireless network, the medium is shared: any hosts in range of a radio transmission can receive it. If the goal is to use a traditional unicast routing protocol over a wireless link, this makes interference between senders a more challenging problem (e.g. as discussed in the MACAW paper). However, rather than viewing broadcast as an inconvenience we need to work around, the broadcast nature of wireless can also be leveraged to design new protocols that would be infeasible for a wired network.

A nice example of a broadcast-oriented protocol for wireless is “ExOr: Opportunistic Multi-Hop Routing for Wireless Networks”. In a traditional routing design, a sender chooses the next hop that should receive a packet, and then unicasts the packet to that destination. In ExOr, a sender broadcasts a batch of packets to a group of nodes simultaneously. The set of batch recipients coordinate to try to ensure that each packet in the batch is forwarded onward. The recipient of a packet is only chosen after the packet is sent, which allows ExOr to “opportunistically” take advantage of links that have high loss rates. ExOr assumes that node reception probabilities are mostly independent and mostly decrease with distance — both of which are probably pretty reasonable assumptions.


The source collects a batch of packets destined for the same host, and then chooses a forwarding list for the batch. The forwarding list is a list of nodes, sorted by the expected cost of delivering packets from that node to the eventual destination node. The cost metric is similar to ETX (expected number of transmissions required to send the packet to the destination via unicast, including retransmissions); unlike ETX, it only considers the forward delivery probability. While it would be possible to include all possible recipient nodes in the forwarding list, this would increase the coordination cost among the forwarders, so ExOr only includes “likely” recipients in the list (estimated 10%+ chance of receiving a broadcast packet).
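
The forwarding-list construction can be sketched as follows. This is a toy illustration, not the paper's estimator: the node names, delivery probabilities, and cost values are invented inputs.

```python
def build_forwarding_list(candidates, min_delivery_prob=0.1):
    """Pick 'likely' recipients and order them by expected cost of
    delivery to the destination (lower cost = higher priority).

    candidates: list of (node, delivery_prob, cost_to_dest) tuples, where
    delivery_prob is the estimated chance the node hears a broadcast and
    cost_to_dest is an ETX-like forward-delivery cost metric.
    """
    likely = [c for c in candidates if c[1] >= min_delivery_prob]
    # Highest priority (lowest cost to the destination) first.
    return [node for node, _, _ in sorted(likely, key=lambda c: c[2])]

nodes = [("A", 0.9, 1.2), ("B", 0.05, 0.8), ("C", 0.4, 2.5), ("D", 0.6, 1.0)]
print(build_forwarding_list(nodes))  # ['D', 'A', 'C'] -- B is too unlikely
```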

Each packet contains a batch map, which holds the sender’s estimate of the highest priority (according to the cost metric) node to have received each packet in the batch. When a node receives a packet, it uses the packet’s batch map to update its local batch map. This means that batch map information propagates through the nodes, carrying information about packet reception from high priority nodes to lower priority nodes.

After the source broadcasts a batch, each member of the forwarding list broadcasts, ordered by descending priority (ETX value). Each node broadcasts the packets it received, along with its updated batch map. Nodes coordinate to schedule their transmissions to try to avoid interference (e.g. by estimating when each node’s expected transmission time is, according to ETX value and batch map contents). The protocol continues cycling through the nodes in priority order. At the end of each cycle, the ultimate destination broadcasts its batch map 10 times; at the beginning of each cycle, the source resends packets that weren’t received by any node (by observing batch map contents). The ExOr scheme stops when 90% of the packets in a batch have been transferred, and uses a traditional routing policy to deliver the remainder of the batch.
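
Merging batch maps amounts to a per-packet “max” over node priorities. A toy sketch, with invented node names and ranks:

```python
def merge_batch_maps(local, received, priority):
    """Merge a received batch map into the local one.

    Both maps take packet_id -> node. priority maps node -> rank
    (0 = highest priority, i.e. lowest cost to the destination). For each
    packet, keep the highest-priority node believed to have received it,
    so reception information flows from high- to low-priority nodes.
    """
    merged = dict(local)
    for pkt, node in received.items():
        if pkt not in merged or priority[node] < priority[merged[pkt]]:
            merged[pkt] = node
    return merged

priority = {"dst": 0, "n1": 1, "n2": 2, "src": 3}
local = {0: "n2", 1: "src"}
received = {0: "dst", 1: "n2", 2: "n1"}
print(merge_batch_maps(local, received, priority))
# {0: 'dst', 1: 'n2', 2: 'n1'}
```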


Overall, ExOr is a really neat idea. That said, it is at best a special-purpose solution, because of the high latency it incurs for batching and multiple transmission rounds. Similarly, the need for a split TCP proxy is pretty ugly. A complete solution to wireless routing would perhaps adaptively switch between latency-oriented and bandwidth-oriented routing techniques, depending on the nature of the traffic.

It seems arbitrary to me that ExOr works on a batch-by-batch basis—that almost seems like establishing a new TCP connection for each window’s worth of data. The amount of useful work done in each transmission round decreases, until the protocol imposes an arbitrary cutoff at 90% and switches to a traditional routing protocol. Instead, wouldn’t it be more sensible for ExOr to allow new packets to be inserted into the batch as the destination confirms the delivery of packets? This would essentially emulate the “conservation of packets” principle from the Van Jacobson paper.


Filed under Paper Summaries


One of the challenges faced in networking research and education is the difficulty of accurately recreating a network environment. Production routers are implemented as hardware devices, but designing and fabricating a new hardware design for research or teaching is very expensive. Modular software routers like Click enable research and education on routing in software, but don’t allow experimentation with hardware and physical-layer issues.

NetFPGA aims to rectify this, by providing an easily programmable hardware platform for network devices (NICs, switches, routers, etc.). Users interact with NetFPGA remotely, uploading programs that configure FPGAs on the device appropriately, and then observing the results. The initial version of NetFPGA didn’t have a CPU: instead, software to control the device runs on another computer, and accesses hardware registers remotely (using special Ethernet packets that are decoded by the NetFPGA board). The second version of NetFPGA has an optional hardware CPU.


This seems like a pretty incontrovertible “good idea.” One potential question is how the performance of an FPGA-based router compares to that of realistic production hardware — I’d suspect that NetFPGA-routers occupy a middle ground between software-only routers (easy to develop, poor performance) and production routers (difficult to program and expensive to fabricate, but good performance).


“A Policy-Aware Switching Layer for Data Centers”

Many data center networks are configured as a single Layer 2 network, with the switches configured into a spanning tree to avoid routing loops (see the SEATTLE paper for some more background on how this works). However, in practice, simply finding a Layer 2 path to the appropriate destination host is not enough: data centers also include various “middleboxes” (e.g. firewalls, load balancers, and SSL offloaders) that intercept and sometimes modify Ethernet frames in transit. To ensure that traffic is routed to middleboxes as appropriate, network administrators typically modify Layer 2 path selection. This has a number of problems:

  • Correctness: How can the administrator be assured that frames always traverse the appropriate sequence of middleboxes, even in the face of network churn and switch failures?
  • Flexibility: Changing the middlebox routing policy should be easy to do.
  • Efficiency: Frames should not be routed to middleboxes that aren’t interested in those frames.

To address these problems, “A Policy-Aware Switching Layer for Data Centers” proposes that Layer 2 routing policies be specified in a declarative language at a centralized administrative location. These policies are then compiled into a set of rules that are installed onto “policy switches” (pswitches). Pswitches make Layer 2 routing decisions by consulting a policy table, rather than the “destination MAC => switch port” mapping table used in traditional Layer 2 switches. New policies can be deployed by simply modifying the centralized configuration, which then disseminates them to the appropriate pswitches.

Policies take the form “At location X, send frames like Y to Z”, where Z is a sequence of middlebox types. Writing policies in terms of middlebox types, rather than talking about individual middlebox instances, makes it easier to reason about the correctness and how to handle failures without violating the policy. Policies are compiled into sets of rules, which take the form “For frames that arrive from hop X like Y, send to hop Z” — this should admit an efficient implementation.
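
Roughly, compiling one policy means unrolling the middlebox sequence into a chain of previous-hop rules. A sketch, where the switch, middlebox, and server names are all invented (and the real system resolves middlebox types to live instances, which this skips):

```python
def compile_policy(location, frame_pred, middlebox_seq, destination):
    """Compile a policy 'at location, send frames matching frame_pred
    through middlebox_seq' into rules of the form
    (previous_hop, frame_pred) -> next_hop, ending at the destination."""
    hops = [location] + list(middlebox_seq) + [destination]
    return [((hops[i], frame_pred), hops[i + 1]) for i in range(len(hops) - 1)]

rules = compile_policy("core-switch", "tcp dst port 80",
                       ["firewall-1", "loadbal-1"], "web-server")
for (prev_hop, pred), nxt in rules:
    print(f"from {prev_hop}, frames matching '{pred}' -> {nxt}")
```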

To avoid sending frames to middleboxes that are not interested in them, middleboxes are removed from the physical network data path—instead, pswitches explicitly forward frames to middleboxes as required. The authors argue that because data center interconnects are relatively low latency, this doesn’t have a major performance impact.


Making this idea work in practical networks requires addressing a lot of details. For example:

  • To support load balancers, the right-hand side of a policy can be a list of middlebox types. The pswitch uses a hash partitioning scheme that is careful to send all the packets belonging to either direction of a flow to the same destination.
  • To allow this work to be deployed without modifying the entire network, policy-based routing is implemented via encapsulation: a frame is encapsulated in another frame whose destination MAC is the next hop (middlebox or server) required by the policy. Frames are decapsulated before being delivered to middleboxes.
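
The encapsulation step amounts to wrapping the original frame in an outer frame addressed to the policy’s next hop; a minimal sketch, with made-up MAC addresses:

```python
def encapsulate(frame, next_hop_mac):
    """Outer frame addressed to the next middlebox/server required by the
    policy; the original frame rides along unchanged as the payload."""
    return {"dst_mac": next_hop_mac, "payload": frame}

def decapsulate(outer):
    """Restore the original frame before handing it to a middlebox."""
    return outer["payload"]

frame = {"dst_mac": "aa:bb:cc:00:00:01", "data": b"GET /"}
wrapped = encapsulate(frame, "de:ad:be:ef:00:02")
assert decapsulate(wrapped) == frame   # middlebox sees the original frame
```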


I like this paper. There are only two obvious concerns I can see: the deployment difficulties in any attempt to redesign how switches operate, and the performance costs of (1) rule table lookups rather than normal Layer 2 routing and (2) the additional latency of traversing middleboxes that are off the network data path. Compared with a traditional software-based router, their software-based pswitch implementation achieves 82% throughput and adds an additional 16% latency. Middleboxes deployed with this architecture suffer a lot more overhead: 40% throughput and twice the latency. It’s hard to say how the performance of a hardware-based pswitch implementation would compare with regular hardware-based routers.

The authors argue that the additional hop required to traverse a middlebox is acceptable, because data center networks are typically low latency. This seems backward to me: data center applications might actually want to take advantage of low latency links (e.g. see RAMCloud), and won’t like the network infrastructure adding additional latency overhead without a compelling reason.

The idea of taking middleboxes off the data path seems orthogonal to the paper’s main contribution (which is the idea of a high-level specification of Layer 2 routing invariants). It might be nice to separate these two ideas more distinctly.


“Internet Indirection Infrastructure”

“Internet Indirection Infrastructure” (i3) proposes a communication abstraction in which applications send packets to identifiers, rather than network addresses. Receivers register triggers that express their interest in packets sent to a particular identifier — the i3 infrastructure takes care of the routing. i3 is designed to be a general-purpose substrate, which can be used to build communication patterns including multicast, anycast, load balanced server selection, and host mobility.

Basic Model

To register interest in data, a receiver registers a trigger for the pair (id, addr) — this specifies that packets sent to identifier id should be routed to IP address addr. Multiple receivers for a single ID can be registered, in which case the packet is delivered to all matching trigger addresses. i3 generalizes this basic model in a few ways:

  • Inexact matching. Identifiers are divided into two pieces: an “exact match prefix” of k bits, and a suffix that is used to do longest-prefix matches. That is, a trigger matches an identifier if the first k bits match exactly, and if no other trigger has a longer prefix match against the rest of the identifier.
  • Identifier stacks. Rather than sending a packet to a single identifier, packets can instead be sent to a stack of identifiers. Similarly, triggers are generalized to mean “When a match for id is found, push a, b, … onto the stack.” i3 looks for a trigger that matches the first element of the stack; if no such trigger is found, the first element is popped from the stack. This continues until the stack is empty or a matching trigger is found; if a match is found, the trigger’s identifier stack is pushed onto the packet’s stack, and routing continues anew. i3 assumes that this process will eventually match a trigger that holds an IP address. When that occurs, the packet’s data and the rest of the identifier stack are sent to that address.
  • Private triggers. This is not an i3-level concept, but a design pattern that i3-using applications often follow. A “public” trigger is a trigger on a well-known identifier, and is used to initiate a session with a server; “private” triggers are used to communicate between pairs of end hosts. A pair of private trigger identifiers might be exchanged via a service’s public trigger identifier, and then subsequent communication will proceed via the private triggers.
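
A toy version of the trigger-matching loop might look like the following. It ignores inexact/longest-prefix matching and multiple receivers, and the trigger table contents are invented:

```python
def i3_route(packet_stack, data, triggers):
    """Route (identifier stack, data) through a toy i3 trigger table.

    triggers maps an identifier to either an IP address string or a list
    of identifiers (the trigger's own id-stack). Simplifications: exact
    matching only, and a single receiver per identifier.
    """
    stack = list(packet_stack)
    while stack:
        top = stack.pop(0)
        match = triggers.get(top)
        if match is None:
            continue                        # no trigger: discard this id
        if isinstance(match, str):
            return (match, data, stack)     # deliver data + rest of stack
        stack = list(match) + stack         # push the trigger's id-stack
    return None                             # stack emptied without a match

# A public service identifier chains to a receiver's private trigger.
triggers = {"svc": ["recv-a"], "recv-a": "10.0.0.7"}
print(i3_route(["svc"], b"hello", triggers))
# ('10.0.0.7', b'hello', [])
```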


i3 is implemented on top of a distributed hash table (DHT); the authors use the Chord DHT in the paper. The DHT provides a service to find the node associated with a key; for the purposes of i3, this allows a packet sender to find the node that holds the triggers for the first identifier in the packet’s identifier stack (the DHT lookup is only done on the “exact match prefix” of the identifier). The target i3 node then finds the matching trigger, if any, by doing a longest-prefix match against trigger identifiers. Once a match is found, the packet is routed to the receiver address or addresses via IP, or another DHT lookup occurs (if the trigger contains an ID stack, not an address).

Senders cache the IP address of the i3 node responsible for each identifier. Hence, in the common case, routing an i3 packet requires two physical network traversals: one to get from the sender to the i3 node, and then another from the i3 node to the receiver.

Example Usage

i3 provides direct support for simple multicast; anycast is supported using inexact identifier matching, and appropriate choices for the inexact suffix of the identifier. Host mobility can be supported by simply updating a receiver’s trigger. Multicast trees can be implemented by constructing routing trees in i3, using the identifier stack feature (triggers that “invoke” triggers can essentially be used to construct an arbitrary graph, and to do recursive queries over graphs).


In the paper’s experiments, the latency to route the first packet to an identifier is about 4 times that of raw IP routing, because a Chord lookup must be done (after applying some standard Chord tricks to try to pick the “closest” Chord node among the alternatives as the ring is traversed). To reduce the latency required to route subsequent packets, i3 employs two techniques:

  • Caching the address of the i3 node that owns a given identifier, as discussed above.
  • A receiver generates k random trigger identifiers, and then chooses the identifier that it can reach via IP with the lowest latency.

By having each receiver generate 15-30 random triggers with this technique, the authors are able to reduce the i3 routing latency for subsequent packets sent to an identifier to between 1.5 and 2 times that of raw IP.
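
The trigger-sampling trick is simple to sketch. Here `probe_latency_ms` stands in for measuring RTT to the node that owns an identifier; the lambda below is a made-up latency model, not anything from the paper:

```python
import random

def pick_low_latency_trigger(probe_latency_ms, k=16, id_bits=32):
    """Generate k random trigger identifiers and keep the one whose
    responsible i3 node responds fastest, per the caller-supplied
    probe function."""
    candidates = [random.getrandbits(id_bits) for _ in range(k)]
    return min(candidates, key=probe_latency_ms)

random.seed(1)
best = pick_low_latency_trigger(lambda ident: ident % 100)  # toy latency model
print(best % 100)   # the smallest modeled latency among the 16 candidates
```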


Overall, I liked this paper: it is an elegant idea that is explained well. A cynical view of this paper is that it isn’t all that innovative or “deep”: it is just a thin logic layer on top of a DHT. i3 is basically a generalization of the name-to-value lookup service provided by the DHT.

Both performance and security are serious issues with this design, and the paper didn’t really convince me that their approach addresses these problems.

I would have liked to see a performance comparison between IP-level multicast and anycast with implementations built on i3.

It would be interesting to compare i3 with group communication systems (GCS) that provide totally-ordered broadcast messages. What features could you add to i3 to make it easier to write reliable distributed systems?


“Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications”

This paper describes the design of Chord, one of the earlier and most popular academic DHT systems (according to one measure, the Chord paper is the most-cited CS paper from 2001, along with a raft of other DHT papers). Chord’s design focuses on simplicity and offering provable performance and correctness properties. Chord provides a single service: given a key, it returns the address of the node storing the key. Other practical issues, like replication, caching, naming, and cryptographic authentication should be provided by an application level that runs on top of Chord.

Basic Protocol

In Chord, nodes and keys are arranged in a circle (the “identifier circle”). Each node maintains the address of its successor; maintaining the correctness of these successor pointers is the critical correctness property of the protocol. Each key k is owned by the first node whose identifier (e.g. hash of IP address) follows k in the identifier circle. Therefore, keys can be located by simple sequential search; the search terminates when it reaches the node that ought to own the key.

To improve search performance, Chord also maintains a finger table. The finger table is essentially a skip list: if there are N nodes in the DHT, each finger table has O(log N) entries, holding the address of a node 1/2, 1/4, 1/8, … of the way “forward” from the current node in the identifier circle. This essentially allows the search procedure to perform a binary search of the identifier circle, to quickly approach the “neighborhood” of the target key. Note that this means that each node maintains more information about its local neighborhood than about distant nodes, which makes sense.
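
The finger-table structure is easy to show concretely. This sketch cheats by using a global sorted list of node identifiers (real Chord computes successors via distributed lookups), and the node IDs below are invented:

```python
import hashlib

def chord_id(key, m=8):
    """Hash a key onto a 2**m identifier circle."""
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (2 ** m)

def successor(nodes, ident):
    """The first node at or after ident on the circle (with wraparound).
    nodes is a sorted list of node identifiers."""
    for n in nodes:
        if n >= ident:
            return n
    return nodes[0]

def finger_table(n, nodes, m=8):
    """Node n's fingers: successor(n + 2**i) for i = 0..m-1. Early fingers
    land nearby, so each node knows more about its neighborhood than
    about distant parts of the ring."""
    return [successor(nodes, (n + 2 ** i) % (2 ** m)) for i in range(m)]

ring = sorted([5, 60, 110, 200, 240])
print(finger_table(5, ring))
# [60, 60, 60, 60, 60, 60, 110, 200]
```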

Joining A Chord Group

To join a Chord group, a joining node begins by computing its predecessor and finger tables. It then joins the group by notifying its successor node, which includes the new node as its predecessor. Each node runs a “stabilization” protocol to maintain the correctness of the successor pointers: each node n periodically asks whether the predecessor of its successor node is n; if not, a new node has joined the group, so n sets its successor to the new intermediate node, and notifies that node to make n the node’s predecessor. Once the successor pointers are updated, the new node has joined the Chord group, and can be accessed by subsequent lookup queries. In the background, keys that are now owned by n will be transferred from n’s successor, and finger tables will be updated lazily: updating fingers promptly is not actually important for good performance, because fingers are used only to quickly reach the right “neighborhood.”
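
The join-and-stabilize dance can be sketched with a toy in-memory ring. This is a heavy simplification of the paper’s pseudocode (no RPCs, no concurrent joins, invented node IDs):

```python
class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = self

def in_between(x, a, b):
    """True if x lies strictly between a and b on the identifier circle."""
    if a < b:
        return a < x < b
    return x > a or x < b

def stabilize(n):
    """One stabilization round: if a new node has slipped in between n
    and its successor, adopt it; then notify the successor so it can
    update its predecessor pointer."""
    x = n.successor.predecessor
    if n.successor is not n and in_between(x.id, n.id, n.successor.id):
        n.successor = x
    s = n.successor
    if s.predecessor is s or in_between(n.id, s.predecessor.id, s.id):
        s.predecessor = n

A, B, C = Node(10), Node(30), Node(50)
A.successor, C.predecessor = C, A   # existing ring: A -> C
B.successor = C                     # B joins by pointing at its successor
stabilize(B)                        # B notifies C; C adopts B as predecessor
stabilize(A)                        # A notices B and routes through it
assert A.successor is B and B.successor is C and B.predecessor is A
```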

Handling Node Failure

If a node fails, the data on the node may be lost. This is not Chord’s concern, because it delegates responsibility for replication to the application. Instead, Chord must ensure that the Chord ring is not “broken”: lookup queries can continue, despite the fact that the predecessor of the failed node has an invalid successor node.

Chord achieves that by having each node maintain a list of successors, rather than a single one. If a node notices that its successor has failed, it simply tries the next node in the list. The paper argues that since the failure probabilities of elements of the successor list can be considered to be independent, using a sufficiently large successor list provides arbitrary protection against node failures. The stabilization protocol above is extended to apply to the entire list of successors, not just the first one.
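
The independence argument is easy to quantify: the ring only breaks at a node when its entire successor list fails at once. The numbers below are illustrative, not from the paper:

```python
# Probability that all r entries of a successor list fail simultaneously,
# under the paper's assumption of independent failures.
p = 0.1   # per-node failure probability (illustrative)
r = 5     # successor list length
print(p ** r)   # ~1e-05: even a short list makes a broken ring unlikely
```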


Overall, this is a great paper, and it deserves its reputation as such. It also shows the benefit of combining good systems work with a top-flight theoretician (Karger).

Despite the paper’s formal tone, I was confused by their talk of “correctness”: they never actually define what their correctness criteria are. Their notion of “correctness” is apparently that the Chord ring remains live and can continue answering queries, even if the query answers are incorrect or undetected ring partitions occur. For example, they observe that when a join occurs and before stabilization finishes,

the nodes in the affected region [may] have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, and the lookup may fail. The higher-layer software using Chord will notice that the desired data was not found, and has the option of retrying the lookup after a pause.

That is quite disconcerting, and would be an impediment to using Chord-like systems as a backend for a reliable storage service. Importantly, many applications would like to distinguish between “lookup failed because of stabilization problems” and “lookup failed because the key was definitely not in the Chord ring.” As presented, the protocol does not appear to allow this distinction.

Another troubling aspect of the paper is the authors’ attitude toward network partitions, which they regard as a “pathological case.” I thought this was surprising, because (short-lived) network partitions are not that rare. In fact, one of the explicit benefits of loose-consistency systems like DHTs is their ability to tolerate network partitions gracefully. They sketch a few ideas on how to detect and heal partitions, but don’t address this problem in any real depth.

The paper doesn’t talk about an obvious optimization: making “distance” in the Chord ring correlate with distance within the underlying physical network, to reduce communication costs. This is explored in depth by subsequent work.


“A Path Metric for Multi-Hop Wireless Routing”

“A High-Throughput Path Metric for Multi-Hop Wireless Routing” presents the expected transmission count (ETX) metric for finding high-throughput paths on multi-hop wireless networks. By considering link loss rates, asymmetrical loss ratios between the directions of a link, and interference, the paper argues that ETX substantially outperforms routing protocols based on finding the shortest path (fewest hops) in an ad-hoc wireless network.


There are several shortcomings with using minimum hop count for routing in ad-hoc wireless networks:

  • Shortest-path routing protocols essentially assume that either a link works (probe packets are delivered successfully), or it doesn’t. While wireless networks can be roughly classified in this manner, wireless links have a much richer space of performance and loss characteristics than “it works” or “it doesn’t”.
  • Minimizing the hop count typically means maximizing the physical distance traveled by each hop, which leads to selecting links with weak signals and high loss rates. As the related RoofNet paper discusses, it is often preferable to utilize a path of short, high-quality links over a single long-distance, low-quality link.
  • In a dense ad-hoc network, there may be many different routes with the same number of hops. Deciding between routes on the basis of hop count is therefore not sufficient in any case.

There are problems with two naive alternatives to min-hop-count:

  • An improved routing metric could simply choose the path by maximizing the product of the per-link delivery ratios. This is inadequate, however, because it would result in preferring a two-hop route with perfect delivery to a one-hop route with a low loss rate — the latter route is actually better in many cases.
  • Choosing links on the basis of end-to-end delay is problematic, because of oscillation: once a route is utilized for traffic, its delay will often increase. This will result in the routing metric choosing another path, whose latency will then increase, causing it to switch back to the first path, and so on.
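
The first bullet’s example is easy to make concrete. In this sketch each link is described by a single delivery ratio (ignoring the reverse-direction ACK loss that ETX actually accounts for), and the numbers are invented:

```python
def product_metric(route):
    """Naive metric: product of per-link delivery ratios (higher = better)."""
    p = 1.0
    for ratio in route:
        p *= ratio
    return p

def etx(route):
    """ETX-style metric: expected total transmissions (lower = better)."""
    return sum(1.0 / ratio for ratio in route)

one_hop_lossy = [0.6]           # single link, 60% delivery
two_hop_perfect = [1.0, 1.0]    # two flawless links

print(product_metric(one_hop_lossy), product_metric(two_hop_perfect))  # 0.6 vs 1.0
print(etx(one_hop_lossy), etx(two_hop_perfect))                        # ~1.67 vs 2.0
```

The product metric prefers the two-hop route, but the lossy one-hop route needs fewer expected transmissions overall, which is exactly what ETX captures.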

ETX Metric Design

The ETX metric for a link is the expected number of data transmissions required to send a packet of a certain size over the link. The ETX metric for a path is the sum of the per-link ETX metrics. ETX calculates per-link delivery ratios for both directions of a link (transmit and receive), to take into account the asymmetrical nature of wireless communication. Delivery ratios are calculated by periodic probes of a fixed size (sent once per second by default via broadcast); each node calculates per-link delivery ratios based on the most recent 10 seconds of activity.
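
Concretely, the per-link metric combines the two directions, since a transmission only succeeds if the data frame and its link-layer ACK both arrive. A sketch with invented ratios:

```python
def link_etx(df, dr):
    """ETX of one link: expected transmissions until a data packet and its
    ACK both get through. df = forward delivery ratio (data), dr = reverse
    delivery ratio (ACK); each attempt succeeds with probability df * dr."""
    return 1.0 / (df * dr)

def path_etx(links):
    """Path ETX is the sum of the per-link ETX values."""
    return sum(link_etx(df, dr) for df, dr in links)

# An asymmetric link: data usually gets through, ACKs often don't.
print(link_etx(0.9, 0.5))                 # ~2.22: loss in either direction hurts
print(path_etx([(0.9, 0.9), (1.0, 0.8)]))
```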

ETX assumes that radios have a fixed transmit speed, and that congestion is not a factor. Neither assumption is very realistic: variable speed radios are now pervasive, and congestion might well be a factor in ad-hoc networks with sizable user bases. ETX also assumes that delivery rates are the same regardless of packet size, which is probably inaccurate (e.g. ACKs are small and take less time to transfer).


I was surprised that the paper spent so much time comparing ETX with min-hop-count metrics: it seems quite straightforward that min-hop-count would fare pretty poorly in an ad-hoc wireless network. In that sense, performance comparisons between ETX and min-hop-count are also not that informative: of course ETX will beat min-hop-count, because the latter is simply a terrible choice for this environment. It seems surprising that min-hop-count is truly the best example of prior work they could measure against — why not compare with the naive “maximize product of per-link delivery ratio” scheme, for example?

I didn’t get that much out of this paper: the RoofNet paper already summarizes an improved version of the ETX metric. Because the ETX evaluation compares it with such a pessimal alternative (min-hop-count), I didn’t get much out of the evaluation in this paper.



“Congestion Control for High Bandwidth-Delay Product Networks”

Most of the papers we’ve read on improving congestion behavior have an “incremental” flavor: they begin with traditional TCP and router behavior, and then suggest a conservative change that addresses some particular problem (e.g. slow start, better RTO estimation, fast retransmit, fair queueing, etc.). In contrast, this paper takes a “clean slate” approach:

Our initial objective is to step back and rethink Internet congestion control without caring about backward compatibility or deployment. If we were to build a new congestion control architecture from scratch, what might it look like?

Efficiency and fairness

In TCP, efficiency and fairness are coupled, because TCP uses a single procedure (“additive increase, multiplicative decrease”) to try to achieve both goals. The paper argues that these goals should really be treated separately: efficiency describes how close the link’s aggregate throughput approaches the link capacity, whereas fairness concerns how the current link throughput is distributed among the flows passing through a link.

Therefore, XCP separates these two goals: an efficiency controller is responsible for deciding whether to increase or decrease the link’s aggregate throughput, and a fairness controller is tasked with translating that aggregate feedback into per-flow feedback in order to achieve or maintain fairness. Separating these two components makes it easier to analyze their behavior (e.g. the efficiency controller doesn’t need to account for the number of flows), and allows new component designs to be considered in isolation.

The paper proposes using a multiplicative-increase, multiplicative-decrease (MIMD) scheme for the efficiency controller, and a traditional additive-increase, multiplicative-decrease (AIMD) scheme for the fairness controller. The paper argues that using MIMD for efficiency is important, because the AIMD policy traditionally used by TCP takes a long time to reach full capacity on links with a high bandwidth-delay product.
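
The ramp-up gap is easy to see with a toy calculation: to reach a large window from a cold start, additive increase needs thousands of round trips, while multiplicative increase needs a handful. The numbers are illustrative, not from the paper:

```python
def rounds_additive(start, capacity, step=1):
    """Round trips for additive increase (cwnd += step) to reach capacity."""
    cwnd, rounds = start, 0
    while cwnd < capacity:
        cwnd += step
        rounds += 1
    return rounds

def rounds_multiplicative(start, capacity, factor=2):
    """Round trips for multiplicative increase (cwnd *= factor)."""
    cwnd, rounds = start, 0
    while cwnd < capacity:
        cwnd *= factor
        rounds += 1
    return rounds

# A high bandwidth-delay product path: say 10,000 packets in flight.
print(rounds_additive(1, 10_000))        # 9999 RTTs of additive increase
print(rounds_multiplicative(1, 10_000))  # 14 RTTs of doubling
```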

Explicit congestion signals

XCP moves away from packet drops as an implicit signal of network conditions, for a few pretty convincing reasons:

  • Congestion is not the only source of loss, so clients cannot distinguish between losses due to errors and losses due to congestion.
  • Detecting packet loss requires waiting for the RTO timeout, duplicate ACKs, or similar — the extra delay incurred makes clients respond more slowly to congestion signals, which makes the network more unstable. This also means clients must be more conservative when probing for network capacity, because the penalty for “overshooting” is more severe.
  • Packet loss is a binary signal; an explicit congestion notification can carry more information about the state of the network.

That makes a lot of sense, but it seems quite obvious and not really novel: ECN for IP has been around for a long time, and DECNet did something very similar years before this paper was written. I would guess the slow deployment of ECN is mostly related to pragmatism (e.g. support for ECN in IP stacks and routers), which developing a completely new transport protocol would only exacerbate.


In XCP, senders attach a “congestion header” to each outgoing packet. This contains the sender’s current congestion window size, RTT estimate, and the sender’s desired window size increase (“H_feedback”). H_feedback is modified by XCP routers, and the updated value is returned in the ACK for a sent packet. The feedback value instructs senders to either increase or decrease their congestion windows.

In an XCP router, the efficiency controller first tries to adjust the aggregate link throughput to approach full utilization. The output of the efficiency controller is then fed to the fairness controller, which translates the global increase/decrease signal into feedback for individual senders, in such a way that fairness is increased or maintained.
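
Roughly, the efficiency controller computes an aggregate feedback value each control interval from the spare bandwidth and the persistent queue. A sketch (the constants are the stability values suggested in the paper; the traffic numbers below are invented):

```python
def aggregate_feedback(capacity, input_rate, queue, avg_rtt,
                       alpha=0.4, beta=0.226):
    """XCP-style efficiency controller: desired change in aggregate
    traffic over the next control interval. Spare bandwidth pulls the
    aggregate up; a persistent queue pushes it down."""
    spare = capacity - input_rate
    return alpha * avg_rtt * spare - beta * queue

# Underutilized link with an empty queue: ask senders for more traffic.
print(aggregate_feedback(capacity=100e6, input_rate=60e6,
                         queue=0, avg_rtt=0.08))   # positive feedback
# Saturated link with a standing queue: ask senders to back off.
print(aggregate_feedback(capacity=100e6, input_rate=100e6,
                         queue=5e4, avg_rtt=0.08))  # negative feedback
```

The fairness controller then divides this aggregate up among flows (AIMD-style) via the per-packet H_feedback field.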


The feasibility of separating fairness and efficiency hinges on whether these two goals really are independent: in practice, can the efficiency controller be replaced without any consideration of the fairness controller, and vice versa? I was initially skeptical that such a clean separation would be possible, but I think the paper’s scheme is pretty compelling.

Part of the paper’s motivation is the idea that networks with high bandwidth-delay will become increasingly common in the future. If we ignore special-cases like satellite links, then high-bandwidth networks are a certainty, but is it true that latency improvements won’t match bandwidth? I suppose latency is fundamentally limited by the speed of light in a way that bandwidth is not. “Latency lags bandwidth” is an interesting paper from Dave Patterson that examines this trend in general, and argues that latency will continue to lag bandwidth in the future.

Given the enormous administrative cost of deploying a new transport protocol, one wonders if the goals identified by this paper can’t be reached more easily by incremental changes to TCP, rather than a clean-slate redesign. XCP just doesn’t seem fundamentally different enough to warrant throwing out TCP.


“Random Early Detection Gateways for Congestion Avoidance”

Like earlier work on fair queuing, Random Early Detection (RED) argues that the router must play an important role in achieving effective congestion control. However, the RED approach is markedly different from fair queuing. The basic idea in RED is to start taking corrective action as soon as approaching congestion is detected, rather than waiting until the queue buffers are completely full.

RED divides this task into two separate algorithms:

  • Imminent congestion is detected by calculating the average queue size. This is done with a low-pass filter with an exponential moving average; the idea being to distinguish between periods of transient buffering, caused by bursty traffic, and longer-term buffering caused by imminent congestion. RED defines two thresholds for queue size, a minimum size and a maximum size. In essence, the router can be in three states: “no congestion” (below minimum threshold), “severe congestion” (above maximum threshold), and “moderate congestion” (between minimum and maximum thresholds).
  • Once congestion has been detected, RED must then decide on corrective action. To reduce congestion, RED “marks” a packet to inform senders to reduce their transmission rate. Marking can be implemented either by dropping the packet (as in traditional TCP), or by setting a bit in the packet header to inform the sender that congestion is occurring. When the queue size is below the minimum threshold, no packets need to be marked. When the queue size is above the maximum, every incoming packet is marked (or dropped if necessary). When the queue size is between these two thresholds, each incoming packet is marked with a probability that is a function of the current queue size (that is, as the queue size increases, the probability that packets will be marked increases).
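A rough sketch of these two pieces together might look like the following. The parameter values are illustrative only, and this simplifies the paper’s algorithm: RED also adjusts the marking probability based on the count of packets since the last mark, and handles idle periods specially, both omitted here.

```python
import random

# Illustrative parameters; RED leaves these to the operator.
MIN_TH = 5.0      # minimum threshold (packets)
MAX_TH = 15.0     # maximum threshold (packets)
MAX_P = 0.02      # marking probability as avg approaches MAX_TH
WEIGHT = 0.002    # low-pass filter weight

avg = 0.0  # exponentially weighted moving average of queue size

def on_packet_arrival(queue_len: int) -> bool:
    """Return True if the arriving packet should be marked (or dropped)."""
    global avg
    # Low-pass filter: a transient burst barely moves the average,
    # so only sustained queue growth triggers marking.
    avg = (1 - WEIGHT) * avg + WEIGHT * queue_len
    if avg < MIN_TH:
        return False          # no congestion: mark nothing
    if avg >= MAX_TH:
        return True           # severe congestion: mark every packet
    # Moderate congestion: mark with probability rising linearly
    # from 0 at MIN_TH to MAX_P at MAX_TH.
    p = MAX_P * (avg - MIN_TH) / (MAX_TH - MIN_TH)
    return random.random() < p
```

Because the marking decision is made per arriving packet, a connection sending more packets collects proportionally more marks, which is how RED biases feedback toward high-bandwidth senders without keeping per-flow state.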

This scheme has several important properties. First, it responds to approaching congestion more promptly than a traditional “Tail Drop” policy, which gives senders more time to moderate their transmission rates. While it seems counterintuitive to mark or drop packets when buffer space is still available, doing so reduces the risk of catastrophic congestion collapse, because senders need time to reduce their transmission rates. Second, high-bandwidth senders are more likely to have their packets marked than low-bandwidth senders, because they account for a larger share of arriving packets; this contributes to fair usage of the link capacity.

Finally, RED does not have a bias against “bursty” traffic, unlike Tail Drop or Random Drop. Tail Drop is biased against bursty traffic because queue overflows are likely when a cluster of packets from the same connection arrives at about the same time. While the paper confirms its claims about bursty traffic with simulation results, I didn’t get a good intuition for why RED does better than Random Drop on bursty flows; hopefully we can discuss this in class.


The assumptions behind this paper are interesting, and quite different from the assumptions made by the fair queueing work. For example, the papers on fair queueing assume that controlling misbehaving users ought to be a responsibility of the network infrastructure. In this paper, handling misbehaving users is basically an afterthought: RED gives administrators the tools to identify misbehaving users, and vaguely suggests that this information might be used by “higher policy layers” to modify those users’ bandwidth allocations.

In a similar vein, I thought it was remarkable that the paper suggests that FIFO scheduling is “useful for sharing delay among connections, reducing delay for a particular connection during its periods of burstiness.” This seems to be describing exactly the same behavior that the fair queuing papers talk about: FIFO scheduling yields increased latency for low-bandwidth interactive flows like Telnet that share a link with a high-bandwidth latency-tolerant flow like file transfer. The FQ papers view this as undesirable (in fact, they go further and give low-bandwidth sessions an extra “promptness” bonus), whereas this paper views this behavior as desirable. It would be interesting to discuss what gives rise to this difference of opinion.

Another difference in assumptions is the importance of ensuring fair behavior for bursty traffic. I didn’t think this was very well-motivated by the paper: is bursty traffic actually a problem in practice, and why can’t the sender space out its traffic more effectively? The example of file transfer with a small window on a connection with a high bandwidth-delay product seems like it could be resolved by better sender-side algorithms, for example (i.e. why is the window small if the bandwidth is large?). Perhaps this points toward the well-known problems with TCP’s additive increase for connections with a high RTT, but I didn’t see a convincing argument in the paper.

It seems like this scheme would be much more effective when marking is implemented with explicit sender notification rather than packet drops. Packet drops are both more severe (the sender responds by drastically curtailing its transmission rate) and slower to detect: the sender might need to wait for the full retransmission timeout (RTO) to expire, for example.

Filed under Paper Summaries

“Understanding BGP Misconfiguration”

This paper presents an empirical study of BGP misconfigurations, based on sampling the BGP traffic seen by 19 different ASs. The paper focuses on two types of misconfigurations:

  1. origin misconfigurations occur when a mistaken route origin is inserted (which might lead to directing traffic to the wrong AS), and
  2. export misconfigurations occur when a router mistakenly exports a route (which might lead to directing more traffic to an AS than intended).

The authors approximate misconfigurations by looking for BGP updates that are revoked quickly (typically because the error is discovered by an operator). To differentiate between misconfigurations and short-lived legitimate changes (e.g. for load balancing), the authors conducted an email survey of the network operators involved. To detect export misconfigurations, they applied a heuristic (Gao’s algorithm) to infer the relationships between ASs — once inter-AS relationships are known, export misconfigurations are clear.
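The first step of this methodology, flagging quickly revoked announcements, could be sketched as follows. The time threshold and record format here are assumptions for illustration; the paper’s actual pipeline (and its treatment of update streams from 19 vantage points) is considerably more involved.

```python
from dataclasses import dataclass

# Assumed threshold: treat an announcement withdrawn within this window
# as a candidate misconfiguration (i.e. a slip an operator corrected).
SHORT_LIVED_SECS = 24 * 60 * 60

@dataclass
class RouteEvent:
    prefix: str
    origin_as: int
    announced_at: float   # unix timestamps
    withdrawn_at: float

def candidate_misconfigs(events):
    """Flag announcements that were revoked quickly.

    These are only *candidates*: short-lived but legitimate changes
    (e.g. load balancing) must still be ruled out, which the paper
    does by surveying the operators involved.
    """
    return [e for e in events
            if e.withdrawn_at - e.announced_at < SHORT_LIVED_SECS]
```

The key design choice is conservatism: anything that survives longer than the window is assumed legitimate, so the reported misconfiguration rates are lower bounds.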

The authors document a significant rate of misconfiguration: even though their analysis of misconfigurations is conservative, they found that almost 75% of new route announcements are the result of misconfiguration. Despite the frequent occurrence of misconfigurations, the authors found that these problems had relatively little impact on connectivity: for example, they found that about 25 BGP misconfigurations per day resulted in connectivity loss, compared with 1000 instances per day of connectivity loss due to failures. I think this is unsurprising: the problem with BGP misconfiguration is not that it results in routine connectivity loss, but rather that it enables occasional catastrophic global connectivity problems (e.g. the Pakistan Telecom hijack of YouTube’s prefix).

When the paper mentions that characterizing a BGP misconfiguration is hard to do precisely, one wonders whether it is possible to differentiate automatically between valid and invalid route updates (e.g. via machine learning). Would it be possible to train a model to predict the likelihood that a proposed route update is invalid, and to estimate the “risk” to network connectivity that would be incurred by applying the update? At the very least, such a tool would be useful to help operators identify the possible cause of a routing failure (that is, a recent high-risk route update that was applied); routers might even require operator intervention before agreeing to apply high-risk updates. Some searching reveals papers by Wu and Feng, and by Li et al., that apply ML to detecting abnormal or high-risk BGP events.

I liked how the paper made a number of sensible suggestions to reduce the rate of BGP misconfigurations. It is remarkable that many of their suggestions don’t require significant technological changes: by tweaking the user interfaces used by network operators, the frequency of “slips” could be significantly reduced. However, there is a clear difference in rigor between the two halves of the paper: while the authors diagnose misconfigurations through careful analysis, their work to classify the causes of those misconfigurations seems cursory in comparison: they emailed network operators, trusted the answers completely, and then generalized the survey results to operators who didn’t reply. The work could be improved by a more careful study of the human factors that contribute to misconfigurations.

Also, I wonder whether the same factors that contribute to routine BGP misconfigurations also contribute to the occasional catastrophic connectivity failures. The methodology here reminds me a bit of studying minor operator errors at a nuclear reactor, and assuming that preventing minor errors will also contribute to preventing catastrophic meltdowns. That is probably mostly true, but this work would be well-complemented by an empirical study of some notable recent examples of global routing failures.


Filed under Paper Summaries

“Interdomain Internet Routing”

This lecture provides an introduction to the global interdomain routing system. The reading discusses BGP4, the current protocol for propagating information on how packets should be routed among the different autonomous systems (ASs) that make up the Internet. There are two flavors of BGP:

  1. External BGP (eBGP) is used to propagate inter-AS routing information from one AS to another.
  2. Internal BGP (iBGP) is used to propagate inter-AS routing information within a single AS; that is, it propagates information that an AS has learned via eBGP to the other routers within the AS. This is distinct from an AS-local route propagation protocol (an interior gateway protocol, or IGP), which just describes how to route packets within a single AS.

After reviewing the goals of Internet routing and the business practices followed by network operators, the lecture discusses the BGP protocol, which is quite straightforward. BGP configuration within a single AS is also discussed: for large ASs with thousands of routers, route updates must be propagated carefully to achieve both scalability and consistency. From a database perspective, propagating BGP configuration updates within a single AS doesn’t seem like a fundamentally hard problem, so it was interesting to see how it was treated by the reading.
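The way business relationships drive routing can be seen in the core of BGP’s per-prefix route selection, sketched below. This is a simplification: real BGP breaks further ties on MED, eBGP-over-iBGP, IGP cost, and router ID, and the class names and fields here are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    as_path: list          # e.g. [7018, 701] -- ASs toward the origin
    local_pref: int = 100  # set by import policy; typically
                           # customer > peer > provider routes

def best_route(routes):
    """Prefer the highest LOCAL_PREF (business policy), and only then
    the shortest AS path (a crude performance proxy)."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))
```

The ordering is the economically interesting part: because LOCAL_PREF is consulted before path length, an ISP will happily send traffic over a longer path through a paying customer rather than a shorter path through a provider it must pay.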

I liked how the lecture discussed the economic motivations of the participants in the global routing system. It would be interesting to apply game theory to the empirical behavior of ISPs, to understand how closely their behavior approaches “optimal”: how successful are ISPs at minimizing the cost incurred by their routing choices, and maximizing the value they provide to their customers? Could the efficiency, cost-effectiveness, or reliability of the routing system be improved if a single authority was designated, rather than relying on competition? If theory suggests that the market participants are acting sub-optimally, is there an opportunity for arbitrage, essentially acting as a “virtual ISP”? Alternatively, what if there were an auction-oriented approach to matching supply (packet routing capacity) with demand (ISPs and customers that need to route packets from A to B)?

I was really surprised by the security and robustness properties of the currently-deployed BGP architecture. Given the massive economic value of the Internet and the obvious presence of malicious actors, it seems amazing that, for example, there is no authenticated mapping from IP prefixes to autonomous systems or real-world identities. In class, I’d be interested to discuss what factors have prevented more widespread deployment of S-BGP and related technologies, which the reading only alludes to briefly.


Filed under Paper Summaries