Scripts for writing papers

One of the nice things about LaTeX is that the definitive version of the document is stored as plain text. This simplifies version control and avoids the need to use a complicated editor with an opaque binary file format. It also means that writing scripts to check for writing mistakes is relatively easy. Of course, detecting grammatical errors in general is very challenging, but certain classes of mistakes are easy to check for. Matt Might has posted a few scripts which I’ve found quite useful.

I had cause to write another script to check for consistent use of hyphens. For example, both “non-deterministic” and “nondeterministic” are acceptable, but a single document should pick one variant and use it consistently. You can find the script here; suggestions for improvement (rather, patches) are welcome.

What other common mistakes are amenable to a simple script? Checking for consistent use of the Oxford comma should be straightforward, for example. There are also scripts like chktex that look for mistakes in LaTeX usage (although I’ve personally found chktex to be too noisy to be useful).

Naturally, there are limits on what you can do without parsing natural language. After The Deadline claims to provide some NLP-based grammar analysis, but I haven’t used it personally.


  • atdtool, a command-line interface to After The Deadline
  • diction, a GNU tool to check for common writing mistakes
  • LanguageTool, another open source style and grammar checker
  • TextLint, a writing style checker


Filed under Uncategorized

Too Good To Be Believed

In the (excellent) Sinfonia SOSP ’07 paper, the authors compare a group communication system (GCS) built using Sinfonia with the open source Spread GCS. Although I like the Sinfonia paper a lot, I thought this evaluation was actually detrimental to the paper. The authors present several graphs comparing the performance of SinfoniaGCS with Spread, such as this one:

Performance comparison of Spread and SinfoniaGCS

Clearly, SinfoniaGCS vastly outperforms Spread in this configuration. At first glance, this might seem like a great experiment: the authors have demonstrated that you can use Sinfonia to build a high-performance GCS, right?

To me, including this evaluation in the paper is not helpful, because it raises more questions than it answers. There are two possibilities:

  1. Spread’s performance is truly terrible. In that case, what are we to learn from comparing the performance of SinfoniaGCS with a worst-in-class alternative?
  2. Spread is misconfigured. The paper notes that Spread wasn’t configured to use IP broadcast or multicast, and that SinfoniaGCS was allowed to batch together 128 messages at a time; either change could have a huge performance impact. Again, there is little to learn from an apples-to-oranges comparison between a carefully tuned Sinfonia system and a misconfigured Spread system.

A convincing performance study would show that by using the Sinfonia infrastructure, one can build a GCS that approaches the performance of an optimized GCS written from scratch (and perhaps that using Sinfonia leads to a smaller/simpler GCS implementation). Alternatively, if using Sinfonia really does allow dramatically better performance, the reasons for the performance difference should be explored: Sinfonia is not magic, and if the authors could have identified some reasons for why traditional GCS designs perform poorly on modern datacenter networks, that would be an interesting result.

Instead, the authors merely speculate that using IP broadcast/multicast would improve Spread performance and leave it at that. Unfortunately, the result is a performance study from which we can learn very little.

Leave a comment

Filed under Research Notes

Benchmarking Dataflow Graphs in Ruby

Our group at UC Berkeley recently released Bloom, a new programming language for distributed computing. Specifically, we released an initial alpha of “Bud,” which is a Ruby DSL that lets you embed (distributed) declarative rules inside Ruby classes.

The current Bud prototype has a pretty naive evaluation scheme (deliberately); so I’ve been idly thinking about how to improve its performance. The standard way to evaluate Datalog rules is fairly similar to how a typical relational database evaluates queries: the system produces a dataflow graph (“query plan”), which consists of nodes (e.g., project, join or scan operators) connected by edges. The dataflow graph typically starts with the input relations (e.g., base tables, EDB for Datalog) and computes the derived relations (e.g., the user’s query in the case of a database, the IDB relations for Datalog). There is a lot more to building an efficient Datalog evaluator, but I’ll leave that for future posts.

To get good performance for Bloom programs, we’ll likely want to generate a dataflow graph. So how should we evaluate this graph? Two options come to mind:

  1. Represent nodes in the graph as Ruby objects and push tuples through the graph by evaluating Ruby.
  2. Represent nodes in the graph using C code and connect the Ruby query planner to the C runtime in some fashion (e.g., have the planner generate a description of the desired dataflow and parse that description in a C module, have the Ruby code generate C source code directly, or use something like LLVM to generate machine code at runtime).

Naturally the first option would be easier, but how does it perform? I wrote a quick and dirty microbenchmark to get a rough idea. The benchmark constructs a dataflow graph of 7 nodes: 4 “predicate” nodes that check whether the tuple matches a constant, 2 “join” nodes that probe a 6,000 element hash table (modeling half of a symmetric hash join), and 1 “sink” node that appends the tuple to an array (modeling storage into an in-memory relation). I ran 1 million tuples through the graph, and repeated this 10 times (dropping the first run to allow caches to warmup). I ran the benchmark on several Ruby implementations:

  • MRI 1.8.7-p334
  • MRI 1.9.2-p180
  • JRuby 1.6.0
  • Rubinius git checkout from April 11th (version string “1.2.4dev (1.8.7 73390817 yyyy-mm-dd JI)”)


The average time taken to route 1 million tuples through the dataflow graph, for each Ruby implementation:

I was surprised to see how much faster MRI 1.9 is than MRI 1.8. It is also interesting that JRuby does pretty decently.

So how should we interpret these results? Well, let’s compare the performance of the Ruby implementations with a more-or-less equivalent dataflow microbenchmark written in C:


After adding the C results, we can put the Ruby benchmark results into perspective: although recent Ruby implementations are considerably faster than MRI 1.8, they are still much slower (> 10x) than C for this particular microbenchmark. Is that surprising? Not really — Ruby is certainly not an ideal choice for high-performance, data-intensive computing. Despite recent progress in building faster Ruby implementations, the performance gap (for this microbenchmark) remains considerable.

Obviously, take these results with a grain of salt: many more factors would contribute to the performance of a complete system than a trivial microbenchmark like this. There are also factors that are difficult to measure in a small microbenchmark: for example, using a pure Ruby implementation would force the GC to manage the state stored in Bloom collections, which would result in additional performance overhead.

Additional Details

  • Benchmarks performed on a late 2010 Macbook Air (2.13Ghz Core 2), running OSX 10.6.7. The C benchmark was compiled with Apple’s GCC (4.2.1) at optimization level “-O2”, and was (dynamically) linked against APR 1.4.2.
  • Ruby source code
  • C source code
  • Raw benchmark data

1 Comment

Filed under Research Notes

“Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks”

This paper focuses on distinguishing human-generated activity from bot-generated activity. Then, human-generated activity can be given preferential treatment (e.g. favorable routing of traffic, not being treated as spam). Their measure for distinguishing human-generated actions from machine-generated actions is pretty coarse and imprecise: an action is human-generated if it is preceded by keyboard or mouse input within a certain amount of time.

To implement this scheme, they go into considerable (exhaustive) detail about how to use the Trusted Computing Module (TPM) to build a trusted path between the physical input devices (keyboard, mouse) and a small piece of software called the attestor. To certify an action as human-generated, applications ask the attester for an attestation, passing a hash of the content to the attested for. If there has been user input within a predefined period, the attester returns a cryptographically-signed token that can be attached to the user action. When an upstream service receives the user action (e.g. HTTP request, email), it can verify the attestation by hashing the content of the action, and checking the cryptographic signature. Incorporating the content hash prevents an attestation for action x being used instead with action y. The verifier also needs to check that attestations are not reused, so the attester includes a nonce in the attestation token.

It is possible that a bot can monitor user actions, and submit malicious content to the attester whenever the user uses an input device. This would allow attestations to be created for malicious content, which means upstream software cannot blindly trust attested-for content. To reduce the impact of this attack, the paper suggests rate-limiting attestations to one per second.


I liked how the paper discussed an alternative approach to the same problem (having the attester track keyboard and mouse inputs, and then match that recorded history against the content that is to be attested, looking for a correspondence). Many papers present the solution they chose as the only alternative, when in fact it usually represents only one point in a much richer design space.

In some sense, the inverse of the proposed functionality would be more useful: i.e. being able to assert “this content was definitely bot-generated.” As proposed, it might be very hard for upstream services to make use of the certifications unless this idea saw widespread adoption. For example, suppose that 0.1% of your traffic is guaranteed by NAB to be human-generated. The remaining traffic may or may not be bot-generated, so you effectively can’t discriminate against it.

The paper suggests that, using this approach, human-generated content on a bot-infested machine can be effectively distinguished from bot-generated traffic. This seems pretty unlikely: the bot software can simply suppress all outgoing attestations (e.g. by installing a rootkit and interfering with the API used to request attestations), leaving upstream software in the same state as they would be without NAB.

Leave a comment

Filed under Paper Summaries

“BotGraph: Large Scale Spamming Botnet Detection”

Botnets are used for various nefarious ends; one popular use is sending spam email by creating and then using accounts on free webmail providers like Hotmail and Google Mail. In the past, CAPTCHAs have been used to try to prevent this, but they are increasingly ineffective. Hence, the BotGraph paper proposes an algorithm for detecting bot-created accounts by analyzing user access behavior. They describe the algorithm, its implementation with Dryad, and present experimental results from real-world Hotmail access logs.


BotGraph employs three different ideas for detecting automated users:

  1. They regard sudden spikes in the number of accounts created by a single IP as suspicious. Hence, they use a simple exponentially-weighted moving average (EWMA) to detect such spikes, and throttle/rate-limit account signups from suspicious IPs. This has the effect of making it more difficult for spammers to obtain webmail accounts.
  2. They argue that the number of bot machines will be much smaller than the number of bot-created webmail accounts; hence, one bot machine will access a large number of accounts. They also argue that a single bot-created webmail account will be accessed from multiple bots on different autonomous systems (ASs), due to churn in the botnet (although this seems pretty unconvincing to me), and the fact that rate-limiting makes it more difficult to create large numbers of bot accounts. Hence, they look for pairs of user accounts that had logins from an overlapping set of ASs.
  3. Finally, they consider a user’s email-sending behavior:

    Normal users usually send a small number of emails per day on average, with different email sizes. On the other hand, bot-users usually send many emails per day, with identical or similar email sizes

    Hence, they regard users who send 3+ emails per day as “suspicious”; they also regard as suspicious users whose email-size distributions are dissimilar from most other users.

They use feature #1 primarily to rate-throttle new account creations. Feature #3 is used to avoid false positives.

Feature #2 is the primary focus of the paper. They construct a user-user graph with a vertex for each user account. Each edge has a weight that gives the number of shared login ASs — that is, the number of ASs that were used to login to both accounts. Within the user-user graph, they look for connected components with an edge weight over a threshold T: they begin by finding components with T=2, and then iteratively increasingly the threshold until each component has no more than 100 members.


They describe two ways to implement the construction of the user-user graph using a data-parallel system like MapReduce or Dryad, using the login log from Hotmail (~220GB for one month of data):

  1. Partition the login records by client IP. Emit an intermediate record (i, j, k) for each shared login on the same day from AS k to accounts i and j. In the reduce phase, group on (i, j) and sum. The problem with this approach is that it requires a lot of communication: most edges in the user-user graph have weight 1, and hence can be dropped, but this approach still requires sending them over the network.
  2. Partition the login records by user name. For each partition, compute a “summary” of the IP-day keys present for users in that partition (the paper doesn’t specify the nature of the summary, but presumably it is analogous to a Bloom filter). Each partition sends its summary to every other partition. Using the summaries, each partition can exchange login records with other partitions in a way that allows edge weights to be computed, but doesn’t require sending weight 1 edges over the network.

They argue that the second method can’t be implemented with Map and Reduce, although I’m not sure if I believe them: multicasting can be done by writing to HDFS, as can shipping data between logical partitions.


I think the major problem with their experimental results is that there’s effectively no adversary: botnet operators presumably weren’t aware of this technique when the experiments were performed. Hence, they haven’t adapted their tactics — which might actually be quite easy to do.

For example, it seems like it would be quite easy to defeat their EWMA-based throttling by simply increasing the number of signups/time gradually. Essentially, the bot machine acts like an HTTP proxy with a gradually-increasing user population. One can imagine such a bot even mimicking the traffic patterns exhibited by a real-world proxy (e.g. increase at 9AM, decrease at 5PM). Certainly using a simple EWMA seems too primitive to defeat a dedicated adversary.

Similarly, it also seems quite easy to avoid sharing a single webmail account among multiple botnets: simply assign a single webmail account to a single bot machine, and don’t reuse webmail accounts if the bot machine becomes inaccessible. The idea, again, is to simulate an HTTP proxy that accesses a large number of webmail accounts. The paper’s argument that “churn” requires reuse of webmail accounts “to maximize bot-account utilization” is unconvincing and unsubstantiated. Since this is the entire principle upon which their technique is based, I’d be quite concerned that a relatively simple adaptation on the part of botnet operators would make this analysis ineffective.

I thought the paper’s wide-eyed tone toward using MapReduce-style systems for graph algorithms was annoying. Lots of people do large-scale graph algorithms using MapReduce-style systems; in fact, that’s one of the main things MapReduce was originally designed for (e.g. computing PageRank). The paper is not novel in this respect, and I was surprised that they didn’t cite one of the many prior papers on this subject.

1 Comment

Filed under Paper Summaries

“Cutting the Electric Bill for Internet-Scale Systems”

This paper begins with three observations:

  1. Energy-related costs are an increasingly large portion of total data center operating expenses.
  2. The cost of electricity can vary significantly between different times and between different regions at the same time.
  3. Many distributed systems already have the ability to dynamically route requests to different hosts and physical regions (for example, most CDNs try to minimize client latency and bandwidth costs by routing requests to a data center “close” to the client.)

Therefore, the paper investigates the feasibility of shifting load among replicas of a service that are located in different geographic regions, according to the current price of electricity in each region. For this to be effective, several things must be true:

  1. There must be significant variation in the price of electricity available in different regions at the same time.
  2. Data centers must be energy proportional: as the load on a data center is decreased by a factor of k, its energy usage should decrease by the same factor.
  3. Routing traffic to minimize the cost of electricity may result in increasing client latency and using more bandwidth (since cheap power might be far away from the client); the additional routers traversed might also use additional energy.

To answer the first question, the authors conduct a detailed empirical study of the cost of energy in different regions across the US, and compare that information with traffic logs from Akamai’s CDN. The authors use the Akamai traffic data to estimate how much the cost of electricity could be reduced by routing requests to the cheapest available electricity source, subject to various additional constraints.

The authors don’t do much to address the second question: they admit that the effectiveness of this technique depends heavily on energy-proportionality, but most computing equipment is not very energy-proportional (idle power consumption of a single system is typically ~60% of peak power usage, for example). Since energy-proportionality is the subject of much recent research, they express the hope that future hardware will be more energy-proportional. Finally, they carefully consider the impact of electricity-price-based routing on other optimization goals: for example, they consider only changing routes in a way that doesn’t result in any increased bandwidth charges (due to the “95-5” pricing scheme that most bandwidth providers use). A realistic implementation of this technique would consider electricity cost as one factor in a multi-variable optimization problem: we want to simultaneously minimize electricity cost, minimize client-perceived latency and minimize bandwidth charges, for example.

Summary of Results

The authors found significant asymmetry in electricity prices between geographic areas; furthermore, this asymmetry was dynamic (different regions were cheaper at different times). These are promising results for dynamic routing of requests based on electricity prices.

When cluster energy usage is completely proportional to load and bandwidth cost is not considered, price-sensitive routing can reduce energy costs by ~40%. The savings drop to only 5% if the energy-proportionality of current hardware is used, and the savings drop to a third of that if we are constrained to not increase bandwidth costs at all (assuming 95-5 pricing). Hence, this technique is only really effective if energy-proportional data centers are widely deployed.


I thought this was a great paper. The basic idea is simple, but their empirical study of the US electricity market was carefully done, and the results are instructive.

One interesting question is what would happen to the electricity market if techniques like these were widely deployed. Essentially, electricity consumption would become more price-elastic. When a given region offers a lower price, demand could move to that region quite quickly, which might act to drive up the price. Conversely, it would lower demand in higher-priced regions, lowering the price — and hence benefiting more inelastic energy consumers in that region.


Filed under Paper Summaries

“Scalable Reliable Multicast”

The SRM (Scalable Reliable Multicast) argues that, unlike reliable unicast protocols, different applications vary widely in their requirements for reliable multicast delivery. Hence:

One cannot make a single reliable multicast delivery scheme that simultaneously meets the functionality, scalability, and efficiency requirements of all applications.

Hence, they propose a model in which a generic skeleton is “fleshed out with application specific details.” The application provides a means to talk about how much data has been sent and received (“application data units”, unlike the connection-state oriented acknowledgments in TCP), a means for allocating bandwidth among the members of the group, and a means for individual nodes to decide how to allocate their local outgoing bandwidth. Following this model, their multicast framework only provides best-effort packet delivery, with possible duplication and reordering of packets — they believe that applications can build stronger reliable delivery and ordering properties on top of their framework, as needed. Best-effort multicast is actually implemented using IP multicast.

Unicast vs. Multicast

The authors point out two differences between reliable delivery for unicast vs. multicast that I thought were interesting:

  • In any reliable delivery protocol, one party must take responsibility for detecting lost data and retransmitting it. In unicast, either the sender or receiver can play this role equally well (TCP uses the sender, other protocols like NETBLT use the receiver). In multicast, sender-side delivery state is problematic: the sender must track the set of active recipients and the current state of each recipient, which is expensive, and difficult to do as the multicast group changes. In some sense, the whole point of multicast is to relieve the sender of that responsibility. Hence, they argue that receiver-side delivery state is better for multicast: group membership isn’t relevant, and the burden of keeping per-receiver state is avoided.
  • A reliable delivery protocol also needs a vocabulary to talk about how much data has been sent or received. Typical unicast protocols use a vocabulary based on communication state (typically either bytes or packets (segments)). This is not ideal for multicast, because a newly-joining recipient doesn’t share that communication state, and hence can’t interpret byte-oriented or packet-oriented messages. SRM instead argues for talking in terms of “application data units.”

wb Framework

The example multicast application in the paper is wb, a shared distributed whiteboard. This has the advantage that protocol commands (e.g. draw shape X at location Y) are mostly idempotent (if associated with a timestamp), so the underlying multicast protocol doesn’t need to enforce a total order on deliveries.

In wb, new operations are multicast to the entire group. When a receiver detects a loss (by detecting gaps in sequence numbers), it starts a “repair request” timer. The timer value is determined by the distance of the receiver from the original data source. When the timer expires, the recipient multicasts a “repair request” to the entire group, asking for the missing data. If a recipient sees another repair request for the same data before its timer expires, it suppresses its own repair request. Retransmission of missing data is handled similarly: when nodes receive repair requests for data they have seen, they start a response timer based on their distance from the repair request source. When the timer expires, they multicast the requested data to the entire group. Any other nodes that have a response timer for the requested data suppress their own timers. They also propose a bandwidth allocation scheme to divide the available bandwidth among new data operations and repair data.

Tuning the Request/Response Repair Timers

It is important that the repair request and repair response timers at different nodes be de-synchronized, to avoid redundant messages. The paper observes that certain topologies require methods for achieving de-synchronization: in a simple “chain” topology, seeding the timer with network distance is sufficient. For a “star” topology, all nodes are the same distance from the data source, so randomization must be used. A combination of these techniques must be used for a tree topology. Rather than requiring the timers be tuned for each individual network, they instead propose an adaptive algorithm that uses prior request/response behavior to tune the timers automatically.

Leave a comment

Filed under Paper Summaries

“Skilled in the Art of Being Idle”

Skilled in the Art of Being Idle” looks at how to reduce energy consumption by network end hosts (primarily desktops and laptops). Modern computers have various “sleep” states that allow reduced power consumption during idle periods. However, putting a computer to sleep has several costs:

  1. Transitioning into and out of a sleep state requires time (the paper cites a recent paper that found that typical machines take 3-8 seconds to enter “S3” sleep state, and 3-5 seconds to resume).
  2. Sleeping hosts cannot respond to network packets; hence, they can lose their network presence (e.g. DHCP lease can expire and be reassigned to another host). They also cannot run periodic tasks (e.g. backup or virus scanning).

A naive approach would put idle nodes to sleep, and then awaken them (via established “wake-on-LAN” support) when a packet is delivered to the node. This is insufficient: the authors demonstrate that in both home and office environments, the inter-arrival time of packets destined for idle computers is too small, so the cost of transitioning into and out of sleep state would negate any significant power savings. To avoid this problem, prior work has proposed using a proxy to handle network traffic intended for a sleeping node. The proxy can either ignore the traffic (if appropriate); handle the packet itself (e.g. by responding to an ARP query for the sleeping node’s IP), or it can awaken the sleeping node and forward the packet to it, if necessary. The effectiveness of proxying therefore depends on most traffic for an idle node fitting into the first two categories.

The paper is an empirical study of 250 machines owned by Intel employees, to assess the need for proxying in practice; the potential benefit of proxying; the network traffic that must be handled by a proxy; and so on.

Summary of Findings

  • Most machines are idle, most of the time — on average, machines were idle 90% of the time.
  • Home and office idle network traffic is markedly different, in both inter-arrival time of packets for idle machines (offices have more such traffic), and the nature of this traffic.
  • A power-saving proxy would need to handle broadcast, multicast, and unicast traffic to achieve significant gains. However, broadcast and multicast traffic represents “low-hanging fruit”: a proxy that handles only broadcast and multicast traffic would recover 80% of the idle time in home environments, and over 50% of the idle time in office environments.
  • For broadcast, address resolution (ARP, NBNS) and service discovery (SSDP for UPnP devices) protocols are the dominant sources of traffic for idle nodes. Both kinds of traffic are easy to proxy.
  • For multicast environments, routing traffic (HSRP, PIM) is the dominant source of traffic for idle nodes in an office environment. In a home environment, service discovery (SSDP) is dominant.
  • Their results for unicast traffic are less clear. They argue that only outgoing TCP connections dominate unicast traffic for idle nodes, and that less than 25% of this traffic is the result of some action a node initiated before becoming idle (and hence might need to be maintained in the idle state). They argue that outgoing traffic initiated while the machine is idle can often be batched together, or avoided entirely.


Filed under Paper Summaries

“Scalable Application Layer Multicast”

Multicast is clearly an efficient technique for applications with one-to-many communication patterns, but the deployment of in-network multicast has been slow. Therefore, there have been a number of proposals for implementing multicast at the application level, as an overlay over the physical network. This is not as efficient as true in-network multicast (because the same message may be sent over the same link multiple times), but is much more flexible and easier to deploy.

Scalable Application Layer Multicast” is one such proposal for application-layer multicast via an overlay network; their proposed protocol is called NICE. Their focus is on applications that require low-latency delivery of relatively low-bandwidth data streams to a large set of recipients, although they argue that their techniques are also applicable to high-volume streams with some minor changes.

Network Topology

In NICE, nodes are arranged into layers; a single node can appear in more than one layer. Every node belongs to the lowest layer, L0. The nodes in a layer are arranged (automatically) into clusters of nodes that are “near” to one another (in terms of network distance/latency). Each cluster has a leader, which is the node in the “center” of the cluster (NICE tries to make the leader the node that has the smallest maximal latency to the other nodes in the cluster). Layer L1 consists of all the cluster leaders from L0; the clustering algorithm is applied to L1 in turn, yielding another set of cluster leaders which form L2, and so forth. The height of the tree is determined by the number of nodes and a constant k: each cluster contains between k and 3k-1 nodes.

To multicast a data message, a node forwards the message to every cluster peer in every layer in which the node belongs, except that a node never forwards a message back to the message’s previous hop.


To join the multicast group, a node begins by contacting a designated node called the Rendezvous Point (RP). The RP is typically the root of the NICE tree. The joining node walks down the tree from the root, choosing the child node that is closest to it (lowest latency).

Cluster leaders periodically check whether the cluster size constraint (k <= size <= 3k-1) has been violated; if so, they initiate a cluster merge or split, as appropriate. Splitting a cluster into two clusters is done by trying to minimize the maximum of the radii of the resulting clusters.

All the nodes in a cluster periodically sends heartbeats to each of its cluster peers. This is used to detect node failures, and to update pair-wise latency information for nodes. If a cluster leader fails or deliberately leaves the NICE group, a new leader is chosen by the same heuristic (minimize maximum latency from new center to any cluster peer). A new leader may also be chosen if the pair-wise latencies in the cluster drift sufficiently far to make selecting a new leader justified. Also, each member of every layer i periodically probes its latency to the nodes in layer i+1 — if the node is closer to another i+1 layer node, it moves to the corresponding cluster in layer i.


I didn’t see that the authors provided any grounds for choosing an appropriate k value (essentially the tree fan-in), which seems like it would be an important parameter.

1 Comment

Filed under Paper Summaries

“ExOr: Opportunistic Multi-Hop Routing for Wireless Networks”

Wired networks are traditionally viewed as a unicast medium: packets are sent from A to B. Because the underlying medium can have a degree of sharing (e.g. multiple hosts connected to the same Ethernet hub), steps must be taken to avoid interference between concurrent senders (e.g. CSMA).

In a wireless network, the medium is shared: any hosts in range of a radio transmission can receive it. If the goal is to use a traditional unicast routing protocol over a wireless link, this makes interference between senders a more challenging problem (e.g. as discussed in the MACAW paper). However, rather than viewing broadcast as an inconvenience we need to work around, the broadcast nature of wireless can also be leveraged to design new protocols that would be infeasible for a wired network.

A nice example of a broadcast-oriented protocol for wireless is “ExOr: Opportunistic Multi-Hop Routing for Wireless Networks“. In a traditional routing design, a sender chooses the next hop that should receive a packet, and then unicasts the packet to that destination. In ExOr, a sender broadcasts a batch of packets to a group of nodes simultaneously. The set of batch recipients coordinate to try to ensure that each packet in the batch is forwarded onward. The recipient of a packet is only chosen after the packet is sent, which allows ExOr to “opportunistically” take advantage of links that have high loss rates. ExOr assumes that node reception probabilities are mostly independent and mostly decrease with distance — both of which are probably pretty reasonable assumptions.


The source collects a batch of packets destined for the same host, and then chooses a forwarding list for the batch. The forwarding list is a list of nodes, sorted by the expected cost of delivering packets from that node to the eventual destination node. The cost metric is similar to ETX (expected number of transmissions required to send the packet to the destination via unicast, including retransmissions); unlike ETX, it only considers the forward delivery probability. While it would be possible to including all possible recipient nodes in the forwarding list, this would increase the coordination cost among the forwarders, so ExOr only includes “likely” recipients in the list (estimated 10%+ chance of receiving a broadcast packet).

Each packet contains a batch map, which holds the sender’s estimate of the highest priority (according to the cost metric) node to have received each packet in the batch. When a node receives a packet, it uses the packet’s batch map to update its local batch map. This means that batch map information propagates through the nodes, carrying information about packet reception from high priority nodes to lower priority nodes.

After the source broadcasts a batch, each member of the forwarding list broadcasts, ordered by descending priority (ETX value). Each node broadcasts the packets it received, along with its updated batch map. Nodes coordinate to schedule their transmissions to try to avoid interference (e.g. by estimating when each node’s expected transmission time is, according to ETX value and batch map contents). The protocol continues cycling through the nodes in priority order. At the end of each cycle, the ultimate destination broadcasts its batch map 10 times; at the beginning of each cycle, the source resends packets that weren’t received by any node (by observing batch map contents). The ExOr scheme stops when 90% of the packets in a batch have been transferred, and uses a traditional routing policy to deliver the remainder of the batch.


Overall, ExOr is a really neat idea. That said, it is at best a special-purpose solution, because of the high latency it incurs for batching and multiple transmission rounds. Similarly, the need for a split TCP proxy is pretty ugly. A complete solution to wireless routing would perhaps adaptively switch between latency-oriented and bandwidth-oriented routing techniques, depending on the nature of the traffic.

It seems arbitrary to me that ExOr works on a batch-by-batch basis—that almost seems like establishing a new TCP connection for each window’s worth of data. The amount of useful work done in each transmission round decreases, until the protocol imposes an arbitrary cutoff at 90% and switches to a traditional routing protocol. Instead, wouldn’t it be more sensible for ExOr to allow new packets to be inserted into the batch as the destination confirms the delivery of packets? This would essential emulate the “conservation of packets” principle from the Van Jacobson paper.


Filed under Paper Summaries