Scripts for writing papers

One of the nice things about LaTeX is that the definitive version of the document is stored as plain text. This simplifies version control and avoids the need to use a complicated editor with an opaque binary file format. It also means that writing scripts to check for writing mistakes is relatively easy. Of course, detecting grammatical errors in general is very challenging, but certain classes of mistakes are easy to check for. Matt Might has posted a few scripts which I’ve found quite useful.

I had cause to write another script to check for consistent use of hyphens. For example, both “non-deterministic” and “nondeterministic” are acceptable, but a single document should pick one variant and use it consistently. You can find the script here; suggestions for improvement (or rather, patches) are welcome.
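The core of such a check fits in a few lines of Ruby. This is a hypothetical sketch, not the actual script: count every word, then flag any hyphenated word whose joined form also appears.

```ruby
# Hypothetical sketch of a hyphen-consistency checker (the real script
# may work differently): count every word in the text, then report any
# hyphenated word whose joined (hyphen-free) form also appears.

def hyphen_inconsistencies(text)
  counts = Hash.new(0)
  text.scan(/[A-Za-z]+(?:-[A-Za-z]+)*/) { |w| counts[w.downcase] += 1 }
  counts.keys
        .select { |w| w.include?("-") && counts.key?(w.delete("-")) }
        .map    { |w| [w, w.delete("-")] }
end

# Example: flag "non-deterministic" vs. "nondeterministic" in one pass.
hyphen_inconsistencies("a non-deterministic choice, a nondeterministic machine")
# => [["non-deterministic", "nondeterministic"]]
```

A real version would also want to report line numbers and perhaps strip LaTeX commands before counting, but the basic approach is just word counting.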

What other common mistakes are amenable to a simple script? Checking for consistent use of the Oxford comma should be straightforward, for example. There are also scripts like chktex that look for mistakes in LaTeX usage (although I’ve personally found chktex to be too noisy to be useful).

Naturally, there are limits on what you can do without parsing natural language. After The Deadline claims to provide some NLP-based grammar analysis, but I haven’t used it personally.


  • atdtool, a command-line interface to After The Deadline
  • diction, a GNU tool to check for common writing mistakes
  • LanguageTool, another open source style and grammar checker
  • TextLint, a writing style checker


Filed under Uncategorized

Too Good To Be Believed

In the (excellent) Sinfonia SOSP ’07 paper, the authors compare a group communication system (GCS) built using Sinfonia with the open source Spread GCS. Although I like the Sinfonia paper a lot, I thought this evaluation was actually detrimental to the paper. The authors present several graphs comparing the performance of SinfoniaGCS with Spread, such as this one:

Performance comparison of Spread and SinfoniaGCS

Clearly, SinfoniaGCS vastly outperforms Spread in this configuration. At first glance, this might seem like a great experiment: the authors have demonstrated that you can use Sinfonia to build a high-performance GCS, right?

To me, including this evaluation in the paper is not helpful, because it raises more questions than it answers. There are two possibilities:

  1. Spread’s performance is truly terrible. In that case, what are we to learn from comparing the performance of SinfoniaGCS with a worst-in-class alternative?
  2. Spread is misconfigured. The paper notes that Spread wasn’t configured to use IP broadcast or multicast, and that SinfoniaGCS was allowed to batch together 128 messages at a time; either change could have a huge performance impact. Again, there is little to learn from an apples-to-oranges comparison between a carefully tuned Sinfonia system and a misconfigured Spread system.

A convincing performance study would show that by using the Sinfonia infrastructure, one can build a GCS that approaches the performance of an optimized GCS written from scratch (and perhaps that using Sinfonia leads to a smaller and simpler GCS implementation). Alternatively, if using Sinfonia really does allow dramatically better performance, the reasons for the difference should be explored: Sinfonia is not magic, and if the authors had identified the reasons why traditional GCS designs perform poorly on modern datacenter networks, that would have been an interesting result.

Instead, the authors merely speculate that using IP broadcast/multicast would improve Spread performance and leave it at that. Unfortunately, the result is a performance study from which we can learn very little.

Leave a comment

Filed under Research Notes

Benchmarking Dataflow Graphs in Ruby

Our group at UC Berkeley recently released Bloom, a new programming language for distributed computing. Specifically, we released an initial alpha of “Bud,” which is a Ruby DSL that lets you embed (distributed) declarative rules inside Ruby classes.

The current Bud prototype has a (deliberately) pretty naive evaluation scheme, so I’ve been idly thinking about how to improve its performance. The standard way to evaluate Datalog rules is fairly similar to how a typical relational database evaluates queries: the system produces a dataflow graph (“query plan”), which consists of nodes (e.g., project, join, or scan operators) connected by edges. The dataflow graph typically starts with the input relations (e.g., base tables, the EDB for Datalog) and computes the derived relations (e.g., the user’s query in the case of a database, the IDB relations for Datalog). There is a lot more to building an efficient Datalog evaluator, but I’ll leave that for future posts.

To get good performance for Bloom programs, we’ll likely want to generate a dataflow graph. So how should we evaluate this graph? Two options come to mind:

  1. Represent nodes in the graph as Ruby objects and push tuples through the graph by evaluating Ruby.
  2. Represent nodes in the graph using C code and connect the Ruby query planner to the C runtime in some fashion (e.g., have the planner generate a description of the desired dataflow and parse that description in a C module, have the Ruby code generate C source code directly, or use something like LLVM to generate machine code at runtime).

Naturally the first option would be easier, but how does it perform? I wrote a quick-and-dirty microbenchmark to get a rough idea. The benchmark constructs a dataflow graph of 7 nodes: 4 “predicate” nodes that check whether the tuple matches a constant, 2 “join” nodes that probe a 6,000-element hash table (modeling half of a symmetric hash join), and 1 “sink” node that appends the tuple to an array (modeling storage into an in-memory relation). I ran 1 million tuples through the graph, and repeated this 10 times (dropping the first run to allow caches to warm up). I ran the benchmark on several Ruby implementations:
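For a sense of what the pure-Ruby option looks like, here is a stripped-down sketch of such a dataflow (illustrative only; the class names and details differ from the actual benchmark):

```ruby
# Minimal pure-Ruby dataflow: each node is an object with a push method
# that processes a tuple (an array) and hands it to the next node.
# Illustrative sketch; the real benchmark code differs in detail.

class Predicate
  def initialize(field, value, next_node)
    @field, @value, @next_node = field, value, next_node
  end
  def push(tuple)
    @next_node.push(tuple) if tuple[@field] == @value
  end
end

class HashJoin  # models half of a symmetric hash join: probe a table
  def initialize(field, table, next_node)
    @field, @table, @next_node = field, table, next_node
  end
  def push(tuple)
    @next_node.push(tuple) if @table.key?(tuple[@field])
  end
end

class Sink  # models storage into an in-memory relation
  attr_reader :rows
  def initialize; @rows = []; end
  def push(tuple); @rows << tuple; end
end

# Wire up a small chain: predicate -> join -> sink, with a 6,000-entry
# hash table to probe, as in the benchmark.
sink  = Sink.new
table = Hash[(0...6000).map { |i| [i, true] }]
join  = HashJoin.new(1, table, sink)
head  = Predicate.new(0, :a, join)

head.push([:a, 42])   # passes the predicate and matches the table probe
head.push([:b, 42])   # filtered out by the predicate
```

The benchmark loop then just pushes tuples through `head` a million times; every tuple pays the cost of one Ruby method dispatch per node.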

  • MRI 1.8.7-p334
  • MRI 1.9.2-p180
  • JRuby 1.6.0
  • Rubinius git checkout from April 11th (version string “1.2.4dev (1.8.7 73390817 yyyy-mm-dd JI)”)


The average time taken to route 1 million tuples through the dataflow graph, for each Ruby implementation:

I was surprised to see how much faster MRI 1.9 is than MRI 1.8. It is also interesting that JRuby does pretty decently.

So how should we interpret these results? Well, let’s compare the performance of the Ruby implementations with a more-or-less equivalent dataflow microbenchmark written in C:


After adding the C results, we can put the Ruby benchmark results into perspective: although recent Ruby implementations are considerably faster than MRI 1.8, they are still much slower (> 10x) than C for this particular microbenchmark. Is that surprising? Not really — Ruby is certainly not an ideal choice for high-performance, data-intensive computing. Despite recent progress in building faster Ruby implementations, the performance gap (for this microbenchmark) remains considerable.

Obviously, take these results with a grain of salt: many more factors would contribute to the performance of a complete system than a trivial microbenchmark like this. There are also factors that are difficult to measure in a small microbenchmark: for example, using a pure Ruby implementation would force the GC to manage the state stored in Bloom collections, which would result in additional performance overhead.

Additional Details

  • Benchmarks performed on a late 2010 MacBook Air (2.13 GHz Core 2), running OS X 10.6.7. The C benchmark was compiled with Apple’s GCC (4.2.1) at optimization level “-O2”, and was (dynamically) linked against APR 1.4.2.
  • Ruby source code
  • C source code
  • Raw benchmark data

1 Comment

Filed under Research Notes

“Not-a-Bot: Improving Service Availability in the Face of Botnet Attacks”

This paper focuses on distinguishing human-generated activity from bot-generated activity. Then, human-generated activity can be given preferential treatment (e.g. favorable routing of traffic, not being treated as spam). Their measure for distinguishing human-generated actions from machine-generated actions is pretty coarse and imprecise: an action is human-generated if it is preceded by keyboard or mouse input within a certain amount of time.

To implement this scheme, they go into considerable (exhaustive) detail about how to use the Trusted Platform Module (TPM) to build a trusted path between the physical input devices (keyboard, mouse) and a small piece of software called the attester. To certify an action as human-generated, applications ask the attester for an attestation, passing a hash of the content to be attested. If there has been user input within a predefined period, the attester returns a cryptographically signed token that can be attached to the user action. When an upstream service receives the user action (e.g. HTTP request, email), it can verify the attestation by hashing the content of the action and checking the cryptographic signature. Incorporating the content hash prevents an attestation for action x from being used instead with action y. The verifier also needs to check that attestations are not reused, so the attester includes a nonce in the attestation token.
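The protocol can be sketched as follows, with an HMAC standing in for the TPM-backed signature; all class and method names here are illustrative, not from the paper:

```ruby
require 'openssl'
require 'digest'
require 'securerandom'

# Sketch of the attest/verify handshake, with a shared-key HMAC standing
# in for the TPM-backed signature. Names and constants are illustrative.

class Attester
  INPUT_WINDOW = 1.0  # seconds of "recent user input" (assumed value)

  def initialize(key)
    @key = key
    @last_input = nil
  end

  def record_user_input(time = Time.now)
    @last_input = time
  end

  # Returns a token binding the content hash and a fresh nonce, or nil
  # if there has been no recent keyboard/mouse input.
  def attest(content, now = Time.now)
    return nil if @last_input.nil? || now - @last_input > INPUT_WINDOW
    nonce  = SecureRandom.hex(16)
    digest = Digest::SHA256.hexdigest(content)
    sig    = OpenSSL::HMAC.hexdigest('SHA256', @key, nonce + digest)
    { nonce: nonce, digest: digest, sig: sig }
  end
end

class Verifier
  def initialize(key)
    @key = key
    @seen_nonces = {}  # prevents attestation replay
  end

  def verify(content, token)
    return false if token.nil? || @seen_nonces[token[:nonce]]
    digest   = Digest::SHA256.hexdigest(content)
    expected = OpenSSL::HMAC.hexdigest('SHA256', @key, token[:nonce] + digest)
    return false unless digest == token[:digest] && expected == token[:sig]
    @seen_nonces[token[:nonce]] = true
    true
  end
end
```

Because the hash is bound into the signature, swapping the attested content is detected, and the nonce check stops the same token from being presented twice.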

A bot could monitor user actions and submit malicious content to the attester whenever the user touches an input device. This would allow attestations to be created for malicious content, which means upstream software cannot blindly trust attested-for content. To reduce the impact of this attack, the paper suggests rate-limiting attestations to one per second.


I liked how the paper discussed an alternative approach to the same problem (having the attester track keyboard and mouse inputs, and then match that recorded history against the content that is to be attested, looking for a correspondence). Many papers present the solution they chose as the only alternative, when in fact it usually represents only one point in a much richer design space.

In some sense, the inverse of the proposed functionality would be more useful: i.e. being able to assert “this content was definitely bot-generated.” As proposed, it might be very hard for upstream services to make use of the certifications unless this idea saw widespread adoption. For example, suppose that 0.1% of your traffic is guaranteed by NAB to be human-generated. The remaining traffic may or may not be bot-generated, so you effectively can’t discriminate against it.

The paper suggests that, using this approach, human-generated content on a bot-infested machine can be effectively distinguished from bot-generated traffic. This seems pretty unlikely: the bot software can simply suppress all outgoing attestations (e.g. by installing a rootkit and interfering with the API used to request attestations), leaving upstream software in the same state it would be in without NAB.

Leave a comment

Filed under Paper Summaries

“BotGraph: Large Scale Spamming Botnet Detection”

Botnets are used for various nefarious ends; one popular use is sending spam email by creating and then using accounts on free webmail providers like Hotmail and Google Mail. In the past, CAPTCHAs have been used to try to prevent this, but they are increasingly ineffective. Hence, the BotGraph paper proposes an algorithm for detecting bot-created accounts by analyzing user access behavior. They describe the algorithm, its implementation with Dryad, and present experimental results from real-world Hotmail access logs.


BotGraph employs three different ideas for detecting automated users:

  1. They regard sudden spikes in the number of accounts created by a single IP as suspicious. Hence, they use a simple exponentially-weighted moving average (EWMA) to detect such spikes, and throttle/rate-limit account signups from suspicious IPs. This has the effect of making it more difficult for spammers to obtain webmail accounts.
  2. They argue that the number of bot machines will be much smaller than the number of bot-created webmail accounts; hence, one bot machine will access a large number of accounts. They also argue that a single bot-created webmail account will be accessed from multiple bots on different autonomous systems (ASs), due to churn in the botnet (although this seems pretty unconvincing to me), and the fact that rate-limiting makes it more difficult to create large numbers of bot accounts. Hence, they look for pairs of user accounts that had logins from an overlapping set of ASs.
  3. Finally, they consider a user’s email-sending behavior:

    Normal users usually send a small number of emails per day on average, with different email sizes. On the other hand, bot-users usually send many emails per day, with identical or similar email sizes

    Hence, they regard users who send 3+ emails per day as “suspicious”; they also regard as suspicious users whose email-size distributions are dissimilar from most other users.

They use feature #1 primarily to rate-throttle new account creations. Feature #3 is used to avoid false positives.
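The spike detection in feature #1 might look something like the following sketch; the smoothing factor and threshold are invented values, not taken from the paper:

```ruby
# Sketch of EWMA-based signup spike detection (feature #1). The smoothing
# factor and threshold are made-up values, not from the paper, and a real
# implementation would probably exclude spike days from the average.

class SignupMonitor
  ALPHA     = 0.2   # EWMA smoothing factor (assumed)
  THRESHOLD = 3.0   # flag when a day's count exceeds 3x the moving average

  def initialize
    @ewma = nil
  end

  # Feed one day's signup count for an IP; returns true if the count
  # looks like a sudden spike relative to the history.
  def suspicious?(count)
    if @ewma.nil?
      @ewma = count.to_f
      return false
    end
    spike = count > THRESHOLD * @ewma
    @ewma = ALPHA * count + (1 - ALPHA) * @ewma
    spike
  end
end
```

IPs flagged this way would then have their account signups throttled or rate-limited.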

Feature #2 is the primary focus of the paper. They construct a user-user graph with a vertex for each user account. Each edge has a weight that gives the number of shared login ASs — that is, the number of ASs that were used to login to both accounts. Within the user-user graph, they look for connected components with an edge weight over a threshold T: they begin by finding components with T=2, and then iteratively increase the threshold until each component has no more than 100 members.
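A sketch of that iterative component-splitting (illustrative code, not from the paper):

```ruby
# Sketch of the user-user graph analysis: find connected components using
# only edges with weight >= t, raising t until every component has at
# most max_size members. Illustrative code, not from the paper.

# edges: { [u, v] => weight }; returns arrays of vertices (components).
def components(vertices, edges, t)
  vset = vertices.each_with_object({}) { |v, h| h[v] = true }
  adj  = Hash.new { |h, k| h[k] = [] }
  edges.each do |(u, v), w|
    next if w < t || !vset[u] || !vset[v]
    adj[u] << v
    adj[v] << u
  end
  seen = {}
  vertices.each_with_object([]) do |v, comps|
    next if seen[v]
    comp, stack = [], [v]
    until stack.empty?      # depth-first search from v
      n = stack.pop
      next if seen[n]
      seen[n] = true
      comp << n
      stack.concat(adj[n])
    end
    comps << comp
  end
end

# Recursively split oversized components with an increasing threshold.
def suspicious_groups(vertices, edges, t = 2, max_size = 100)
  components(vertices, edges, t).flat_map do |comp|
    if comp.size <= max_size
      [comp]
    else
      suspicious_groups(comp, edges, t + 1, max_size)
    end
  end
end
```

Raising the threshold eventually removes every edge, so the recursion always terminates with components of size one at worst.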


They describe two ways to implement the construction of the user-user graph using a data-parallel system like MapReduce or Dryad, using the login log from Hotmail (~220GB for one month of data):

  1. Partition the login records by client IP. Emit an intermediate record (i, j, k) for each shared login on the same day from AS k to accounts i and j. In the reduce phase, group on (i, j) and sum. The problem with this approach is that it requires a lot of communication: most edges in the user-user graph have weight 1, and hence can be dropped, but this approach still requires sending them over the network.
  2. Partition the login records by user name. For each partition, compute a “summary” of the IP-day keys present for users in that partition (the paper doesn’t specify the nature of the summary, but presumably it is analogous to a Bloom filter). Each partition sends its summary to every other partition. Using the summaries, each partition can exchange login records with other partitions in a way that allows edge weights to be computed, but doesn’t require sending weight 1 edges over the network.

They argue that the second method can’t be implemented with Map and Reduce, although I’m not sure if I believe them: multicasting can be done by writing to HDFS, as can shipping data between logical partitions.
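The first (naive) plan is easy to sketch as plain map and reduce functions; the login record format here is an assumption:

```ruby
# Sketch of the naive user-user graph construction as map/reduce steps
# over login records. The record format [user, ip, as, day] is assumed.

# Map: group logins by (AS, day), then emit an intermediate record
# [[i, j], as] for every pair of distinct users that logged in from the
# same AS on the same day.
def map_phase(logins)
  pairs = []
  logins.group_by { |_user, _ip, as, day| [as, day] }.each do |(as, _day), recs|
    users = recs.map(&:first).uniq.sort
    users.combination(2) { |i, j| pairs << [[i, j], as] }
  end
  pairs
end

# Reduce: group on the user pair (i, j); the edge weight is the number
# of distinct shared login ASs.
def reduce_phase(pairs)
  pairs.group_by(&:first)
       .transform_values { |v| v.map(&:last).uniq.size }
end
```

Most of the emitted pairs end up as weight-1 edges that are discarded afterwards, which is exactly the wasted network traffic the second method avoids.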


I think the major problem with their experimental results is that there’s effectively no adversary: botnet operators presumably weren’t aware of this technique when the experiments were performed. Hence, they haven’t adapted their tactics — which might actually be quite easy to do.

For example, it seems like it would be quite easy to defeat their EWMA-based throttling by simply increasing the number of signups per day gradually. Essentially, the bot machine acts like an HTTP proxy with a gradually-increasing user population. One can imagine such a bot even mimicking the traffic patterns exhibited by a real-world proxy (e.g. increase at 9AM, decrease at 5PM). Certainly a simple EWMA seems too primitive to defeat a dedicated adversary.

Similarly, it also seems quite easy to avoid sharing a single webmail account among multiple botnets: simply assign a single webmail account to a single bot machine, and don’t reuse webmail accounts if the bot machine becomes inaccessible. The idea, again, is to simulate an HTTP proxy that accesses a large number of webmail accounts. The paper’s argument that “churn” requires reuse of webmail accounts “to maximize bot-account utilization” is unconvincing and unsubstantiated. Since this is the entire principle upon which their technique is based, I’d be quite concerned that a relatively simple adaptation on the part of botnet operators would make this analysis ineffective.

I thought the paper’s wide-eyed tone toward using MapReduce-style systems for graph algorithms was annoying. Lots of people do large-scale graph algorithms using MapReduce-style systems; in fact, that’s one of the main things MapReduce was originally designed for (e.g. computing PageRank). The paper is not novel in this respect, and I was surprised that they didn’t cite one of the many prior papers on this subject.

1 Comment

Filed under Paper Summaries

“Cutting the Electric Bill for Internet-Scale Systems”

This paper begins with three observations:

  1. Energy-related costs are an increasingly large portion of total data center operating expenses.
  2. The cost of electricity can vary significantly between different times and between different regions at the same time.
  3. Many distributed systems already have the ability to dynamically route requests to different hosts and physical regions (for example, most CDNs try to minimize client latency and bandwidth costs by routing requests to a data center “close” to the client).

Therefore, the paper investigates the feasibility of shifting load among replicas of a service that are located in different geographic regions, according to the current price of electricity in each region. For this to be effective, several things must be true:

  1. There must be significant variation in the price of electricity available in different regions at the same time.
  2. Data centers must be energy proportional: as the load on a data center is decreased by a factor of k, its energy usage should decrease by the same factor.
  3. The costs of the rerouting itself must not outweigh the savings: cheap power might be far away from the client, so routing traffic to minimize the cost of electricity may increase client latency and bandwidth usage, and the additional routers traversed might also use additional energy.

To answer the first question, the authors conduct a detailed empirical study of the cost of energy in different regions across the US, and compare that information with traffic logs from Akamai’s CDN. The authors use the Akamai traffic data to estimate how much the cost of electricity could be reduced by routing requests to the cheapest available electricity source, subject to various additional constraints.

The authors don’t do much to address the second question: they admit that the effectiveness of this technique depends heavily on energy-proportionality, but most computing equipment is not very energy-proportional (idle power consumption of a single system is typically ~60% of peak power usage, for example). Since energy-proportionality is the subject of much recent research, they express the hope that future hardware will be more energy-proportional. Finally, they carefully consider the impact of electricity-price-based routing on other optimization goals: for example, they consider only changing routes in a way that doesn’t result in any increased bandwidth charges (due to the “95-5” pricing scheme that most bandwidth providers use). A realistic implementation of this technique would consider electricity cost as one factor in a multi-variable optimization problem: we want to simultaneously minimize electricity cost, minimize client-perceived latency and minimize bandwidth charges, for example.
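A toy cost model illustrates why proportionality dominates the savings (the numbers here are invented, not the paper's):

```ruby
# Toy cost model showing why energy-proportionality matters for
# price-based routing (illustrative numbers, not from the paper).
# Power draw is modeled as idle + (peak - idle) * utilization,
# normalized so that peak draw is 1.0.

def power(utilization, idle_frac)
  idle_frac + (1.0 - idle_frac) * utilization
end

# Two equally-loaded clusters with electricity prices 1.0 and 0.5:
# shift half the expensive cluster's load to the cheap one and compute
# the fractional cost savings versus leaving the load in place.
def savings(idle_frac)
  baseline = (1.0 + 0.5) * power(0.5, idle_frac)
  shifted  = 1.0 * power(0.25, idle_frac) + 0.5 * power(0.75, idle_frac)
  (baseline - shifted) / baseline
end

savings(0.0)  # fully proportional hardware: about 17% cheaper
savings(0.6)  # hardware idling at 60% of peak: only about 4% cheaper
```

With a high idle draw, moving load away from a cluster barely reduces its power bill, so price-sensitive routing has little to work with.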

Summary of Results

The authors found significant asymmetry in electricity prices between geographic areas; furthermore, this asymmetry was dynamic (different regions were cheaper at different times). These are promising results for dynamic routing of requests based on electricity prices.

When cluster energy usage is completely proportional to load and bandwidth cost is not considered, price-sensitive routing can reduce energy costs by ~40%. The savings drop to only 5% if the energy-proportionality of current hardware is used, and the savings drop to a third of that if we are constrained to not increase bandwidth costs at all (assuming 95-5 pricing). Hence, this technique is only really effective if energy-proportional data centers are widely deployed.


I thought this was a great paper. The basic idea is simple, but their empirical study of the US electricity market was carefully done, and the results are instructive.

One interesting question is what would happen to the electricity market if techniques like these were widely deployed. Essentially, electricity consumption would become more price-elastic. When a given region offers a lower price, demand could move to that region quite quickly, which might act to drive up the price. Conversely, it would lower demand in higher-priced regions, lowering the price — and hence benefiting more inelastic energy consumers in that region.


Filed under Paper Summaries

“Scalable Reliable Multicast”

The SRM (Scalable Reliable Multicast) paper argues that, unlike with reliable unicast protocols, different applications vary widely in their requirements for reliable multicast delivery. Hence:

One cannot make a single reliable multicast delivery scheme that simultaneously meets the functionality, scalability, and efficiency requirements of all applications.

Accordingly, they propose a model in which a generic skeleton is “fleshed out with application specific details.” The application provides a means to talk about how much data has been sent and received (“application data units,” unlike the connection-state-oriented acknowledgments in TCP), a means for allocating bandwidth among the members of the group, and a means for individual nodes to decide how to allocate their local outgoing bandwidth. Following this model, their multicast framework provides only best-effort packet delivery, with possible duplication and reordering of packets — they believe that applications can build stronger reliable delivery and ordering properties on top of their framework, as needed. Best-effort multicast is actually implemented using IP multicast.

Unicast vs. Multicast

The authors point out two differences between reliable delivery for unicast vs. multicast that I thought were interesting:

  • In any reliable delivery protocol, one party must take responsibility for detecting lost data and retransmitting it. In unicast, either the sender or receiver can play this role equally well (TCP uses the sender, other protocols like NETBLT use the receiver). In multicast, sender-side delivery state is problematic: the sender must track the set of active recipients and the current state of each recipient, which is expensive, and difficult to do as the multicast group changes. In some sense, the whole point of multicast is to relieve the sender of that responsibility. Hence, they argue that receiver-side delivery state is better for multicast: group membership isn’t relevant, and the burden of keeping per-receiver state is avoided.
  • A reliable delivery protocol also needs a vocabulary to talk about how much data has been sent or received. Typical unicast protocols use a vocabulary based on communication state (typically either bytes or packets (segments)). This is not ideal for multicast, because a newly-joining recipient doesn’t share that communication state, and hence can’t interpret byte-oriented or packet-oriented messages. SRM instead argues for talking in terms of “application data units.”

wb Framework

The example multicast application in the paper is wb, a shared distributed whiteboard. This has the advantage that protocol commands (e.g. draw shape X at location Y) are mostly idempotent (if associated with a timestamp), so the underlying multicast protocol doesn’t need to enforce a total order on deliveries.

In wb, new operations are multicast to the entire group. When a receiver detects a loss (by detecting gaps in sequence numbers), it starts a “repair request” timer. The timer value is determined by the distance of the receiver from the original data source. When the timer expires, the recipient multicasts a “repair request” to the entire group, asking for the missing data. If a recipient sees another repair request for the same data before its timer expires, it suppresses its own repair request. Retransmission of missing data is handled similarly: when nodes receive repair requests for data they have seen, they start a response timer based on their distance from the repair request source. When the timer expires, they multicast the requested data to the entire group. Any other nodes that have a response timer for the requested data suppress their own timers. They also propose a bandwidth allocation scheme to divide the available bandwidth among new data operations and repair data.
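The request-suppression logic for a single receiver might be sketched as follows (the class name and constants are invented; the paper tunes the timer parameters adaptively):

```ruby
# Sketch of SRM's repair-request suppression for a single receiver.
# On detecting a loss, schedule a repair request at a time drawn from an
# interval scaled by the distance to the data source; overhearing another
# receiver's request for the same data cancels our own. Constants and
# names are illustrative, not from the paper.

class RepairScheduler
  C1 = 1.0  # timer scale factors (assumed; tuned adaptively in the paper)
  C2 = 1.0

  def initialize(distance_to_source, rng = Random.new)
    @distance = distance_to_source
    @pending  = {}  # data id => scheduled request time
    @rng      = rng
  end

  # Called when a gap in sequence numbers reveals a loss. The timer fires
  # uniformly at random in [C1*d, (C1+C2)*d] after the loss is detected.
  def loss_detected(data_id, now)
    low  = C1 * @distance
    high = (C1 + C2) * @distance
    @pending[data_id] ||= now + low + @rng.rand * (high - low)
  end

  # Called when we overhear another receiver's request for the same data.
  def request_overheard(data_id)
    @pending.delete(data_id)
  end

  # Returns the ids whose timers have expired; those requests would then
  # be multicast to the whole group.
  def due_requests(now)
    due = @pending.select { |_id, time| time <= now }.keys
    due.each { |id| @pending.delete(id) }
    due
  end
end
```

The response side mirrors this logic: nodes holding the requested data schedule a retransmission timer the same way, and suppress it if they see another node answer first.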

Tuning the Request/Response Repair Timers

It is important that the repair request and repair response timers at different nodes be de-synchronized, to avoid redundant messages. The paper observes that different topologies require different methods for achieving de-synchronization: in a simple “chain” topology, seeding the timer with network distance is sufficient. For a “star” topology, all nodes are the same distance from the data source, so randomization must be used. A combination of these techniques is needed for a tree topology. Rather than requiring the timers to be tuned for each individual network, they propose an adaptive algorithm that uses prior request/response behavior to tune the timers automatically.

Leave a comment

Filed under Paper Summaries