Category Archives: Research Notes

Too Good To Be Believed

In the (excellent) Sinfonia SOSP ’07 paper, the authors compare a group communication system (GCS) built using Sinfonia with the open source Spread GCS. Although I like the Sinfonia paper a lot, I thought this evaluation was actually detrimental to the paper. The authors present several graphs comparing the performance of SinfoniaGCS with Spread, such as this one:

Performance comparison of Spread and SinfoniaGCS

Clearly, SinfoniaGCS vastly outperforms Spread in this configuration. At first glance, this might seem like a great experiment: the authors have demonstrated that you can use Sinfonia to build a high-performance GCS, right?

To me, including this evaluation in the paper is not helpful, because it raises more questions than it answers. There are two possibilities:

  1. Spread’s performance is truly terrible. In that case, what are we to learn from comparing the performance of SinfoniaGCS with a worst-in-class alternative?
  2. Spread is misconfigured. The paper notes that Spread wasn’t configured to use IP broadcast or multicast, and that SinfoniaGCS was allowed to batch together 128 messages at a time; either change could have a huge performance impact. Again, there is little to learn from an apples-to-oranges comparison between a carefully tuned Sinfonia system and a misconfigured Spread system.

A convincing performance study would show that by using the Sinfonia infrastructure, one can build a GCS that approaches the performance of an optimized GCS written from scratch (and perhaps that using Sinfonia leads to a smaller/simpler GCS implementation). Alternatively, if using Sinfonia really does allow dramatically better performance, the reasons for the performance difference should be explored: Sinfonia is not magic, and if the authors had identified the reasons why traditional GCS designs perform poorly on modern datacenter networks, that would have been an interesting result.

Instead, the authors merely speculate that using IP broadcast/multicast would improve Spread performance and leave it at that. Unfortunately, the result is a performance study from which we can learn very little.

Benchmarking Dataflow Graphs in Ruby

Our group at UC Berkeley recently released Bloom, a new programming language for distributed computing. Specifically, we released an initial alpha of “Bud,” which is a Ruby DSL that lets you embed (distributed) declarative rules inside Ruby classes.

The current Bud prototype has a pretty naive evaluation scheme (deliberately), so I’ve been idly thinking about how to improve its performance. The standard way to evaluate Datalog rules is fairly similar to how a typical relational database evaluates queries: the system produces a dataflow graph (“query plan”), which consists of nodes (e.g., project, join, or scan operators) connected by edges. The dataflow graph typically starts with the input relations (e.g., base tables, the EDB for Datalog) and computes the derived relations (e.g., the user’s query in the case of a database, the IDB relations for Datalog). There is a lot more to building an efficient Datalog evaluator, but I’ll leave that for future posts.
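
To make that concrete, here is one purely hypothetical way a single recursive rule could be flattened into a chain of operator descriptions. The representation below is invented for illustration; it is not what Bud actually produces.

    # Hypothetical rule: path(From, To) :- link(From, X), path(X, To)
    # One possible operator chain, from scanning the EDB to inserting into the IDB:
    plan = [
      { :op => :scan,    :rel => :link },              # read base (EDB) tuples
      { :op => :join,    :probe => :path, :key => 1 }, # match link.to against path.from
      { :op => :project, :cols => [0, 3] },            # keep (link.from, path.to)
      { :op => :insert,  :rel => :path }               # derive new (IDB) tuples
    ]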

To get good performance for Bloom programs, we’ll likely want to generate a dataflow graph. So how should we evaluate this graph? Two options come to mind:

  1. Represent nodes in the graph as Ruby objects and push tuples through the graph by evaluating Ruby (a rough sketch of this approach follows the list).
  2. Represent nodes in the graph using C code and connect the Ruby query planner to the C runtime in some fashion (e.g., have the planner generate a description of the desired dataflow and parse that description in a C module, have the Ruby code generate C source code directly, or use something like LLVM to generate machine code at runtime).
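
As a concrete (if simplistic) illustration of option 1, here is roughly what the node objects might look like. The class names and push-based interface are my own invention for this sketch; the actual benchmark code is linked under “Additional Details” below.

    class PredicateNode
      def initialize(field, value, next_node)
        @field, @value, @next_node = field, value, next_node
      end

      # Pass the tuple downstream only if the given field equals the constant.
      def push(tuple)
        @next_node.push(tuple) if tuple[@field] == @value
      end
    end

    class JoinProbeNode
      # Models one half of a symmetric hash join: probe a prebuilt hash table
      # keyed on the join attribute and forward the concatenated tuple on a hit.
      def initialize(field, hash_table, next_node)
        @field, @hash_table, @next_node = field, hash_table, next_node
      end

      def push(tuple)
        match = @hash_table[tuple[@field]]
        @next_node.push(tuple + match) if match
      end
    end

    class SinkNode
      # Models storage into an in-memory relation by appending to an array.
      attr_reader :results

      def initialize
        @results = []
      end

      def push(tuple)
        @results << tuple
      end
    end

Since every node exposes the same push method, the planner can wire up an arbitrary graph just by composing objects; the question is what all of that per-tuple dynamic dispatch costs.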

Naturally the first option would be easier, but how does it perform? I wrote a quick-and-dirty microbenchmark to get a rough idea. The benchmark constructs a dataflow graph of 7 nodes: 4 “predicate” nodes that check whether the tuple matches a constant, 2 “join” nodes that probe a 6,000-element hash table (modeling half of a symmetric hash join), and 1 “sink” node that appends the tuple to an array (modeling storage into an in-memory relation). I ran 1 million tuples through the graph and repeated this 10 times (dropping the first run to allow caches to warm up); a rough reconstruction of this setup appears after the list of implementations below. I ran the benchmark on several Ruby implementations:

  • MRI 1.8.7-p334
  • MRI 1.9.2-p180
  • JRuby 1.6.0
  • Rubinius git checkout from April 11th (version string “1.2.4dev (1.8.7 73390817 yyyy-mm-dd JI)”)
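
For reference, here is a rough reconstruction of that setup. It assumes the hypothetical node classes from the sketch above and only approximates the actual benchmark code linked under “Additional Details.”

    # Rough reconstruction of the harness (assumes PredicateNode, JoinProbeNode,
    # and SinkNode from the earlier sketch).
    require 'benchmark'

    HASH_SIZE  = 6_000
    NUM_TUPLES = 1_000_000

    # Prebuilt 6,000-entry hash table for the join probes to look up.
    hash_table = Hash[(0...HASH_SIZE).map { |i| [i, [i]] }]

    runs = 10.times.map do
      # 7-node graph: 4 predicates -> 2 join probes -> 1 sink.
      sink = SinkNode.new
      node = JoinProbeNode.new(0, hash_table, sink)
      node = JoinProbeNode.new(0, hash_table, node)
      4.times { node = PredicateNode.new(1, :match, node) }

      Benchmark.realtime do
        NUM_TUPLES.times { |i| node.push([i % HASH_SIZE, :match]) }
      end
    end

    runs.shift  # drop the first run so caches can warm up
    printf("average: %.2f seconds\n", runs.reduce(:+) / runs.size)

Building a fresh graph for each run keeps the sink’s array from growing across runs, so each measurement does the same amount of work.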

Results

The average time taken to route 1 million tuples through the dataflow graph, for each Ruby implementation:

I was surprised to see how much faster MRI 1.9 is than MRI 1.8. It is also interesting that JRuby does pretty decently.

So how should we interpret these results? Well, let’s compare the performance of the Ruby implementations with a more-or-less equivalent dataflow microbenchmark written in C:

Conclusions

After adding the C results, we can put the Ruby benchmark results into perspective: although recent Ruby implementations are considerably faster than MRI 1.8, they are still much slower (> 10x) than C for this particular microbenchmark. Is that surprising? Not really — Ruby is certainly not an ideal choice for high-performance, data-intensive computing. Despite recent progress in building faster Ruby implementations, the performance gap (for this microbenchmark) remains considerable.

Obviously, take these results with a grain of salt: many more factors contribute to the performance of a complete system than a trivial microbenchmark like this can capture. There are also factors that are difficult to measure in a small microbenchmark: for example, using a pure Ruby implementation would force the GC to manage the state stored in Bloom collections, which would result in additional performance overhead.

Additional Details

  • Benchmarks performed on a late-2010 MacBook Air (2.13 GHz Core 2), running OS X 10.6.7. The C benchmark was compiled with Apple’s GCC (4.2.1) at optimization level “-O2”, and was (dynamically) linked against APR 1.4.2.
  • Ruby source code
  • C source code
  • Raw benchmark data
