Benchmarking Dataflow Graphs in Ruby

Our group at UC Berkeley recently released Bloom, a new programming language for distributed computing. Specifically, we released an initial alpha of “Bud,” which is a Ruby DSL that lets you embed (distributed) declarative rules inside Ruby classes.

The current Bud prototype has a pretty naive evaluation scheme (deliberately); so I’ve been idly thinking about how to improve its performance. The standard way to evaluate Datalog rules is fairly similar to how a typical relational database evaluates queries: the system produces a dataflow graph (“query plan”), which consists of nodes (e.g., project, join or scan operators) connected by edges. The dataflow graph typically starts with the input relations (e.g., base tables, EDB for Datalog) and computes the derived relations (e.g., the user’s query in the case of a database, the IDB relations for Datalog). There is a lot more to building an efficient Datalog evaluator, but I’ll leave that for future posts.

To get good performance for Bloom programs, we’ll likely want to generate a dataflow graph. So how should we evaluate this graph? Two options come to mind:

  1. Represent nodes in the graph as Ruby objects and push tuples through the graph by evaluating Ruby.
  2. Represent nodes in the graph using C code and connect the Ruby query planner to the C runtime in some fashion (e.g., have the planner generate a description of the desired dataflow and parse that description in a C module, have the Ruby code generate C source code directly, or use something like LLVM to generate machine code at runtime).

Naturally the first option would be easier, but how does it perform? I wrote a quick and dirty microbenchmark to get a rough idea. The benchmark constructs a dataflow graph of 7 nodes: 4 “predicate” nodes that check whether the tuple matches a constant, 2 “join” nodes that probe a 6,000 element hash table (modeling half of a symmetric hash join), and 1 “sink” node that appends the tuple to an array (modeling storage into an in-memory relation). I ran 1 million tuples through the graph, and repeated this 10 times (dropping the first run to allow caches to warmup). I ran the benchmark on several Ruby implementations:

  • MRI 1.8.7-p334
  • MRI 1.9.2-p180
  • JRuby 1.6.0
  • Rubinius git checkout from April 11th (version string “1.2.4dev (1.8.7 73390817 yyyy-mm-dd JI)”)

Results

The average time taken to route 1 million tuples through the dataflow graph, for each Ruby implementation:

I was surprised to see how much faster MRI 1.9 is than MRI 1.8. It is also interesting that JRuby does pretty decently.

So how should we interpret these results? Well, let’s compare the performance of the Ruby implementations with a more-or-less equivalent dataflow microbenchmark written in C:

Conclusions

After adding the C results, we can put the Ruby benchmark results into perspective: although recent Ruby implementations are considerably faster than MRI 1.8, they are still much slower (> 10x) than C for this particular microbenchmark. Is that surprising? Not really — Ruby is certainly not an ideal choice for high-performance, data-intensive computing. Despite recent progress in building faster Ruby implementations, the performance gap (for this microbenchmark) remains considerable.

Obviously, take these results with a grain of salt: many more factors would contribute to the performance of a complete system than a trivial microbenchmark like this. There are also factors that are difficult to measure in a small microbenchmark: for example, using a pure Ruby implementation would force the GC to manage the state stored in Bloom collections, which would result in additional performance overhead.

Additional Details

  • Benchmarks performed on a late 2010 Macbook Air (2.13Ghz Core 2), running OSX 10.6.7. The C benchmark was compiled with Apple’s GCC (4.2.1) at optimization level “-O2”, and was (dynamically) linked against APR 1.4.2.
  • Ruby source code
  • C source code
  • Raw benchmark data
Advertisements

1 Comment

Filed under Research Notes

One response to “Benchmarking Dataflow Graphs in Ruby

  1. This concept is interesting.

    I’ve been curious about doing this to a distributed system, i.e. model the different parts of one’s system (Browser, App server, cache, data store, replication, etc) as nodes, assign computation costs to the nodes, assign network costs to the edges, and try to simulate and estimate the performance of one’s application.

    I’ve heard of Discrete Event System Simulation, but I don’t know if that’s applied to distributed system analysis.

    Are you aware of things like this? One could, of course, write up a simple program to do this, with the costs as configuration, but I’m wondering if there’s more to this problem?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s