Our group at UC Berkeley recently released Bloom, a new programming language for distributed computing. Specifically, we released an initial alpha of “Bud,” which is a Ruby DSL that lets you embed (distributed) declarative rules inside Ruby classes.
The current Bud prototype has a (deliberately) naive evaluation scheme, so I’ve been idly thinking about how to improve its performance. The standard way to evaluate Datalog rules is fairly similar to how a typical relational database evaluates queries: the system produces a dataflow graph (“query plan”), which consists of nodes (e.g., project, join, or scan operators) connected by edges. The dataflow graph typically starts with the input relations (e.g., base tables, the EDB in Datalog terms) and computes the derived relations (e.g., the user’s query in the case of a database, the IDB relations for Datalog). There is a lot more to building an efficient Datalog evaluator, but I’ll leave that for future posts.
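For a flavor of what these rules look like, here is a rough sketch of the classic shortest-paths program in the Bud DSL, loosely following the syntax in the Bud documentation (the exact join syntax may differ slightly in the initial alpha). The recursive join rule is exactly the kind of thing that would be compiled into a dataflow of scan, join, and project nodes:

```ruby
require 'rubygems'
require 'bud'

# Sketch of a Bloom program in the Bud DSL: declarative rules embedded
# in a Ruby class. Syntax approximates the Bud docs; details may differ
# in the alpha release.
class ShortestPaths
  include Bud

  state do
    table :link, [:from, :to, :cost]
    table :path, [:from, :to, :cost]
  end

  bloom :make_paths do
    # Base case: every link is a path.
    path <= link
    # Recursive case: extend an existing path by one more link.
    path <= (link * path).pairs(:to => :from) do |l, p|
      [l.from, p.to, l.cost + p.cost]
    end
  end
end
```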
To get good performance for Bloom programs, we’ll likely want to generate a dataflow graph. So how should we evaluate this graph? Two options come to mind:
- Represent nodes in the graph as Ruby objects and push tuples through the graph by evaluating Ruby (see the sketch after this list).
- Represent nodes in the graph using C code and connect the Ruby query planner to the C runtime in some fashion (e.g., have the planner generate a description of the desired dataflow and parse that description in a C module, have the Ruby code generate C source code directly, or use something like LLVM to generate machine code at runtime).
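To make the first option concrete, here is a minimal sketch of a push-based dataflow in pure Ruby. The class and method names are made up for illustration; this is not Bud’s actual implementation, just the general shape of “nodes as Ruby objects”:

```ruby
# Each dataflow node is a Ruby object with a #push method that processes
# one tuple and hands it to the next node in the graph.
class PredicateNode
  def initialize(field, value, nxt)
    @field, @value, @next = field, value, nxt
  end

  def push(tuple)
    # Forward the tuple only if the given field matches the constant.
    @next.push(tuple) if tuple[@field] == @value
  end
end

class HashJoinNode
  # Models half of a symmetric hash join: probe a prebuilt hash table.
  def initialize(field, table, nxt)
    @field, @table, @next = field, table, nxt
  end

  def push(tuple)
    matches = @table[tuple[@field]]
    return if matches.nil?
    matches.each { |m| @next.push(tuple + m) }
  end
end

class SinkNode
  # Models storage into an in-memory relation.
  attr_reader :results

  def initialize
    @results = []
  end

  def push(tuple)
    @results << tuple
  end
end
```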
Naturally the first option would be easier, but how does it perform? I wrote a quick-and-dirty microbenchmark to get a rough idea. The benchmark constructs a dataflow graph of 7 nodes: 4 “predicate” nodes that check whether the tuple matches a constant, 2 “join” nodes that probe a 6,000-element hash table (modeling half of a symmetric hash join), and 1 “sink” node that appends the tuple to an array (modeling storage into an in-memory relation). I ran 1 million tuples through the graph and repeated this 10 times, dropping the first run to allow caches to warm up; a sketch of the driver appears after the list below. I ran the benchmark on several Ruby implementations:
- MRI 1.8.7-p334
- MRI 1.9.2-p180
- JRuby 1.6.0
- Rubinius git checkout from April 11th (version string “1.2.4dev (1.8.7 73390817 yyyy-mm-dd JI)”)
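In outline, the driver does something like the following. This is a simplified reconstruction that reuses the node classes from the sketch above, not the actual benchmark source (that is linked at the end of the post):

```ruby
require 'benchmark'

NUM_TUPLES = 1_000_000
HT_SIZE    = 6_000

# Wire up 4 predicate nodes, 2 hash-join nodes, and 1 sink node.
def build_graph(sink)
  ht = {}
  HT_SIZE.times { |i| ht[i] = [[i, i.to_s]] }
  node = HashJoinNode.new(1, ht, HashJoinNode.new(2, ht, sink))
  4.times { node = PredicateNode.new(0, "foo", node) }
  node
end

sink  = SinkNode.new
graph = build_graph(sink)
tuple = ["foo", 1, 2]   # passes every predicate and probes both joins

10.times do |run|
  sink.results.clear
  t = Benchmark.realtime do
    NUM_TUPLES.times { graph.push(tuple) }
  end
  # Drop the first run to let caches warm up.
  puts "run #{run}: %.2fs" % t unless run.zero?
end
```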
I was surprised to see how much faster MRI 1.9 is than MRI 1.8. It is also interesting that JRuby does pretty decently.
After adding the C results, we can put the Ruby numbers in perspective: although recent Ruby implementations are considerably faster than MRI 1.8, they are still much slower (> 10x) than C on this particular microbenchmark. Is that surprising? Not really: Ruby is certainly not an ideal choice for high-performance, data-intensive computing, and despite recent progress in building faster Ruby implementations, the gap remains considerable.
Obviously, take these results with a grain of salt: a complete system’s performance depends on many more factors than a trivial microbenchmark like this can capture. Some factors are also hard to measure in a small benchmark: for example, a pure Ruby implementation would force the GC to manage all the state stored in Bloom collections, adding overhead that doesn’t show up here.
- Benchmarks performed on a late 2010 MacBook Air (2.13 GHz Core 2), running OS X 10.6.7. The C benchmark was compiled with Apple’s GCC (4.2.1) at optimization level “-O2”, and was (dynamically) linked against APR 1.4.2.
- Ruby source code
- C source code
- Raw benchmark data