“A Comparison of Approaches to Large-Scale Data Analysis” in SIGMOD 2009 is a followup to Stonebraker and DeWitt’s controversial blog posts (1, 2) comparing MapReduce with parallel databases for analyzing large volumes of data. Unsurprisingly, the paper makes the argument that parallel DBs significantly outperform Hadoop for a broad class of data analysis queries, although Hadoop loads the data much faster and was easier to configure.
Overall, I thought the paper’s analysis was pretty fair, but the authors clearly engage in some trickery to accentuate the performance differences between Hadoop and parallel DBs.
When reporting results, the authors separate load time from query time. Looking closely, Hadoop’s load performance is much faster than the load performance of either of the database systems (about 5x faster than Vertica at scale and 10x faster than DBMS-X). The load times are significant in magnitude, as well: on the UserVisits data set with 100 nodes, Hadoop loads the data in 4250 seconds, while Vertica takes ~21,000 seconds. That leaves plenty of time for Hadoop to execute all the other benchmark queries before Vertica has even finished loading the data! For data sets where the user is only interested in a single analysis run on the data (or perhaps wants to compute an aggregate on the raw data and then apply most analysis to the reduced data set), Hadoop has a massive advantage.
Why the disparity in load times? A major reason is that the DBs transform the input into their native data format and build indexes, whereas Hadoop does not. Because load times are separated from query times, this essentially means that the database systems are given the chance to precompute answers to the queries, while Hadoop is not. Further, the authors chose several queries that benefit significantly from indexes. For example, the “Selection Task” retrieves only 36,000 out of 18 million records at each node (by applying a filter on an indexed column). In the Join Task, a filter is satisfied by only 134,000 of the 155 million records in the UserVisits table (again, both databases happen to have an index on the appropriate column).
Is this unfair? Well, databases can use indexes, while Hadoop currently cannot; it might also be non-trivial to integrate indexes into Hadoop’s model of user-defined map and reduce functions. For these particular queries, it would be easy to workaround this: given the huge disparity in loading times, a Hadoop user could precompute the set of rows that match the indexable predicates and store those rows as separate HDFS files, yielding a massive performance improvement.
For both database systems, the two relations in the Join Task are partitioned on the join key (Rankings.pageURL = UserVisits.destURL). This allows the DBs to compute the join results locally on each node. Because Hadoop has no similar concept builtin, it must exchange much more data between nodes to execute the join. As with indexes, it would be possible for the user to implement partitioning in Hadoop by hand (Apache Hive provides a simple way to do this). It would be interesting to see DB performance for joins on a non-partitioning key, or Hadoop’s performance with manual partitioning.
In the Aggregation Task, Hadoop is configured differently from the DBs. Hadoop uses a reducer per node, whereas the DBs compute per-node partial aggregates that are combined by a single coordinator node. For 100 nodes, that means Hadoop needs to shuffle data between 100×100 nodes, whereas the DBs need to only move 100 partial aggs to a single node. However, the Hadoop approach would likely scale better if there were a vast number of distinct groups (too many for the DB’s coordinator node to hold in memory). The paper should probably have examined Hadoop performance with a much smaller number of reducers for this query.
The authors probably should have been more forthright about these differences between the two systems.
If we consider these tricks, the paper’s conclusion becomes more nuanced. Rather than describing the massive performance advantage that parallel DBs have over Hadoop, the argument is more about ease of use: DBs provide features like indexes, materialized views and partitioning that make it easier to get good performance. Do you really want to implement these features by hand, for each analysis task? Probably not.
That said, I think there’s value in this paper for the Hadoop community: each of the techniques described above is worth investigating to improve Hadoop performance.