“BotGraph: Large Scale Spamming Botnet Detection”

Botnets are used for various nefarious ends; one popular use is sending spam email by creating and then using accounts on free webmail providers like Hotmail and Google Mail. In the past, CAPTCHAs have been used to try to prevent this, but they are increasingly ineffective. Hence, the BotGraph paper proposes an algorithm for detecting bot-created accounts by analyzing user access behavior. The authors describe the algorithm and its implementation on Dryad, and present experimental results from real-world Hotmail access logs.

Algorithm

BotGraph employs three different ideas for detecting automated users:

  1. They regard sudden spikes in the number of accounts created by a single IP as suspicious. Hence, they use a simple exponentially-weighted moving average (EWMA) to detect such spikes, and throttle/rate-limit account signups from suspicious IPs. This has the effect of making it more difficult for spammers to obtain webmail accounts.
  2. They argue that the number of bot machines will be much smaller than the number of bot-created webmail accounts; hence, one bot machine will access a large number of accounts. They also argue that a single bot-created webmail account will be accessed from multiple bots on different autonomous systems (ASs), due to churn in the botnet (although this seems pretty unconvincing to me), and the fact that rate-limiting makes it more difficult to create large numbers of bot accounts. Hence, they look for pairs of user accounts that had logins from an overlapping set of ASs.
  3. Finally, they consider a user’s email-sending behavior:

    Normal users usually send a small number of emails per day on average, with different email sizes. On the other hand, bot-users usually send many emails per day, with identical or similar email sizes

    Hence, they regard users who send 3+ emails per day as “suspicious”; they also flag users whose email-size distributions differ from those of most other users.

They use feature #1 primarily to rate-throttle new account creations. Feature #3 is used to avoid false positives.
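
As a purely hypothetical sketch of feature #1, one way to implement EWMA-based spike detection is to keep a smoothed signup rate per IP and flag an IP whenever today’s count far exceeds that estimate. The smoothing factor and spike threshold below are illustrative, not the paper’s:

```python
# Hypothetical sketch of EWMA-based signup throttling (feature #1).
# The smoothing factor and spike threshold are illustrative, not from the paper.
from collections import defaultdict

ALPHA = 0.3         # EWMA smoothing factor (assumed)
SPIKE_FACTOR = 5.0  # flag if today's signups exceed 5x the smoothed rate (assumed)

ewma = defaultdict(float)  # IP -> smoothed signups per day

def record_signups(ip, signups_today):
    """Update the EWMA for this IP and report whether it looks like a spike."""
    prev = ewma[ip]
    is_spike = prev > 0 and signups_today > SPIKE_FACTOR * prev
    ewma[ip] = ALPHA * signups_today + (1 - ALPHA) * prev
    return is_spike

# Example: a bot-controlled IP suddenly creates many accounts on day 3.
for day, count in enumerate([2, 3, 2, 40]):
    if record_signups("10.0.0.1", count):
        print(f"day {day}: suspicious spike, rate-limit signups from this IP")
```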

Feature #2 is the primary focus of the paper. They construct a user-user graph with a vertex for each user account. Each edge has a weight giving the number of shared login ASs, i.e. the number of ASs that were used to log in to both accounts. Within the user-user graph, they look for connected components whose edge weights are above a threshold T: they begin by finding components with T=2, and then iteratively increase the threshold until each component has no more than 100 members.
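
A rough sketch of that pruning loop, as I read the description above (not the paper’s code), using networkx for the component computation: drop edges below the current threshold, compute connected components, and recurse with a higher threshold into any component that is still too large.

```python
# Sketch of the iterative pruning of the user-user graph (my interpretation).
import networkx as nx

def extract_components(edges, threshold=2, max_size=100):
    """edges: {(i, j): weight} over user accounts. Returns groups of <= max_size
    accounts that stay connected once weak edges are pruned."""
    g = nx.Graph()
    for (i, j), w in edges.items():
        if w >= threshold:              # keep only edges with enough shared ASs
            g.add_edge(i, j, weight=w)
    groups = []
    for comp in nx.connected_components(g):
        if len(comp) <= max_size:
            groups.append(comp)
        else:
            # Too large: raise the threshold and split this component further.
            sub = {(i, j): w for (i, j), w in edges.items()
                   if i in comp and j in comp}
            groups.extend(extract_components(sub, threshold + 1, max_size))
    return groups
```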

Implementation

They describe two ways to implement the construction of the user-user graph using a data-parallel system like MapReduce or Dryad, using the login log from Hotmail (~220GB for one month of data):

  1. Partition the login records by client IP. For each pair of accounts i and j that shared a login on the same day from AS k, emit an intermediate record (i, j, k). In the reduce phase, group on (i, j) and count the shared ASs to get the edge weight (a sketch follows this list). The problem with this approach is that it requires a lot of communication: most edges in the user-user graph have weight 1, and hence can be dropped, but this approach still requires sending them over the network.
  2. Partition the login records by user name. For each partition, compute a “summary” of the IP-day keys present for users in that partition (the paper doesn’t specify the nature of the summary, but presumably it is analogous to a Bloom filter). Each partition sends its summary to every other partition. Using the summaries, each partition can exchange login records with other partitions in a way that allows edge weights to be computed, but doesn’t require sending weight 1 edges over the network.
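
Here is a minimal sketch of the first approach above. The record format and field names are my guesses; the real Dryad job is more involved.

```python
# Sketch of approach 1: naive edge-weight computation in map/reduce form.
from collections import defaultdict
from itertools import combinations

def map_over_ip_partition(login_records):
    """login_records: (client_ip, day, as_number, account) tuples for one
    IP-based partition. Emits an intermediate record per account pair that
    shares a login IP on the same day."""
    accounts_per_key = defaultdict(set)
    for ip, day, asn, account in login_records:
        accounts_per_key[(ip, day, asn)].add(account)
    for (ip, day, asn), accounts in accounts_per_key.items():
        for i, j in combinations(sorted(accounts), 2):
            yield (i, j), asn                      # the (i, j, k) record

def reduce_edge_weights(intermediate):
    """Group on (i, j) and count distinct ASs; weight-1 edges are dropped,
    but only after they have already crossed the network."""
    shared = defaultdict(set)
    for (i, j), asn in intermediate:
        shared[(i, j)].add(asn)
    return {pair: len(asns) for pair, asns in shared.items() if len(asns) >= 2}
```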

They argue that the second method can’t be implemented with Map and Reduce, although I’m not sure if I believe them: multicasting can be done by writing to HDFS, as can shipping data between logical partitions.
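
For what it’s worth, here is one way I could imagine expressing the selective-exchange idea in two MapReduce-style passes (purely a sketch of my argument, not the paper’s Dryad implementation): a first job writes each partition’s summary of IP-day keys to the distributed file system, and a second job reads all the summaries as side input and ships only the records that can contribute to a cross-partition edge.

```python
# Sketch of approach 2 expressed as two MapReduce-like passes (my illustration).

def build_summary(partition_id, records):
    """Pass 1: summarize the (IP, day) keys seen in one user-name partition.
    A plain set stands in for whatever compact summary the paper uses."""
    return partition_id, {(ip, day) for ip, day, asn, account in records}

def select_records_to_ship(partition_id, records, all_summaries):
    """Pass 2: each partition reads every summary (the multicast step, e.g.
    summaries written as files on HDFS) and ships only the records whose
    (IP, day) key also appears in some other partition -- the only records
    that can contribute to an edge involving a remote account."""
    remote_keys = set()
    for pid, summary in all_summaries:
        if pid != partition_id:
            remote_keys |= summary
    return [r for r in records if (r[0], r[1]) in remote_keys]
```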

Discussion

I think the major problem with their experimental results is that there’s effectively no adversary: botnet operators presumably weren’t aware of this technique when the experiments were performed, so they hadn’t adapted their tactics. Adapting might actually be quite easy.

For example, it seems like it would be quite easy to defeat their EWMA-based throttling by gradually increasing the number of signups per unit time. Essentially, the bot machine acts like an HTTP proxy with a gradually growing user population. One can imagine such a bot even mimicking the traffic patterns exhibited by a real-world proxy (e.g. increase at 9AM, decrease at 5PM). Certainly a simple EWMA seems too primitive to defeat a dedicated adversary.

Similarly, it also seems quite easy to avoid sharing a single webmail account among multiple bots: simply assign each webmail account to a single bot machine, and don’t reuse webmail accounts if that machine becomes inaccessible. The idea, again, is to simulate an HTTP proxy that accesses a large number of webmail accounts. The paper’s argument that “churn” requires reuse of webmail accounts “to maximize bot-account utilization” is unconvincing and unsubstantiated. Since this sharing is the entire principle upon which their technique is based, I’d be quite concerned that a relatively simple adaptation on the part of botnet operators would render the analysis ineffective.

I thought the paper’s wide-eyed tone toward using MapReduce-style systems for graph algorithms was annoying. Lots of people do large-scale graph algorithms using MapReduce-style systems; in fact, that’s one of the main things MapReduce was originally designed for (e.g. computing PageRank). The paper is not novel in this respect, and I was surprised that they didn’t cite one of the many prior papers on this subject.

MapReduce vs. Parallel DBs

“A Comparison of Approaches to Large-Scale Data Analysis” in SIGMOD 2009 is a follow-up to Stonebraker and DeWitt’s controversial blog posts (1, 2) comparing MapReduce with parallel databases for analyzing large volumes of data. Unsurprisingly, the paper argues that parallel DBs significantly outperform Hadoop for a broad class of data analysis queries, although Hadoop loads the data much faster and was easier to configure.

Overall, I thought the paper’s analysis was pretty fair, but the authors clearly engage in some trickery to accentuate the performance differences between Hadoop and parallel DBs.

Loading Time

When reporting results, the authors separate load time from query time. Looking closely, Hadoop loads data much faster than either of the database systems (about 5x faster than Vertica at scale and 10x faster than DBMS-X). The load times are also substantial: on the UserVisits data set with 100 nodes, Hadoop loads the data in 4250 seconds, while Vertica takes ~21,000 seconds. That leaves plenty of time for Hadoop to execute all the other benchmark queries before Vertica has even finished loading the data! For data sets where the user is only interested in a single analysis run (or perhaps wants to compute an aggregate on the raw data and then run most of the analysis on the reduced data set), Hadoop has a massive advantage.

Indexes

Why the disparity in load times? A major reason is that the DBs transform the input into their native data format and build indexes, whereas Hadoop does not. Because load times are separated from query times, this essentially means that the database systems are given the chance to precompute answers to the queries, while Hadoop is not. Further, the authors chose several queries that benefit significantly from indexes. For example, the “Selection Task” retrieves only 36,000 out of 18 million records at each node (by applying a filter on an indexed column). In the Join Task, a filter is satisfied by only 134,000 of the 155 million records in the UserVisits table (again, both databases happen to have an index on the appropriate column).

Is this unfair? Well, databases can use indexes, while Hadoop currently cannot; it might also be non-trivial to integrate indexes into Hadoop’s model of user-defined map and reduce functions. For these particular queries, though, it would be easy to work around this: given the huge disparity in loading times, a Hadoop user could precompute the set of rows that match the indexable predicates and store those rows as separate HDFS files, yielding a massive performance improvement.
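
Since “loading” in Hadoop is just copying files into HDFS, one cheap way to do this is a filter-only streaming job run once at load time. A sketch follows; the “|” delimiter, column order, and cutoff value are my assumptions, not taken from the paper.

```python
#!/usr/bin/env python3
# Hypothetical Hadoop Streaming mapper that materializes the rows matching the
# Selection Task's predicate into their own HDFS output, so later runs scan only
# the small result. Delimiter, column order, and cutoff are assumptions.
import sys

PAGE_RANK_CUTOFF = 10  # illustrative threshold on the indexed column

for line in sys.stdin:
    page_url, page_rank, avg_duration = line.rstrip("\n").split("|")
    if int(page_rank) > PAGE_RANK_CUTOFF:
        sys.stdout.write(line)  # identity output: keep only the matching rows
```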

Partitioning

For both database systems, the two relations in the Join Task are partitioned on the join key (Rankings.pageURL = UserVisits.destURL). This allows the DBs to compute the join results locally on each node. Because Hadoop has no similar concept built in, it must exchange much more data between nodes to execute the join. As with indexes, it would be possible for the user to implement partitioning in Hadoop by hand (Apache Hive provides a simple way to do this). It would be interesting to see DB performance for joins on a non-partitioning key, or Hadoop’s performance with manual partitioning.
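
Hand-rolled co-partitioning is easy to sketch (illustrative only; the record layout is assumed, and in practice one would use Hive’s bucketing or a custom Partitioner): hash both tables on the join key into the same buckets at load time, so bucket i of Rankings only ever joins with bucket i of UserVisits, entirely on one node.

```python
# Illustrative co-partitioning on the join key (Rankings.pageURL = UserVisits.destURL).
import zlib

NUM_BUCKETS = 100  # e.g. one bucket per node (illustrative)

def bucket_of(join_key: str) -> int:
    """Stable hash partitioning; both tables must use the same function."""
    return zlib.crc32(join_key.encode("utf-8")) % NUM_BUCKETS

def local_join(rankings_bucket, uservisits_bucket):
    """Hash-join two co-partitioned buckets without any data exchange."""
    by_url = {r["pageURL"]: r for r in rankings_bucket}
    for visit in uservisits_bucket:
        ranking = by_url.get(visit["destURL"])
        if ranking is not None:
            yield {**ranking, **visit}
```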

Aggregation Architecture

In the Aggregation Task, Hadoop is configured differently from the DBs. Hadoop uses a reducer per node, whereas the DBs compute per-node partial aggregates that are combined by a single coordinator node. For 100 nodes, that means Hadoop shuffles data across 100×100 node pairs, whereas the DBs only move 100 partial aggregates to a single node. However, the Hadoop approach would likely scale better if there were a vast number of distinct groups (too many for the DB’s coordinator node to hold in memory). The paper should probably have examined Hadoop performance with a much smaller number of reducers for this query.
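
In Hadoop terms, the DBs’ plan is roughly a combiner plus a single reducer (or a handful of them). A rough sketch of that two-level aggregation, assuming the Aggregation Task’s SUM(adRevenue) GROUP BY sourceIP shape:

```python
# Sketch of two-level aggregation: per-node partial sums followed by one merge.
from collections import Counter

def combine_on_node(records):
    """Per-node partial aggregate: SUM(adRevenue) GROUP BY sourceIP."""
    partial = Counter()
    for source_ip, ad_revenue in records:
        partial[source_ip] += ad_revenue
    return partial

def merge_partials(partials):
    """Final step at a single reducer: merge one small Counter per node,
    instead of shuffling every record across all node pairs."""
    total = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds the counts
    return total
```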

Conclusion

The authors probably should have been more forthright about these differences between the two systems.

Taking these tricks into account, the paper’s conclusion becomes more nuanced. Rather than demonstrating a massive performance advantage for parallel DBs over Hadoop, the argument is really about ease of use: DBs provide features like indexes, materialized views and partitioning that make it easier to get good performance. Do you really want to implement these features by hand, for each analysis task? Probably not.

That said, I think there’s value in this paper for the Hadoop community: each of the techniques described above is worth investigating to improve Hadoop performance.
