Botnets are used for various nefarious ends; one popular use is sending spam email by creating and then using accounts on free webmail providers like Hotmail and Google Mail. In the past, CAPTCHAs have been used to try to prevent this, but they are increasingly ineffective. Hence, the BotGraph paper proposes an algorithm for detecting bot-created accounts by analyzing user access behavior. They describe the algorithm, its implementation with Dryad, and present experimental results from real-world Hotmail access logs.
BotGraph employs three different ideas for detecting automated users:
- They regard sudden spikes in the number of accounts created by a single IP as suspicious. Hence, they use a simple exponentially-weighted moving average (EWMA) to detect such spikes, and throttle/rate-limit account signups from suspicious IPs. This has the effect of making it more difficult for spammers to obtain webmail accounts.
- They argue that the number of bot machines will be much smaller than the number of bot-created webmail accounts; hence, one bot machine will access a large number of accounts. They also argue that a single bot-created webmail account will be accessed from multiple bots on different autonomous systems (ASs), due to churn in the botnet (although this seems pretty unconvincing to me), and the fact that rate-limiting makes it more difficult to create large numbers of bot accounts. Hence, they look for pairs of user accounts that had logins from an overlapping set of ASs.
- Finally, they consider a user’s email-sending behavior:
Normal users usually send a small number of emails per day on average, with different email sizes. On the other hand, bot-users usually send many emails per day, with identical or similar email sizes
Hence, they regard users who send 3+ emails per day as “suspicious”; they also regard as suspicious users whose email-size distributions are dissimilar from most other users.
They use feature #1 primarily to rate-throttle new account creations. Feature #3 is used to avoid false positives.
Feature #2 is the primary focus of the paper. They construct a user-user graph with a vertex for each user account. Each edge has a weight that gives the number of shared login ASs — that is, the number of ASs that were used to login to both accounts. Within the user-user graph, they look for connected components with an edge weight over a threshold T: they begin by finding components with T=2, and then iteratively increasingly the threshold until each component has no more than 100 members.
They describe two ways to implement the construction of the user-user graph using a data-parallel system like MapReduce or Dryad, using the login log from Hotmail (~220GB for one month of data):
- Partition the login records by client IP. Emit an intermediate record (i, j, k) for each shared login on the same day from AS k to accounts i and j. In the reduce phase, group on (i, j) and sum. The problem with this approach is that it requires a lot of communication: most edges in the user-user graph have weight 1, and hence can be dropped, but this approach still requires sending them over the network.
- Partition the login records by user name. For each partition, compute a “summary” of the IP-day keys present for users in that partition (the paper doesn’t specify the nature of the summary, but presumably it is analogous to a Bloom filter). Each partition sends its summary to every other partition. Using the summaries, each partition can exchange login records with other partitions in a way that allows edge weights to be computed, but doesn’t require sending weight 1 edges over the network.
They argue that the second method can’t be implemented with Map and Reduce, although I’m not sure if I believe them: multicasting can be done by writing to HDFS, as can shipping data between logical partitions.
I think the major problem with their experimental results is that there’s effectively no adversary: botnet operators presumably weren’t aware of this technique when the experiments were performed. Hence, they haven’t adapted their tactics — which might actually be quite easy to do.
For example, it seems like it would be quite easy to defeat their EWMA-based throttling by simply increasing the number of signups/time gradually. Essentially, the bot machine acts like an HTTP proxy with a gradually-increasing user population. One can imagine such a bot even mimicking the traffic patterns exhibited by a real-world proxy (e.g. increase at 9AM, decrease at 5PM). Certainly using a simple EWMA seems too primitive to defeat a dedicated adversary.
Similarly, it also seems quite easy to avoid sharing a single webmail account among multiple botnets: simply assign a single webmail account to a single bot machine, and don’t reuse webmail accounts if the bot machine becomes inaccessible. The idea, again, is to simulate an HTTP proxy that accesses a large number of webmail accounts. The paper’s argument that “churn” requires reuse of webmail accounts “to maximize bot-account utilization” is unconvincing and unsubstantiated. Since this is the entire principle upon which their technique is based, I’d be quite concerned that a relatively simple adaptation on the part of botnet operators would make this analysis ineffective.
I thought the paper’s wide-eyed tone toward using MapReduce-style systems for graph algorithms was annoying. Lots of people do large-scale graph algorithms using MapReduce-style systems; in fact, that’s one of the main things MapReduce was originally designed for (e.g. computing PageRank). The paper is not novel in this respect, and I was surprised that they didn’t cite one of the many prior papers on this subject.