This paper presents an empirical study of BGP misconfigurations, based on sampling the BGP traffic seen by 19 different ASs. The paper focuses on two types of misconfigurations:
- origin misconfigurations occur when a mistaken route origin is inserted (which might lead to directing traffic to the wrong AS), and
- export misconfigurations occur when a router mistakenly exports a route (which might lead to directing more traffic to an AS than intended).
The authors approximate misconfigurations by looking for BGP updates that are revoked quickly (typically because an operator discovers the error). To differentiate between misconfigurations and short-lived legitimate changes (e.g., for load balancing), the authors conducted an email survey of the network operators involved. To detect export misconfigurations, they applied a heuristic (Gao’s algorithm) to infer the relationships between ASs; once inter-AS relationships are known, an export that violates the expected policy (for example, a route learned from one provider being passed on to another provider) can be flagged directly.
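To make the export check concrete, here is a minimal sketch of how a path could be tested once relationships are known. It is not the authors' code and it does not implement Gao's algorithm: the relationship table is assumed to be given, the AS numbers are made-up private-range values, and the helper names (`add_provider_link`, `find_export_violation`) are hypothetical. The rule it encodes is the usual one: a route learned from a provider or a peer should only be exported to customers.

```python
from typing import Dict, List, Optional, Tuple

# rel[(a, b)] is the role AS b plays for AS a: "customer", "peer" or "provider".
# In the paper these relationships are inferred; here they are simply assumed.
Rel = Dict[Tuple[int, int], str]


def add_provider_link(rel: Rel, provider: int, customer: int) -> None:
    """Record a provider/customer relationship in both directions."""
    rel[(provider, customer)] = "customer"
    rel[(customer, provider)] = "provider"


def add_peering(rel: Rel, a: int, b: int) -> None:
    """Record a peer/peer relationship in both directions."""
    rel[(a, b)] = "peer"
    rel[(b, a)] = "peer"


def find_export_violation(as_path: List[int], rel: Rel) -> Optional[int]:
    """Return the first AS that appears to export a route against policy.

    as_path is in AS_PATH order (most recent AS first, origin last) and is
    assumed to include the AS that announced the route to the observer.
    Returns None if no violation is found among the known relationships.
    """
    hops = list(reversed(as_path))  # order in which the route propagated
    for prev_as, this_as, next_as in zip(hops, hops[1:], hops[2:]):
        learned_from = rel.get((this_as, prev_as))
        exported_to = rel.get((this_as, next_as))
        if learned_from is None or exported_to is None:
            continue  # relationship unknown: cannot judge this hop
        if learned_from in ("provider", "peer") and exported_to in ("provider", "peer"):
            return this_as
    return None


# Hypothetical topology: 64600 buys transit from both 64512 and 64513,
# and 64700 is a customer of 64512.
rel: Rel = {}
add_provider_link(rel, 64512, 64600)
add_provider_link(rel, 64513, 64600)
add_provider_link(rel, 64512, 64700)

leaked_path = [64513, 64600, 64512, 64700]  # 64600 passed a provider route to its other provider
normal_path = [64600, 64512, 64700]         # 64512 passed a customer route to a customer
```

With this table, `find_export_violation(leaked_path, rel)` identifies 64600 as the AS that leaked the route, while `normal_path` passes. Hops with unknown relationships are skipped rather than flagged, which keeps the check conservative, in the same spirit as the paper's analysis.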
The authors document a significant rate of misconfiguration: even though their analysis of misconfigurations is conservative, they found that almost 75% of new route announcements are the result of misconfiguration. Despite the frequent occurrence of misconfigurations, the authors found that these problems had relatively little impact on connectivity: for example, they found that about 25 BGP misconfigurations per day resulted in connectivity loss, compared with 1000 instances per day of connectivity loss due to failures. I think this is unsurprising: the problem with BGP misconfiguration is not that it results in routine connectivity loss, but rather that it occasionally causes catastrophic global connectivity problems (e.g., the 2008 incident in which Pakistan Telecom hijacked YouTube’s address space).
When the paper mentions that characterizing a BGP misconfiguration is hard to do precisely, one wonders whether it is possible to differentiate automatically between valid and invalid route updates (e.g., via machine learning). Would it be possible to train a model to predict the likelihood that a proposed route update is invalid, and to estimate the “risk” to network connectivity that would be incurred by applying the update? At the very least, such a tool would help operators identify the possible cause of a routing failure (that is, a recent high-risk route update that was applied); routers might even require operator intervention before agreeing to apply high-risk updates. Some searching reveals papers by Wu and Feng and by Li et al. that apply ML to detecting abnormal or high-risk BGP events.
I liked how the paper made a number of sensible suggestions to reduce the rate of BGP misconfigurations. It is remarkable that many of their suggestions don’t require significant technological changes: by tweaking the user interfaces used by network operators, the frequency of “slips” could be significantly reduced. However, there is a clear difference in rigor between the two halves of the paper: while the authors diagnose misconfigurations through careful analysis, their work to classify the causes of those misconfigurations seemed a little lazy in comparison: they emailed network operators, and then (1) trusted their answers completely and (2) generalized the results of the email survey to operators who didn’t reply. The work could be improved by studying the human factors that contribute to misconfigurations more carefully.
Also, I wonder whether the same factors that contribute to routine BGP misconfigurations also contribute to the occasional catastrophic connectivity failures. The methodology here reminds me a bit of studying minor operator errors at a nuclear reactor and assuming that preventing minor errors will also help prevent catastrophic meltdowns. That is probably mostly true, but this work would be well complemented by an empirical study of some notable recent examples of global routing failures.