This paper describes NetMedic, a system for automatically diagnosing the source of network failures. It is distinguished from prior work on automatic error diagnosis by its emphasis on detailed assignment of blame: isolating the individual process or configuration parameter that causes the problem, rather than just the faulty machine. The paper focuses on small enterprise networks; in such a setting the faulty machine is usually clear, so a more precise diagnosis is necessary to be of any practical value.
The authors base their work on a survey of defect tickets describing network failures. They found that many errors are application-specific: for example, error codes or incorrect behavior that arise only when a particular operation is attempted. Thus, a naive approach to detailed failure diagnosis would require hard-coding knowledge of application semantics into the diagnosis tool. This is impractical, so instead NetMedic attempts to infer the relationships between network components by examining their joint behavior in the past.
Their technique proceeds as follows:
- A network is modeled as a set of components (e.g. machines, processes, network paths, etc.). Each component has a set of variables that define its current state, and a set of unidirectional dependencies on other components. NetMedic constructs the dependency graph automatically. An important point is that NetMedic makes no assumptions about the semantics of these variables (it is “application agnostic”).
- NetMedic records the state of each component over time. It infers correlations between the states of two components in the past: e.g. when C1_x was “abnormal” in the past, C2_y was also abnormal.
- The strengths of these inter-variable correlations are used to label the edges of the dependency graph, and hence to estimate the likelihood that one component influences the state of another component. Looking at this weighted graph allows us to infer which component is causing the undesirable behavior/failure.
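The three steps above can be illustrated with a small sketch. This is not NetMedic's actual algorithm: the z-score test for "abnormal", the conditional co-occurrence used as an edge weight, and all names (`abnormal_flags`, `edge_weight`, `rank_causes`, the component names, and the sample data) are assumptions made here for illustration only.

```python
from statistics import mean, pstdev


def abnormal_flags(series, threshold=2.0):
    """Flag samples whose z-score against the series' own history exceeds threshold."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [False] * len(series)
    return [abs(x - mu) / sigma > threshold for x in series]


def edge_weight(src_flags, dst_flags):
    """Fraction of time steps with src abnormal in which dst was abnormal too."""
    dst_when_src_abnormal = [d for s, d in zip(src_flags, dst_flags) if s]
    if not dst_when_src_abnormal:
        return 0.0
    return sum(dst_when_src_abnormal) / len(dst_when_src_abnormal)


def rank_causes(history, dependencies, victim):
    """Rank the components that victim depends on by their edge weight."""
    flags = {name: abnormal_flags(series) for name, series in history.items()}
    scores = {
        src: edge_weight(flags[src], flags[victim])
        for src in dependencies.get(victim, [])
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def spikes(indices, length=20):
    """Hypothetical state variable: flat at 1.0 with spikes to 10.0."""
    return [10.0 if i in indices else 1.0 for i in range(length)]


# Hypothetical recorded history, one state variable per component.
history = {
    "web_app": spikes({5, 15}),
    "database": spikes({5, 15}),  # abnormal whenever web_app is
    "dns": spikes({5, 10}),       # abnormal alongside web_app only once
}
# Unidirectional dependencies: web_app depends on database and dns.
dependencies = {"web_app": ["database", "dns"]}

ranking = rank_causes(history, dependencies, "web_app")
print(ranking)
```

Note that the sketch never interprets what the variables mean; it only compares each variable with its own history, which is the sense in which the approach is application agnostic.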
A typical problem with this kind of analysis is confusing correlation with causation: the fact that C1_x and C2_y were both abnormal at around the same time is not enough evidence to conclude that C1_x caused the abnormality of C2_y, or vice versa. In fact, both behaviors might be caused by a third variable that NetMedic does not observe. The paper does not resolve this objection in principle; it argues only on empirical grounds that the technique usually works: “we find that the assumption [that correlation implies causation] holds frequently enough in practice to facilitate effective diagnosis.”
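A toy example makes the confounding concern concrete. The data and the co-occurrence score below are hypothetical, not taken from the paper: an unobserved switch fault drives both components, and the score is equally strong in both directions, so the history alone cannot say which component (if either) is the cause.

```python
def cooccurrence(a, b):
    """Fraction of time steps with a abnormal in which b was also abnormal."""
    hits = [y for x, y in zip(a, b) if x]
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical hidden cause that the diagnosis tool never observes.
hidden_switch_faulty = [False, True, False, False, True, False, True, False]

# Both observed components are abnormal exactly when the switch is faulty.
c1_abnormal = list(hidden_switch_faulty)
c2_abnormal = list(hidden_switch_faulty)

forward = cooccurrence(c1_abnormal, c2_abnormal)   # "C1 influences C2"
backward = cooccurrence(c2_abnormal, c1_abnormal)  # "C2 influences C1"
print(forward, backward)
```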
Another problem with this approach is that it cannot handle situations that have not occurred in the past. If our goal is ultimately to build reliable systems, a technique that presupposes a large database of prior failures seems unsatisfactory. More importantly, this kind of diagnosis cannot account for catastrophic, unlikely events, and it is in precisely those situations that system support for failure diagnosis might be the most helpful.
Finally, it seems to me that the proposed technique wouldn’t handle complex relationships between components. For example, suppose that a component C fails catastrophically if its load rises above a certain point (say, 3 clients connected concurrently). If there are 1000 possible clients, then for any given prior failure, the state of C will not appear dependent on the vast majority of those client components. Essentially, the relationship between the clients and C might be more complex than can be accounted for by NetMedic.
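The load example can be made concrete with a small hypothetical history. The numbers below are invented for illustration and say nothing about how NetMedic itself would score this case: each failure involves a different trio of clients, so no single client stands out, even though an aggregate variable (the number of connected clients) separates failures from normal operation perfectly.

```python
N_CLIENTS = 1000
LIMIT = 3  # C fails once this many clients are connected concurrently


def component_failed(connected):
    """C fails purely as a function of how many clients are connected."""
    return len(connected) >= LIMIT


# Hypothetical history: each entry is the set of clients connected at one step.
history = []
for k in range(5):
    history.append({3 * k, 3 * k + 1, 3 * k + 2})  # three clients: C fails
    history.append({3 * k})                        # one client: C is fine
    history.append({3 * k + 1, 500 + k})           # two clients: C is fine

failures = [step for step in history if component_failed(step)]
normals = [step for step in history if not component_failed(step)]


def failure_share(client):
    """Fraction of observed failures during which this client was connected."""
    return sum(client in step for step in failures) / len(failures)


shares = [failure_share(c) for c in range(N_CLIENTS)]
never_implicated = sum(s == 0 for s in shares)
max_share = max(shares)

# Per-client view: almost no client is ever present during a failure.
print(never_implicated, max_share)
# Aggregate view: load alone separates failures from normal steps.
print(min(len(s) for s in failures), max(len(s) for s in normals))
```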