

“X-Trace: A Pervasive Network Tracing Framework” presents a tool for understanding the behavior of distributed systems composed of layers of protocols. Traditional logging and diagnostic tools operate at a single layer in the stack, for example by tracing the flow of HTTP or TCP traffic in a network. This is insufficient for understanding many realistic failure scenarios, because application traffic typically traverses many different layers and protocols. When a client initiates an HTTP request, for example, the following might happen:

  1. A DNS lookup is performed (via UDP, perhaps requiring recursive resolution)
  2. A TCP connection is established to the remote server, which requires transmitting IP packets and numerous Ethernet frames across different network links.
  3. The remote server handles the HTTP request, typically by running various application code and contacting multiple databases (e.g. a single Google search request is distributed to ~1000 machines)
  4. The HTTP response is returned to the client; the contents of the response may prompt the client to issue subsequent requests (e.g. additional HTTP requests to fetch resources like images and external CSS)

A failure at any point in this sequence can cause the original action to fail. Hence, diagnosis tools that operate at a single layer of the stack cannot provide a complete picture of the operation of a distributed system.


X-Trace works by tagging all the operations associated with a single high-level task with the same task ID. By modifying multiple layers of the protocol stack to record and propagate task IDs, all the low-level operations associated with a high-level task can be reconstructed. X-Trace also allows developers to annotate causal relationships, so that a “task tree” of related operations can be constructed.
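To make the task ID and task tree idea concrete, here is a minimal sketch of how I picture it. This is not X-Trace's actual metadata format or API: the names `TraceMetadata`, `start_task`, `propagate`, and `build_task_tree` are hypothetical, and operation IDs are just a process-local counter.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional, Tuple

_op_counter = count(1)  # hypothetical source of operation IDs


@dataclass(frozen=True)
class TraceMetadata:
    task_id: str                 # shared by every operation in one task
    op_id: int                   # identifies this operation
    parent_op_id: Optional[int]  # causal parent; None for the root


def start_task(task_id: str) -> TraceMetadata:
    """Create metadata for the root operation of a new task."""
    return TraceMetadata(task_id, next(_op_counter), None)


def propagate(parent: TraceMetadata) -> TraceMetadata:
    """Derive metadata for an operation caused by `parent`."""
    return TraceMetadata(parent.task_id, next(_op_counter), parent.op_id)


def build_task_tree(reports: List[Tuple[TraceMetadata, str]],
                    task_id: str) -> List[str]:
    """Rebuild one task's tree from unordered reports, as indented lines."""
    labels: Dict[int, str] = {}
    children: Dict[int, List[int]] = defaultdict(list)
    root: Optional[int] = None
    for md, label in reports:
        if md.task_id != task_id:
            continue  # report belongs to some other task
        labels[md.op_id] = label
        if md.parent_op_id is None:
            root = md.op_id
        else:
            children[md.parent_op_id].append(md.op_id)

    def render(op_id: int, depth: int) -> List[str]:
        lines = ["  " * depth + labels[op_id]]
        for child in children[op_id]:
            lines.extend(render(child, depth + 1))
        return lines

    return render(root, 0) if root is not None else []


# An HTTP request that causes a DNS lookup and a TCP connection.
http = start_task("task-A")
dns = propagate(http)
tcp = propagate(http)
ip = propagate(tcp)
other = start_task("task-B")

# Reports arrive in arbitrary order, mixed with another task's reports.
reports = [(ip, "IP packets"), (http, "HTTP GET"), (other, "unrelated"),
           (dns, "DNS lookup"), (tcp, "TCP connect")]
tree = build_task_tree(reports, "task-A")
print("\n".join(tree))
```

The point of the sketch is that each layer only needs to copy the task ID and record its parent; the tree itself is assembled later from the collected reports.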

X-Trace metadata must be manually embedded into protocols by developers; protocols typically provide “extension”, “option”, or annotation fields that can be used to hold X-Trace data. The “trace request” (tagging all the operations associated with a task) is done in-band, as part of the messages sent for the task. Data collection happens offline and out-of-band. This makes failure handling easier, and allows the resources required for data collection to be reduced (e.g. using batching and compression). The downside to offline data collection is that it makes prompt diagnosis of problems more difficult.
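The split between in-band tagging and out-of-band collection can be sketched roughly as follows. Everything here is an assumption made for illustration: the header name `X-Trace-Sketch`, the `task;op` encoding, and the `BatchingReporter` class are hypothetical and are not taken from the paper.

```python
import json
import zlib
from typing import Callable, Dict, List, Optional, Tuple

TRACE_HEADER = "X-Trace-Sketch"  # hypothetical extension header name


def inject(headers: Dict[str, str], task_id: str, op_id: str) -> Dict[str, str]:
    """In-band: carry the trace metadata inside the message itself."""
    tagged = dict(headers)
    tagged[TRACE_HEADER] = f"{task_id};{op_id}"
    return tagged


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Read trace metadata from a received message, if it carries any."""
    raw = headers.get(TRACE_HEADER)
    if raw is None:
        return None
    task_id, op_id = raw.split(";", 1)
    return task_id, op_id


class BatchingReporter:
    """Out-of-band: buffer reports, then ship them compressed in batches."""

    def __init__(self, sink: Callable[[bytes], None], batch_size: int = 2):
        self._sink = sink
        self._batch_size = batch_size
        self._pending: List[Dict[str, str]] = []

    def report(self, task_id: str, op_id: str, label: str) -> None:
        self._pending.append({"task": task_id, "op": op_id, "label": label})
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        payload = zlib.compress(json.dumps(self._pending).encode("utf-8"))
        self._sink(payload)
        self._pending = []


collected: List[bytes] = []  # stands in for a remote report collector
reporter = BatchingReporter(collected.append, batch_size=2)

request_headers = inject({"Host": "example.org"}, "task-A", "op-1")
ids = extract(request_headers)
if ids is not None:
    reporter.report(ids[0], ids[1], "HTTP request received")
    reporter.report(ids[0], "op-2", "database query")
    reporter.report(ids[0], "op-3", "HTTP response sent")
reporter.flush()  # ship whatever is still buffered
```

Note that the request path only pays for copying a small header; the buffering is also where the delay in diagnosis comes from, since a report is not visible to the collector until its batch is flushed.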


Overall, I think this is a really nice paper. The idea is obviously useful, and it is nicely explained.

The system appears to only have a limited ability to track causal relationships. In particular, situations involving multiple clients modifying shared state don’t appear to be supported very well. For example, suppose that request A results in inserting a row into a database table. Request B aggregates over that table; based on the output, it then tries to perform some action, which fails. Clearly, requests A and B are causally related in some sense, but X-Trace wouldn’t capture this relationship. Extending X-Trace to support full causal tracking would be equivalent to data provenance.

It would be interesting to try to build a network-wide assertion checking utility on top of the X-Trace framework.


Filed under Paper Summaries

“Detailed Diagnosis in Enterprise Networks”

This paper describes NetMedic, a system for automatically diagnosing the source of network failures. This work is distinguished from prior work on automatic error diagnosis by its emphasis on detailed assignments of blame: isolating the individual process or configuration parameter that is the cause of the problem, rather than just the faulty machine. The paper focuses on small enterprise networks; in such a setting the faulty machine is usually obvious, so a more precise diagnosis is necessary to be of any practical value.

The authors base their work on a survey of defect tickets describing network failures. They found that many errors are application-specific: error codes or incorrect behavior that arise when a particular operation is attempted, for example. Thus, a naive approach to detailed failure diagnosis would require hard-coding knowledge of application semantics into the diagnosis tool. This is impractical, so instead NetMedic attempts to infer the relationships between network components by examining their joint behavior in the past.

Their technique proceeds as follows:

  • A network is modeled as a set of components (e.g. machines, processes, network paths, etc.). Each component has a set of variables that define its current state, and a set of unidirectional dependencies on other components. NetMedic constructs the dependency graph automatically. An important point is that NetMedic makes no assumptions about the semantics of these variables (it is “application agnostic”).
  • NetMedic records the state of each component over time. It infers correlations between the states of two components in the past: e.g. when C1_x was “abnormal” in the past, C2_y was also abnormal.
  • The strengths of these inter-variable correlations are used to label the edges of the dependency graph, and hence to estimate the likelihood that one component influences the state of another component. Looking at this weighted graph allows us to infer which component is causing the undesirable behavior/failure.
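The steps above can be sketched in a few lines. This is a deliberately simplified illustration of the co-abnormality idea as summarized here, not the paper's actual weighting procedure: the z-score test, the threshold of 1.5, the variable names, and the helper functions are all my own assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def abnormal_flags(samples: List[float], z_threshold: float = 1.5) -> List[bool]:
    """Flag samples that deviate strongly from the variable's own history.

    No application semantics are used: a value is "abnormal" only
    relative to the other values of the same variable.
    """
    mu = mean(samples)
    sigma = pstdev(samples)
    if sigma == 0:
        return [False] * len(samples)
    return [abs(x - mu) / sigma > z_threshold for x in samples]


def edge_weight(src_flags: List[bool], dst_flags: List[bool]) -> float:
    """Fraction of the source's abnormal time steps where the destination
    was abnormal too. Returns 0.0 if the source was never abnormal."""
    dst_when_src_abnormal = [d for s, d in zip(src_flags, dst_flags) if s]
    if not dst_when_src_abnormal:
        return 0.0
    return sum(dst_when_src_abnormal) / len(dst_when_src_abnormal)


def rank_suspects(history: Dict[str, List[float]],
                  dependencies: Dict[str, List[str]],
                  victim: str) -> List[Tuple[str, float]]:
    """Rank the victim's dependencies by historical co-abnormality."""
    victim_flags = abnormal_flags(history[victim])
    scored = [(src, edge_weight(abnormal_flags(history[src]), victim_flags))
              for src in dependencies[victim]]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up history for three variables, sampled at the same ten time steps.
history = {
    "db.cpu":      [10, 11, 9, 10, 95, 10, 11, 96, 10, 9],
    "web.latency": [20, 21, 19, 20, 300, 22, 20, 310, 21, 20],
    "dns.load":    [5, 5, 6, 5, 5, 6, 5, 5, 40, 5],
}
# The dependency graph is simply given here; NetMedic builds it automatically.
dependencies = {"web.latency": ["dns.load", "db.cpu"]}

ranked = rank_suspects(history, dependencies, "web.latency")
print(ranked)
```

In this toy history the web latency spikes exactly when the database CPU does, so the database edge gets the highest weight, while the DNS spike at an unrelated time step contributes nothing.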


A typical problem with this kind of analysis is confusing correlation with causation: the fact that C1_x and C2_y were both abnormal at around the same time is not enough evidence to conclude that C1_x caused the abnormality of C2_y, or vice versa. In fact, both behaviors might be caused by a third variable that is not accounted for by NetMedic. The paper's response is empirical rather than principled: “we find that the assumption [that correlation implies causation] holds frequently enough in practice to facilitate effective diagnosis.”

Another problem with this approach is that it cannot handle situations that have not occurred in the past. If our goal is ultimately to build reliable systems, a technique that presupposes a large database of prior failures seems somehow unsatisfactory. More importantly, this kind of diagnosis cannot account for catastrophic, unlikely events — and it is in precisely those situations that system support for failure diagnosis might be the most helpful.

Finally, it seems to me that the proposed technique wouldn’t handle complex relationships between components. For example, suppose that a component C fails catastrophically if its load rises above a certain point (say, 3 clients connected concurrently). If there are 1000 possible clients, then for any given prior failure, the state of C will not appear dependent on the vast majority of those client components. Essentially, the relationship between the clients and C might be more complex than can be accounted for by NetMedic.
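A small simulation illustrates the concern. It is only a toy model under assumptions I made up: the number of time steps, the uniform choice of 0 to 4 connected clients per step, and the fixed random seed are arbitrary, and nothing here reproduces NetMedic itself.

```python
import random

NUM_CLIENTS = 1000      # possible clients, as in the example above
FAILURE_THRESHOLD = 3   # C fails when this many clients are connected
NUM_STEPS = 2000        # arbitrary length of the simulated history

rng = random.Random(0)  # fixed seed so the run is repeatable
failures = 0
connected_during_failure = [0] * NUM_CLIENTS

for _ in range(NUM_STEPS):
    load = rng.randint(0, 4)                         # clients connected now
    connected = rng.sample(range(NUM_CLIENTS), load)
    if load >= FAILURE_THRESHOLD:                    # C fails at this step
        failures += 1
        for client in connected:
            connected_during_failure[client] += 1

# For the client most often present during failures, the share of all
# failures in which it was actually connected.
max_client_share = max(connected_during_failure) / failures
print(f"failures: {failures}, max per-client share: {max_client_share:.3f}")
```

Even the client most often present during failures is connected in only a small fraction of them, so no single client-to-C edge would look strongly correlated, although the aggregate load determines every failure.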
