“X-Trace: A Pervasive Network Tracing Framework” provides a tool for understanding the behavior of distributed systems composed of layers of protocols. Traditional logging and diagnostic tools operate at a single layer in the stack, for example by tracing the flow of HTTP or TCP traffic in a network. This is insufficient for understanding many realistic failure scenarios, because application traffic typically traverses many different layers and protocols: when a client initiates an HTTP request, the following might happen:
- A DNS lookup is performed (via UDP, perhaps requiring recursive resolution)
- A TCP connection is established to the remote server, which requires transmitting IP packets and numerous Ethernet frames across different network links.
- The remote server handles the HTTP request, typically by running various application code and contacting multiple databases (e.g. a single Google search request is distributed to ~1000 machines)
- The HTTP response is returned to the client; the contents of the response may prompt the client to issue subsequent requests (e.g. additional HTTP requests to fetch resources like images and external CSS)
A failure at any point in this sequence can result in a failure of the original action—hence, diagnosis tools that attempt to operate at a single layer of the stack won’t provide a complete picture into the operation of a distributed system.
Design
X-Trace works by tagging all the operations associated with a single high-level task with the same task ID. By modifying multiple layers of the protocol stack to record and propagate task IDs, all the low-level operations associated with a high-level task can be reconstructed. Furthermore, X-Trace also allows developers to annotate causal relationships, which allows a “task tree” of related operations to be constructed.
X-Trace metadata must be manually embedded into protocols by developers; protocols typically provide “extension”, “option”, or annotation fields that can be used to hold X-Trace data. The “trace request” (tagging all the operations associated with a task) is done in-band, as part of the messages sent for the task. Data collection happens offline and out-of-band. This makes failure handling easier, and allows the resources required for data collection to be reduced (e.g. using batching and compression). The downside to offline data collection is that it makes prompt diagnosis of problems more difficult.
Discussion
Overall, I think this is a really nice paper. The idea is obviously useful, and it is nicely explained.
The system appears to only have a limited ability to track causal relationships. In particular, situations involving multiple clients modifying shared state don’t appear to be supported very well. For example, suppose that request A results in inserting a row into a database table. Request B aggregates over that table; based on the output, it then tries to perform some action, which fails. Clearly, requests A and B are causally related in some sense, but X-Trace wouldn’t capture this relationship. Extending X-Trace to support full causal tracking would be equivalent to data provenance.
It would be interesting to try to build a network-wide assertion checking utility on top of the X-Trace framework.