“A Policy-Aware Switching Layer for Data Centers”

Many data center networks are configured as a single Layer 2 network, with the switches arranged into a spanning tree to avoid routing loops (see the SEATTLE paper for some more background on how this works). However, in practice, simply finding a Layer 2 path to the appropriate destination host is not enough: data centers also include various “middleboxes” (e.g. firewalls, load balancers, and SSL offloaders) that intercept and sometimes modify Ethernet frames in transit. To ensure that traffic is routed to middleboxes as appropriate, network administrators typically modify Layer 2 path selection. This has a number of problems:

  • Correctness: How can the administrator be assured that frames always traverse the appropriate sequence of middleboxes, even in the face of network churn and switch failures?
  • Flexibility: Changing the middlebox routing policy should be easy to do.
  • Efficiency: Frames should not be routed to middleboxes that aren’t interested in those frames.

To address these problems, “A Policy-Aware Switching Layer for Data Centers” proposes that Layer 2 routing policies be specified in a declarative language at a centralized administrative location. These policies are then compiled into a set of rules that are installed onto “policy switches” (pswitches). Pswitches make Layer 2 routing decisions by consulting a policy table, rather than the “destination MAC => switch port” mapping table used in traditional Layer 2 switches. New policies can be deployed by simply modifying the centralized configuration, which then disseminates them to the appropriate pswitches.

Policies take the form “At location X, send frames like Y to Z”, where Z is a sequence of middlebox types. Writing policies in terms of middlebox types, rather than individual middlebox instances, makes it easier to reason about correctness and about how to handle failures without violating the policy. Policies are compiled into sets of rules, which take the form “For frames that arrive from hop X like Y, send to hop Z” — this should admit an efficient implementation.
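As a rough illustration, the compilation step might look like this (the rule format and all names here are my own, not the paper’s syntax):

```python
def compile_policy(location, frame_pred, middlebox_sequence):
    """Expand "At location X, send frames like Y to Z" into per-hop rules
    of the form "for frames arriving from hop A matching Y, send to hop B"."""
    hops = [location] + list(middlebox_sequence)
    rules = [{"from": a, "match": frame_pred, "to": b}
             for a, b in zip(hops, hops[1:])]
    # After the last middlebox in the sequence, forward toward the destination.
    rules.append({"from": hops[-1], "match": frame_pred, "to": "destination"})
    return rules

rules = compile_policy("ingress", "tcp.dport == 80", ["firewall", "load_balancer"])
```

Each rule references only the previous hop and the next hop, which is what makes a table-driven pswitch implementation plausible.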

To avoid sending frames to middleboxes that are not interested in them, middleboxes are removed from the physical network data path—instead, pswitches explicitly forward frames to middleboxes as required. The authors argue that because data center interconnects are relatively low latency, this doesn’t have a major performance impact.


Making this idea work in practical networks requires addressing a lot of details. For example:

  • To support load balancers, the right-hand side of a policy can be a list of middlebox types. The pswitch uses a hash partitioning scheme that is careful to send all the packets belonging to either direction of a flow to the same destination.
  • To allow this work to be deployed without modifying the entire network, policy-based routing is implemented by encapsulating frames: to implement policy-based routing, a frame is encapsulated in another frame with a destination MAC of the next hop (middlebox or server) required by the policy. Frames are decapsulated before being delivered to middleboxes.
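The flow-affinity requirement in the first bullet can be sketched by hashing a canonical, direction-independent flow key; this is a guess at the mechanism, not the paper’s exact scheme:

```python
import hashlib

def pick_instance(src, dst, instances):
    """Choose a middlebox instance for a flow. Sorting the endpoints first
    makes the key identical for both directions of the flow, so forward and
    reverse traffic reach the same instance."""
    a, b = sorted((src, dst))
    digest = hashlib.sha256(f"{a}|{b}".encode()).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]

firewalls = ["fw1", "fw2", "fw3"]
forward = pick_instance("10.0.0.1:34567", "10.0.0.2:80", firewalls)
reverse = pick_instance("10.0.0.2:80", "10.0.0.1:34567", firewalls)
```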


I like this paper. There are only two obvious concerns I can see: the deployment difficulties in any attempt to redesign how switches operate, and the performance costs of (1) rule table lookups rather than normal Layer 2 forwarding and (2) the additional latency of traversing middleboxes that are off the network data path. Compared with a traditional software-based router, their software-based pswitch implementation achieves 82% throughput and adds 16% latency. Middleboxes deployed with this architecture suffer much more overhead: 40% throughput and twice the latency. It’s hard to say how the performance of a hardware-based pswitch implementation would compare with regular hardware-based routers.

The authors argue that the additional hop required to traverse a middlebox is acceptable, because data center networks are typically low latency. This seems backward to me: data center applications might actually want to take advantage of low latency links (e.g. see RAMCloud), and won’t like the network infrastructure adding additional latency overhead without a compelling reason.

The idea of taking middleboxes off the data path seems orthogonal to the paper’s main contribution (which is the idea of a high-level specification of Layer 2 routing invariants). It might be nice to separate these two ideas more distinctly.



Filed under Paper Summaries

“Internet Indirection Infrastructure”

“Internet Indirection Infrastructure” (i3) proposes a communication abstraction in which applications send packets to identifiers, rather than network addresses. Receivers register triggers that express their interest in packets sent to a particular identifier — the i3 infrastructure takes care of the routing. i3 is designed to be a general-purpose substrate, which can be used to build communication patterns including multicast, anycast, load balanced server selection, and host mobility.

Basic Model

To register interest in data, a receiver registers a trigger for the pair (id, addr) — this specifies that packets sent to identifier id should be routed to IP address addr. Multiple receivers for a single ID can be registered, in which case the packet is delivered to all matching trigger addresses. i3 generalizes this basic model in a few ways:

  • Inexact matching. Identifiers are divided into two pieces: an “exact match prefix” of k bits, and a suffix that is used to do longest-prefix matches. That is, a trigger matches an identifier if the first k bits match exactly, and if no other trigger has a longer prefix match against the rest of the identifier.
  • Identifier stacks. Rather than sending a packet to a single identifier, packets can instead be sent to a stack of identifiers. Similarly, triggers are generalized to mean “When a match for id is found, push a, b, … onto the stack.” i3 looks for a trigger that matches the first element of the stack; if no such trigger is found, the first element is popped from the stack. This continues until the stack is empty or a matching trigger is found; if a match is found, the trigger’s identifier stack is pushed onto the packet’s stack, and routing continues anew. i3 assumes that this process will eventually result in matching a trigger against an identifier holding an IP address. When that occurs, the packet’s data and the rest of the identifier stack are sent to the address.
  • Private triggers. This is not an i3-level concept, but a design pattern that i3-using applications often follow. A “public” trigger is a trigger on a well-known identifier, and is used to initiate a session with a server; “private” triggers are used to communicate between pairs of end hosts. A pair of private trigger identifiers might be exchanged via a service’s public trigger identifier, and then subsequent communication will proceed via the private triggers.
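A toy model of the identifier-stack matching loop, with a plain dict standing in for i3’s inexact longest-prefix matching (all names here are illustrative):

```python
def i3_route(packet_stack, triggers):
    """triggers maps an identifier either to an IP address (str) or to an
    identifier stack (list) to push. Returns (address, remaining stack),
    or (None, []) if the stack empties without reaching an address."""
    stack = list(packet_stack)
    while stack:
        head = stack[0]
        if head not in triggers:
            stack = stack[1:]            # no match: pop and keep going
            continue
        result = triggers[head]
        if isinstance(result, str):      # trigger holds an IP address:
            return result, stack[1:]     # deliver data plus remaining stack
        stack = result + stack[1:]       # push trigger's stack, continue anew
    return None, []

# A public trigger chains to a private trigger holding the receiver's address:
triggers = {"svc-public": ["alice-private"], "alice-private": "10.1.2.3"}
addr, rest = i3_route(["svc-public"], triggers)
```

Note that a real implementation would need to guard against trigger cycles, which this sketch does not.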


i3 is implemented on top of a distributed hash table (DHT); the authors use the Chord DHT in the paper. The DHT provides a service to find the node associated with a key; for the purposes of i3, this allows a packet sender to find the node that holds the triggers for the first identifier in the packet’s identifier stack (the DHT lookup is only done on the “exact match prefix” of the identifier). The target i3 node then finds the matching trigger, if any, by doing a longest-prefix match against trigger identifiers. Once a match is found, the packet is routed to the receiver address or addresses via IP, or another DHT lookup occurs (if the trigger contains an ID stack, not an address).

Senders cache the IP address of the i3 node responsible for each identifier. Hence, in the common case, routing an i3 packet requires two physical network traversals: one to get from the sender to the i3 node, and then another from the i3 node to the receiver.

Example Usage

i3 provides direct support for simple multicast; anycast is supported using inexact identifier matching, and appropriate choices for the inexact suffix of the identifier. Host mobility can be supported by simply updating a receiver’s trigger. Multicast trees can be implemented by constructing routing trees in i3, using the identifier stack feature (triggers that “invoke” triggers can essentially be used to construct an arbitrary graph, and to do recursive queries over graphs).


In the paper’s experiments, the latency to route the first packet to an identifier is about 4 times that of raw IP routing, because a Chord lookup must be done (after applying some standard Chord tricks to try to pick the “closest” Chord node among the alternatives as the ring is traversed). To reduce the latency required to route subsequent packets, i3 employs two techniques:

  • Caching the address of the i3 node that owns a given identifier, as discussed above.
  • A receiver generates k random trigger identifiers, and then chooses the identifier that it can reach via IP with the lowest latency.

By having each receiver generate 15-30 random triggers with this technique, the authors are able to reduce the i3 routing latency for subsequent packets sent to an identifier to between 1.5 and 2 times that of raw IP.
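The trigger-sampling technique is straightforward to sketch; here `rtt_to_owner` is a hypothetical stand-in for a DHT lookup followed by a ping to the responsible i3 node:

```python
import random

def choose_trigger(k, rtt_to_owner, id_bits=256):
    """Generate k random trigger identifiers and keep the one whose owning
    i3 node is cheapest to reach. Purely illustrative: rtt_to_owner hides
    the DHT lookup and latency probe a real receiver would perform."""
    candidates = [random.getrandbits(id_bits) for _ in range(k)]
    return min(candidates, key=rtt_to_owner)
```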


Overall, I liked this paper: it is an elegant idea that is explained well. A cynical view of this paper is that it isn’t all that innovative or “deep”: it is just a thin logic layer on top of a DHT. i3 is basically a generalization of the name-to-value lookup service provided by the DHT.

Both performance and security are serious issues with this design, and the paper didn’t really convince me that their approach addresses these problems.

I would have liked to see a performance comparison between IP-level multicast and anycast with implementations built on i3.

It would be interesting to compare i3 with group communication systems (GCS) that provide totally-ordered broadcast messages. What features could you add to i3 to make it easier to write reliable distributed systems?


Filed under Paper Summaries

“DNS Performance and the Effectiveness of Caching”

“DNS Performance and the Effectiveness of Caching” follows the networking tradition of collecting a great data set, and then using that data to draw some interesting and non-obvious conclusions about network behavior (see the BGP misconfiguration paper for another example of this approach).


This work is based on three separate network traces (two collected at MIT, one at KAIST in Korea). The authors recorded all outgoing DNS queries and received DNS responses, as well as all TCP connection start (SYN) and end (FIN, RST) packets for TCP flows originated inside the studied network. Only accounting for outgoing TCP flows means that they missed the effect of network services that perform DNS lookups in response to client activity (e.g. spam detection that does DNS resolution), but they found this only accounted for 10% of all DNS lookups.

In addition to their trace data, the authors performed some simulations based on the traces to measure the impact of changing various parameters (changes to average TTL, and the effects of shared caches).


The authors argue that decreasing the TTL of DNS address (“A”) records wouldn’t have much impact on the overall effectiveness of caching, due to the Zipfian distribution of host name lookups, and the fact that many cache hits for A records are tightly clustered together in time. If the inter-reference interval for a record is greater than the record’s TTL, caching will be ineffective for that record. They found that popular host names are accessed frequently enough that even a small TTL is sufficient, and that unpopular host names are accessed so infrequently that even with a large TTL, caching is ineffective. TTL values of a few minutes are sufficient to obtain most of the benefits of caching. Using a low TTL can be useful, because it allows changes to be propagated quickly; perhaps more importantly, it also enables new functionality, like DNS-based server selection and DNS support for mobile computing.
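The inter-reference-interval argument can be made concrete with a toy hit-rate calculation (my own illustration, not the paper’s trace-driven simulator):

```python
def hit_rate(access_times, ttl):
    """Fraction of accesses (after the first) that hit the cache: an access
    hits when the previous access was less than `ttl` seconds earlier."""
    pairs = list(zip(access_times, access_times[1:]))
    hits = sum(1 for prev, cur in pairs if cur - prev < ttl)
    return hits / len(pairs) if pairs else 0.0

popular = list(range(0, 600, 30))   # a name looked up every 30 seconds
unpopular = [0, 3600, 7200]         # a name looked up once an hour
```

A popular name gets a perfect hit rate even with a one-minute TTL, while the unpopular name misses even with a fifteen-minute TTL, matching the paper’s conclusion that a few minutes of TTL capture most of the benefit.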

For similar reasons, the authors found that sharing a DNS cache among a large body of clients has only marginal utility once the number of clients increases above 10 or 20 — that is, the hit rate of a shared cache for ~1300 users was only very slightly higher than the hit rate for a cache shared among 20 clients.

In contrast, the authors found that caching for name server (“NS”) records was important, because NS records change rarely, and effective NS caching reduces traffic on the root name servers.


The paper argues that:

It is likely that DNS behavior is closely linked to Web traffic patterns, since most wide-area traffic is Web-related and Web connections are usually preceded by DNS lookups.

Given streaming video and peer-to-peer file sharing services like BitTorrent, it is probably no longer true that, by bandwidth, “most wide-area traffic is Web-related.” (This may still be true of flows, however.)

I thought this paper was interesting enough, but compared to other papers we’ve covered this semester, I didn’t get that much out of it: the interesting conclusions about the effectiveness of TTLs could be summarized without reading the entire paper. That said, this paper does present a good case study of how to do research that is driven by empirical data.


Filed under Paper Summaries

“The Development of the Domain Name System”

“The Development of the Domain Name System” describes the basic design of the DNS system, the motivations and design principles employed, and the lessons learned by the designers of the system.

DNS Design

DNS is a distributed, hierarchical database that maps domain names to values. The values associated with a domain name are a collection of resource records (RRs), each with a well-known type (e.g. host IP address). RRs also have a “class,” which specifies the “protocol family” of the RR (e.g. Internet). Given the dominance of IP, I’d expect classes to not be used in the modern DNS system.

DNS names are arranged into a variable-depth tree. The name of each node is given by concatenating the labels of the path from a node to the root, and separating the labels with periods. The tree is divided into “zones,” which are each controlled by a different organization. Each zone is a contiguous region of the tree, although a zone is typically a simple subtree. Administrative authority flows from the root of the tree to the leaves: the root zone controls the set of top-level domains (e.g. COM, NET), which in turn control the contents of their subtree. This allows individual organizations to administer portions of the tree autonomously.

DNS is composed of name servers, which host DNS data, and resolvers, which query the system to resolve domain names. Resolvers can either be individual client machines (e.g. part of libc), or a separate organization-wide resolver service (that might be combined with the organization’s name server). Resolvers can cache DNS lookups, to reduce resolution traffic (particularly on the servers that host the higher levels of the tree). Each DNS record has a “TTL” value that specifies how long it can be cached for. DNS resolvers typically use UDP, rather than TCP, which avoids the need for a TCP handshake before an address can be resolved.
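A toy model of resolver caching with TTL expiry (the `upstream` callable is a stand-in for a real DNS query, not an actual DNS API):

```python
class CachingResolver:
    """Minimal sketch of TTL-based caching: `upstream` is any callable
    returning (value, ttl_seconds) for a name."""
    def __init__(self, upstream):
        self.upstream = upstream
        self.cache = {}  # name -> (value, absolute expiry time)

    def resolve(self, name, now):
        if name in self.cache:
            value, expiry = self.cache[name]
            if now < expiry:
                return value              # fresh cache hit: no upstream query
        value, ttl = self.upstream(name)  # miss or expired: query upstream
        self.cache[name] = (value, now + ttl)
        return value

calls = []
def upstream(name):
    calls.append(name)
    return ("93.184.216.34", 300)   # hypothetical address, 300 s TTL

r = CachingResolver(upstream)
r.resolve("example.com", now=0)
r.resolve("example.com", now=100)   # within TTL: served from cache
r.resolve("example.com", now=400)   # TTL expired: re-queried upstream
```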


Modern DNS is insecure, arcane, and overly complex. Why? In practice, it seems that the focus on “leanness” described in the paper hasn’t produced a simple and minimalistic system.

I wonder if the focus on extensibility in the DNS design is justified. In practice, DNS is a system for mapping host names to IP addresses, with some secondary functionality like maintaining various (often-inaccurate) administrative information about host names and the MX feature. While extensibility may have been important when the intended use of the system was unclear, it doesn’t seem to be a big win in the modern Internet — a simpler and more constrained design might be a better fit for modern DNS usage. Foregoing extensibility and constraining the core DNS functionality to be a key-value lookup service might have helped to simplify the system.

The lack of any consideration of security is notable, given DNS’s shabby security record.

The paper explains that DNS could have represented multiple resource records of the same type with a single multi-valued record. Their argument for using multiple records is:

The space efficiency of the single RR with multiple values was attractive, but the multiple RR option cut down the maximum RR size. This appeared to promise simpler dynamic update protocols, and also seemed suited to use in a limited-size datagram environment.

I think this is a backwards way to design systems, because it confuses physical layout decisions (e.g. space efficiency of storage) with logical data format decisions (e.g. whether to use multiple “rows” of the same type, or a single row with an array value).

The paper notes that in 1983, root name servers typically processed one query per second (although clients still observed poor response times, 500 milliseconds to 5 seconds in some cases). I couldn’t find much data on the query rates of modern root servers, but the website for the K root server has a graph of query rates which suggests the average load is 15,000-20,000 queries per second. Multiplied by 13 root servers, that is a significant query load (although note that the 13 “root servers” are hosted by far more than 13 physical machines, using techniques like anycast).

Related Reading

Paul Vixie’s article “What DNS Is Not” decries “innovators” who “misuse” the DNS system. For instance, he particularly dislikes CDNs that use DNS together with IP-based geolocation to steer clients.


Filed under Paper Summaries

Berkeley DB Lunch: Sailesh Krishnamurthy, Truviso

The Berkeley DB group holds a weekly lunch lecture series in the fall semester. This week’s speaker is Sailesh Krishnamurthy, one of the co-founders of Truviso (and a 2006 alum of the Berkeley DB group). I had the good fortune to work with Sailesh at Truviso, so I’m sure this will be a terrific talk. The abstract is below. This work on intelligently handling out-of-order data, and on generalizing that to “corrections” of all kinds, was only beginning when I was finishing up at Truviso, but it seems really interesting.

Time: 1-2PM, November 6, 2009 (lunch starts at 12:30, the talk starts at 1)

Location: 606 Soda Hall, UC Berkeley

Title: ACDC – Analytics over Continuous and DisContinuous Streams


Streaming continuous analytics systems have emerged as key solutions for dealing with massive data volumes and demands for low latency. These systems have been heavily influenced by an assumption that data streams can be viewed as sequences of ordered data. The reality, however, is that streams are not continuous and disruptions of various sorts in the form of either big chunks of late arriving data or arbitrary failures are endemic. We argue, therefore, that stream processing needs a fundamental rethink and advocate a unified approach providing Analytics over Continuous and DisContinuous (ACDC) streams of data. Our approach is based on a simple insight – partially process independent runs of data and defer the consolidation of the associated partial results to when the results are actually used on an on demand basis. Not only does our approach provide the first real solution to the problem of data that arrives arbitrarily late, it also lets us solve a host of hard problems such as parallelism, recovery, transactional consistency and high availability that have been neglected by streaming systems. In this talk we describe the Truviso ACDC approach and outline some of the key technical arguments and insights behind it.

Speaker: Sailesh Krishnamurthy, Vice President of Technology and Founder, Truviso, Inc.

Dr. Sailesh Krishnamurthy is responsible for setting and driving the overall technical strategy and direction for the Truviso product and solution portfolio. In addition, he works in close collaboration with marketing, sales and engineering teams in managing the product and solution roadmap, performance engineering, and technology evangelism. Previously, he built and managed the initial engineering, services and support teams at Truviso. Sailesh is a leading authority in the field of enterprise data management with over a dozen published academic papers and several U.S. patents. Sailesh investigated the technical ideas at the heart of Truviso’s products as part of his doctoral research on stream query processing, earning a PhD in Computer Science from UC Berkeley in 2006. Prior to graduate work at Berkeley, he worked at the Database Technology Institute at IBM Corporation where he designed and developed advanced features in IBM database products. Earlier, he worked on a Java virtual machine implementation at Netscape Communications. Sailesh has a Master’s degree in Computer Science from Purdue University and a Bachelor’s degree in Electrical Engineering from the Birla Institute of Technology and Science in Pilani, India.

Thanks to Daisy Zhe Wang for organizing the DB seminar this semester.


Filed under Uncategorized

“Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications”

This paper describes the design of Chord, one of the earliest and most popular academic DHT systems (according to one measure, the Chord paper is the most-cited CS paper from 2001, along with a raft of other DHT papers). Chord’s design focuses on simplicity and on offering provable performance and correctness properties. Chord provides a single service: given a key, it returns the address of the node storing the key. Other practical issues, like replication, caching, naming, and cryptographic authentication, should be provided by an application layer that runs on top of Chord.

Basic Protocol

In Chord, nodes and keys are arranged in a circle (the “identifier circle”). Each node maintains the address of its successor on the circle; maintaining the correctness of these successor pointers is the critical correctness property of the protocol. Each key k is owned by the first node whose identifier (e.g. the hash of its IP address) follows k in the identifier circle. Therefore, keys can be located by simple sequential search; the search terminates when it reaches the node that ought to own the key.

To improve search performance, Chord also maintains a finger table. The finger table is essentially a skip list: if there are N nodes in the DHT, each finger table has O(log N) entries, holding the address of a node 1/2, 1/4, 1/8, … of the way “forward” from the current node in the identifier circle. This essentially allows the search procedure to perform a binary search of the identifier circle, to quickly approach the “neighborhood” of the target key. Note that this means that each node maintains more information about its local neighborhood than about distant nodes, which makes sense.
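A compact, single-process sketch of the lookup path (ring interval arithmetic is the fiddly part; this builds fingers with global knowledge, which a real deployment acquires via the join and stabilization protocols):

```python
M = 6  # identifier bits: ring positions live in [0, 2**M)

def in_open(x, a, b):
    """True if x lies in the ring interval (a, b), exclusive on both ends."""
    if a < b:
        return a < x < b
    if a > b:
        return x > a or x < b   # interval wraps past zero
    return x != a               # a == b: everything except a itself

def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return x == b or in_open(x, a, b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.fingers = [self] * M   # fingers[i] ~ successor(id + 2**i)

    def closest_preceding_finger(self, key):
        for f in reversed(self.fingers):    # scan farthest finger first
            if in_open(f.id, self.id, key):
                return f
        return self

    def find_successor(self, key):
        n = self
        while not in_half_open(key, n.id, n.successor.id):
            nxt = n.closest_preceding_finger(key)
            n = nxt if nxt is not n else n.successor  # always make progress
        return n.successor

# Build a small static ring, precomputing fingers with global knowledge:
ids = [1, 8, 14, 21, 32, 38, 42, 48, 56]
nodes = {i: Node(i) for i in ids}
def succ_of(x):
    return next((nodes[i] for i in ids if i >= x), nodes[ids[0]])
for n in nodes.values():
    n.successor = succ_of((n.id + 1) % 2**M)
    n.fingers = [succ_of((n.id + 2**i) % 2**M) for i in range(M)]
```

Each hop roughly halves the remaining distance to the key, which is where the O(log N) lookup bound comes from.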

Joining A Chord Group

To join a Chord group, a joining node begins by computing its predecessor and finger tables. It then joins the group by notifying its successor node, which includes the new node as its predecessor. Each node runs a “stabilization” protocol to maintain the correctness of the successor pointers: each node n periodically asks whether the predecessor of its successor node is n; if not, a new node has joined the group, so n sets its successor to the new intermediate node, and notifies that node to make n the node’s predecessor. Once the successor pointers are updated, the new node has joined the Chord group, and can be accessed by subsequent lookup queries. In the background, keys that are now owned by n will be transferred from n’s successor, and finger tables will be updated lazily: updating fingers promptly is not actually important for good performance, because fingers are used only to quickly reach the right “neighborhood.”
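The stabilize/notify exchange can be sketched in a few lines (successor and predecessor pointers only; fingers and key migration are omitted, and all names are my own):

```python
class RingNode:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None

def between(x, a, b):
    """True if x lies strictly inside the ring interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b   # wrapped interval (covers a == b degenerately)

def stabilize(n):
    """n asks its successor for the successor's predecessor; if that node
    sits between them, a new node has joined, so n adopts it."""
    p = n.successor.predecessor
    if p is not None and between(p.id, n.id, n.successor.id):
        n.successor = p
    notify(n.successor, n)

def notify(n, candidate):
    """candidate believes it may be n's predecessor."""
    if n.predecessor is None or between(candidate.id, n.predecessor.id, n.id):
        n.predecessor = candidate

# A two-node ring; node c joins between a and b by pointing at b:
a, b, c = RingNode(10), RingNode(20), RingNode(15)
a.successor, b.successor = b, a
a.predecessor, b.predecessor = b, a
c.successor = b
stabilize(c)   # b learns that c is its predecessor
stabilize(a)   # a learns about c and re-points its successor
```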

Handling Node Failure

If a node fails, the data on the node may be lost. This is not Chord’s concern, because it delegates responsibility for replication to the application. Instead, Chord must ensure that the Chord ring is not “broken”: lookup queries can continue, despite the fact that the predecessor of the failed node has an invalid successor node.

Chord achieves that by having each node maintain a list of successors, rather than a single one. If a node notices that its successor has failed, it simply tries the next node in the list. The paper argues that since the failure probabilities of elements of the successor list can be considered to be independent, using a sufficiently large successor list provides arbitrary protection against node failures. The stabilization protocol above is extended to apply to the entire list of successors, not just the first one.
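The independence argument amounts to a simple calculation: with per-node failure probability p and a list of r successors, a node is cut off from the ring only when all r fail at once.

```python
def cutoff_probability(p, r):
    """Probability that a node loses all r successors simultaneously,
    assuming independent failures (the paper's assumption)."""
    return p ** r
```

Even if half the nodes fail at the same moment (p = 0.5), a 16-entry successor list leaves roughly a 1-in-65,536 chance that any given node loses every successor.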


Overall, this is a great paper, and it deserves its reputation as such. It also shows the benefit of combining good systems work with a top-flight theoretician (Karger).

Despite the paper’s formal tone, I was confused by their talk of “correctness”: they never actually define what their correctness criteria are. Their notion of “correctness” is apparently that the Chord ring remains live and can continue answering queries, even if the query answers are incorrect or undetected ring partitions occur. For example, they observe that when a join occurs and before stabilization finishes,

the nodes in the affected region [may] have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, and the lookup may fail. The higher-layer software using Chord will notice that the desired data was not found, and has the option of retrying the lookup after a pause.

That is quite disconcerting, and would be an impediment to using Chord-like systems as a backend for a reliable storage service. Importantly, many applications would like to distinguish between “lookup failed because of stabilization problems” and “lookup failed because the key was definitely not in the Chord ring.” As presented, the protocol does not appear to allow this distinction.

Another troubling aspect of the paper is the authors’ attitude toward network partitions, which they regard as a “pathological case.” I thought this was surprising, because (short-lived) network partitions are not that rare. In fact, one of the explicit benefits of loose-consistency systems like DHTs is their ability to tolerate network partitions gracefully. They sketch a few ideas on how to detect and heal partitions, but don’t address this problem in any real depth.

The paper doesn’t talk about an obvious optimization: making “distance” in the Chord ring correlate with distance within the underlying physical network, to reduce communication costs. This is explored in depth by subsequent work.


Filed under Paper Summaries

“Looking Up Data In P2P Systems”

“Looking Up Data In P2P Systems” is a CACM article that provides a brief, high-level overview of research on DHT systems in the 2003 timeframe. Their focus is on Chord, Pastry, and similar systems that implement a DHT via an overlay network, and are primarily intended for WAN environments.

The paper focuses on the “lookup problem”: searching for a data item in a distributed system without using centralized servers or a hierarchical server structure. The paper’s argument is that requiring centralized or privileged nodes inhibits scalability, because it loads certain servers more than others. To avoid the need for centralized resources, P2P lookup systems typically rely on mapping lookup keys into a numeric ID space, typically via hashing. The lookup protocol then uses the local “neighbor” information cached at each node to successively navigate toward nodes that hold keys in the ID space that are “closer” to the lookup key, using a distance metric defined by each protocol. An important real-world optimization is to make the P2P system’s notion of “distance” correspond roughly to network distance, to reduce the number of physical network hops that must be traversed to perform a lookup. P2P lookup systems differ in the amount of state they keep about their neighbors, how the space of nodes is arranged (1-D ring, 2-D space, etc.), the distance metric used, whether the system attempts to protect against malicious nodes joining the group, and so forth.


P2P systems clearly have a role to play, but the hype around this topic has died down considerably since 2003. Modern DHTs are typically employed inside a handful of datacenters, and hence use different algorithms than the WAN-oriented protocols discussed by this paper (e.g. Dynamo is a well-known DHT for datacenter use designed by Amazon).

The problem with treating all the nodes in a P2P system as true “peers” is that it assumes that each node has approximately equal connectivity and resources. This is unlikely to be true, especially in a WAN environment. Hence the use of “SuperNodes” in Kazaa, and other practical systems: to achieve good performance, a P2P system should take advantage of nodes with better connectivity and resources. This amounts to constructing a hierarchical arrangement on-the-fly, but there’s nothing inherently “unscalable” about that. It is interesting that this line of P2P work explores the extreme “decentralized / symmetric” end of the spectrum, but I think most practical systems will incorporate some degree of asymmetry.

Related Reading

“The Surprising Persistence of Peer-To-Peer” discusses the author’s view that P2P research has remained “surprisingly” relevant and active in recent years.


Filed under Paper Summaries