“White Space Networking with Wi-Fi like Connectivity”

White Space Networking with Wi-Fi like Connectivity” discusses how to achieve wireless networking using so-called “white space”: unused portions of the UHF spectrum (approximately 512-698 Mhz). This is attractive, because using the UHF spectrum can allow wireless networking over a relatively long range (1 mile or greater, compared to 300-600 feet for 802.11n). UHF signals are also better able to penetrate walls and buildings in urban environments.

The difficulty with using UHF spectrum is that the spectrum has already been assigned for use by two “incumbents”: analog TV and wireless microphones (although the use of the spectrum by analog TV should be reduced since this paper was written, due to the recent conversion of analog TV to digital). Per FCC rules, non-incumbent use of the UHF spectrum must avoid interfering with incumbent use. While TV signals are relatively stable, wireless microphone transmissions can begin without warning, which poses some challenges to providing wireless networking using the UHF spectrum. To gain improved throughput, the authors also suggest that multiple contiguous free channels should be aggregated together and used for networking.

To enable networking over UHF, several problems must be solved:

  • Spectrum Assignment: What portion of the 180 Mhz UHF band should be used for networking in a given locale? The UHF spectrum is typically “fragmented”: some portions of the spectrum are in use, while others are free. Furthermore, to increase throughput, multiple contiguous free channels should be used by a single network. The right spectrum to use will also change over time, e.g. as hosts move and wireless microphones are activated. The channel used must be free for both the AP and all clients, which is made challenging by the larger spacial extent of UHF networks.
  • AP Discovery: WiFi APs emit beacons every 100ms on the WiFi channel they are using. To find APs, clients can easily scan each channel, looking for AP beacons. This is more challenging for UHF, because there are more possibilities to scan (30 UHF channels, and 3 possible channel widths, yielding 84 combinations).
  • Handling Disconnections: If incumbent use of a channel is detected, networking transmissions must immediately stop using the channel (the authors found that even transmitting control packets can result in audible interference for wireless microphones using the channel). If clients and APs are prone to abruptly disconnect from a channel, a method is needed to choose another channel and continue communication.


To solve the spectrum assignment problem, clients and APs exchange information about which channels are free near them. APs probe for new channels when they detect incumbent use on their current channel, or if they observe a performance drop. Channel probing is done by considering which channels are free for all clients and the AP, and by then estimating the available bandwidth on each channel.

To enable AP discovery, the authors propose some signal processing magic. They propose a technique called “SIFT” which essentially allows them to scan the spectrum range and cheaply determine the presence of an AP on a channel; they can then tune the radio transceiver to the identified channel, and decode the beacon packet as usual.

To handle disconnections due to incumbent use of the spectrum, they propose using a separate 5Mhz backup channel, which is advertised as part of AP beacon packets. If a client senses that a disconnection has occurred (e.g. because no data packets have been received recently), it switches to the backup channel, and both listens for and emits chirps. Chirps contain information about the available white spaces near the chirping node, which is used to pick a new channel for communication. APs periodically scan for chirps.

It’s possible that the backup channel is already in use by another incumbent. In this case, the client picks an arbitrary available channel as the secondary backup channel, and emits chirps on it. The AP periodically scans all channels to attempt to find such chirps. The same signal processing magic (SIFT) that is used to enable cheap AP discovery is used to make scanning all channels periodically feasible.

Related Reading

Matt Welsh discusses this paper on his blog. The slides for the SIGCOMM talk on this paper are also available online.

Leave a comment

Filed under Paper Summaries

“Interactive Wi-Fi Connectivity for Moving Vehicles”

Interactive Wi-Fi Connectivity for Moving Vehicles” proposes ViFi, a system for achieving connectivity via WiFi from moving vehicles.

A moving vehicle is in range of different WiFi base stations over time. Due to host mobility, the best base station to use will change, so the client must do “handoff” to move between the base stations as appropriate. The authors describe traditional handoff policies as “hard handoff,” because a client is only ever connected to one base station at a time. The authors present experimental and analytical results to suggest that any handoff policy based on connecting to a single base station will be outperformed by policy which opportunistically utilizes multiple base stations that are in-range at once. Utilizing multiple base stations at once is effective for two reasons:

  1. A client is often in range of multiple base stations, at least in the urban environments they study.
  2. Packet loss is roughly independent of senders and receivers. In the upstream direction, utilizing multiple BSs is effective because losses are roughly independent of base stations: a packet that is not received by one BS is likely to be received by at least one another BS. In the downstream direction, BS diversity is effective because downstream losses tend to be “bursty” and path-dependent, but receiver independent: while one BS will often fail to deliver a burst of packets, other BSs are likely to be able to send packets to the receiver.


Therefore, the authors propose a scheme to allow WiFi clients to opportunistically utilize multiple in-range base stations. They assume that base stations have some ability to communicate, but do not have a high-bandwidth link. Hence, they focus on a scheme that allows the base stations to coordinate to enable base station diversity and avoids duplicate or missed packet deliveries, but requires minimal additional load on the inter-base station link and doesn’t hurt user-perceived latency. Their basic approach is to combine opportunistic reception (clients broadcast to all BSs in range) with probabilistic relaying (base stations compute relay probabilities independently and probabilistically).

In ViFi, each vehicle designates one nearby base station as the anchor, and the rest as auxiliaries. The anchor and auxiliary list is periodically broadcast by the vehicle via a beacon, which allows base stations to learn the identify and addresses of peer BSs.

Packet Transmission

The basic ViFi transmission algorithm is simple. To send a packet from the vehicle upstream, the receiver broadcasts to all in-range receivers. If the anchor receives the packet, it responds with an ACK broadcast. Any auxiliaries that overhear the packet broadcast wait for a short window; if they don’t see an ACK, they probabilistically relay the packet to the anchor via the inter-BS backplane. When the anchor receives relayed packets that it hasn’t already ACK’ed, it ACKs them. Finally, the receiver retransmits packets if it doesn’t receive an ACK within a certain period of time.

The transmission algorithm for downstream packets (base station => vehicle) works similarly (and almost symmetrically). The anchor broadcasts the packet to the vehicle. If the vehicle hears it, it broadcasts an ACK. If any auxiliaries overhear the broadcast but don’t see the ACK after a short window of time, they probabilistically broadcast the packet themselves. The anchor retransmits the packet if it doesn’t hear an ACK during the retransmission interval.

The magic is in the probabilistic retransmission technique, which must balance between retransmitting too often (wasting bandwidth) and not often enough (risking packet drops). The intuition behind their probabilistic retransmission technique is that they want the expected number of packet relays amongst all auxiliaries to be 1; amongst auxiliaries, they prefer transmission by auxiliaries that are better connected to the destination (i.e. either the anchor BS or the vehicle). Each BS computes packet reception probabilities between itself and the other BSs and the vehicle, using a periodic beacon.

Other Details

802.11 link-layer acknowledgments are sent immediately after a packet is delivered to the recipient, which allows the sender to predict the expected arrival time of the ACK. In ViFi, ACKs may be delayed due to relaying, which depends on current network conditions (e.g. behavior of the inter-BS backplane). Hence, ViFi uses its own acknowledgment scheme, with the retransmission interval set based on recent history (99th percentile measured delay for ACKs).

ViFi also implements salvage. When a vehicle moves out of range before the anchor BS can deliver packets for it, those packets are “stranded” on the old anchor BS. When the vehicle obtains a new anchor BS, the beacon it sends out also includes the previous anchor BS. The new anchor contacts the old anchor, and asks for any unacknowledged packets for the vehicle to be relayed.

Leave a comment

Filed under Paper Summaries


X-Trace: A Pervasive Network Tracing Framework” provides a tool for understanding the behavior of distributed systems composed of layers of protocols. Traditional logging and diagnostic tools operate at a single layer in the stack, for example by tracing the flow of HTTP or TCP traffic in a network. This is insufficient for understanding many realistic failure scenarios, because application traffic typically traverses many different layers and protocols: when a client initiates an HTTP request, the following might happen:

  1. A DNS lookup is performed (via UDP, perhaps requiring recursive resolution)
  2. A TCP connection is established to the remote server, which requires transmitting IP packets and numerous Ethernet frames across different network links.
  3. The remote server handles the HTTP request, typically by running various application code and contacting multiple databases (e.g. a single Google search request is distributed to ~1000 machines)
  4. The HTTP response is returned to the client; the contents of the response may prompt the client to issue subsequent requests (e.g. additional HTTP requests to fetch resources like images and external CSS)

A failure at any point in this sequence can result in a failure of the original action—hence, diagnosis tools that attempt to operate at a single layer of the stack won’t provide a complete picture into the operation of a distributed system.


X-Trace works by tagging all the operations associated with a single high-level task with the same task ID. By modifying multiple layers of the protocol stack to record and propagate task IDs, all the low-level operations associated with a high-level task can be reconstructed. Furthermore, X-Trace also allows developers to annotate causal relationships, which allows a “task tree” of related operations to be constructed.

X-Trace metadata must be manually embedded into protocols by developers; protocols typically provide “extension”, “option”, or annotation fields that can be used to hold X-Trace data. The “trace request” (tagging all the operations associated with a task) is done in-band, as part of the messages sent for the task. Data collection happens offline and out-of-band. This makes failure handling easier, and allows the resources required for data collection to be reduced (e.g. using batching and compression). The downside to offline data collection is that it makes prompt diagnosis of problems more difficult.


Overall, I think this is a really nice paper. The idea is obviously useful, and it is nicely explained.

The system appears to only have a limited ability to track causal relationships. In particular, situations involving multiple clients modifying shared state don’t appear to be supported very well. For example, suppose that request A results in inserting a row into a database table. Request B aggregates over that table; based on the output, it then tries to perform some action, which fails. Clearly, requests A and B are causally related in some sense, but X-Trace wouldn’t capture this relationship. Extending X-Trace to support full causal tracking would be equivalent to data provenance.

It would be interesting to try to build a network-wide assertion checking utility on top of the X-Trace framework.


Filed under Paper Summaries


One of the challenges faced in networking research and education is the difficulty of accurately recreating a network environment. Production routers are implemented as hardware devices, but designing and fabricating a new hardware design for research or teaching is very expensive. Modular software routers like Click enable research and education on routing in software, but don’t allow experimentation with hardware and physical-layer issues.

NetFPGA aims to rectify this, by providing an easy-programmable hardware platform for network devices (NICs, switches, routers, etc.). Users interact with NetFPGA remotely, uploading programs that configure FPGAs on the device appropriately, and then observing the results. The initial version of NetFPGA didn’t have a CPU: instead, software to control the device runs on another computer, and accesses hardware registers remotely (using special Ethernet packets that are decoded by the NetFPGA board). The second version of NetFPGA has an optional hardware CPU.


This seems like a pretty incontrovertible “good idea.” One potential question is how the performance of an FPGA-based router compares to that of realistic production hardware — I’d suspect that NetFPGA-routers occupy a middle ground between software-only routers (easy to develop, poor performance) and production routers (difficult to program and expensive to fabricate, but good performance).

Leave a comment

Filed under Paper Summaries

“A Policy-Aware Switching Layer for Data Centers”

Many data center networks are configured as a single Layer 2 network, with the switches configured into a spanning tree to avoid routing loops (see the SEATTLE paper for some more background on how this works). However, in practice, simply finding a Layer 2 path to the appropriate destination host is not enough: data centers also include various “middleboxes” (e.g. firewalls, load balancers, and SSL offloaders) that intercept and sometimes modify Ethernet frames in transit. To ensure that traffic is routed to middleboxes as appropriate, network administrators typically modify Layer 2 path selection. This has a number of problems:

  • Correctness: How can the administrator be assured that frames always traverse the appropriate sequence of middleboxes, even in the face of network churn and switch failures?
  • Flexibility: Changing the middlebox routing policy should be easy to do.
  • Efficiency: Frames should not be routed to middleboxes that aren’t interested in those frames.

To address these problems, “A Policy-Aware Switching Layer for Data Centers” proposes that Layer 2 routing policies be specified in a declarative language at a centralized administrative location. These policies are compiled into then compiled into a set of rules that are installed onto “policy switches” (pswitches). Pswitches make Layer 2 routing decisions by consulting a policy table, rather than the “destination MAC => switch port” mapping table used in traditional Layer 2 switches. New policies can be deployed by simply modifying the centralized configuration, which then disseminates them to the appropriate pswitches.

Policies take the form “At location X, send frames like Y to Z”, where Z is a sequence of middlebox types. Writing policies in terms of middlebox types, rather than talking about individual middlebox instances, makes it easier to reason about the correctness and how to handle failures without violating the policy. Policies are compiled into sets of rules, which take the form “For frames that arrive from hop X like Y, send to hop Z” — this should admit an efficient implementation.

To avoid sending frames to middleboxes that are not interested in them, middleboxes are removed from the physical network data path—instead, pswitches explicitly forward frames to middleboxes as required. The authors argue that because data center interconnects are relatively low latency, this doesn’t have a major performance impact.


Making this idea work in practical networks requires addressing a lot of details. For example:

  • To support load balancers, the right-hand side of a policy can be a list of middlebox types. The pswitch uses a hash partitioning scheme that is careful to send all the packets belonging to either direction of a flow to the same destination.
  • To allow this work to be deployed without modifying the entire network, policy-based routing is implemented by encapsulating frames: to implement policy-based routing, a frame is encapsulated in another frame with a destination MAC of the next hop (middlebox or server) required by the policy. Frames are decapsulated before being delivered to middleboxes.


I like this paper. There are only two obvious concerns I can see: the deployment difficulties in any attempt to redesign how switches operate, and the performance concerns of (1) rule table lookups rather than normal Layer 2 routing (2) the additional latency of traversing middleboxes that are off the network data path. Compared with a traditional software-based router, their software-based pswitch implementation achieves 82% throughput and adds an additional 16% latency. Middleboxes deployed with this architecture suffer a lot more overhead: 40% throughput and twice the latency. Its hard to say how the performance of a hardware-based pswitch implementation would compare with regular hardware-based routers.

The authors argue that the additional hop required to traverse a middlebox is acceptable, because data center networks are typically low latency. This seems backward to me: data center applications might actually want to take advantage of low latency links (e.g. see RAMCloud), and won’t like the network infrastructure adding additional latency overhead without a compelling reason.

The idea of taking middleboxes off the data path seems orthogonal to the paper’s main contribution (which is the idea of a high-level specification of Layer 2 routing invariants). It might be nice to separate these two ideas more distinctly.

Leave a comment

Filed under Paper Summaries

“Internet Indirection Infrastructure”

Internet Indirection Infrastructure” (i3) proposes a communication abstraction in which applications send packets to identifiers, rather than network addresses. Receivers register triggers that express their interest in packets sent to a particular identifier— the i3 infrastructure takes care of the routing. i3 is designed to be a general-purpose substrate, which can be used to build communication patterns including multicast, anycast, load balanced server selection, and host mobility.

Basic Model

To register interest in data, a receiver registers a trigger for the pair (id, addr) — this specifies that packets sent to identifier id should be routed to IP address addr. Multiple receivers for a single ID can be registered, in which case the packet is delivered to all matching trigger addresses. i3 generalizes this basic model in a few ways:

  • Inexact matching. Identifiers are divided into two pieces: an “exact match prefix” of k bits, and a suffix that is used to do longest-prefix matches. That is, a trigger matches an identifier if the first k bits match exactly, and if no other trigger has a longer prefix match against the rest of the identifier.
  • Identifier stacks. Rather than sending a packet to a single identifier, packets can instead be sent to a stack of identifiers. Similarly, triggers are generalized to mean “When a match for id is found, push a, b, … onto the stack.” i3 looks for a trigger that matches the first element of the stack; if no such trigger is found, the first element is popped from the stack. This continues until the stack is empty or a matching trigger is found; if a match is found, the trigger’s identifier stack is pushed onto the packet’s stack, and routing continues anew. i3 assumes that this process will eventually result in matching a trigger against an identifier holding an IP address. When that occurs, the packet’s data and the rest of the identifier stack is sent to the address.
  • Private triggers. This is not an i3-level concept, but a design pattern that i3-using applications often follow. A “public” trigger is a trigger on a well-known identifier, and is used to initiate a session with a server; “private” triggers are used to communicate between pairs of end hosts. A pair of private trigger identifiers might be exchanged via a service’s public trigger identifier, and then subsequent communication will proceed via the private triggers.


i3 is implemented on top of a distributed hash table (DHT); the authors use the Chord DHT in the paper. The DHT provides a service to find the node associated with a key; for the purposes of i3, this allows a packet sender to find the node that holds the triggers for the first identifier in the packet’s identifier stack (the DHT lookup is only done on the “exact match prefix” of the identifier). The target i3 node then finds the matching trigger, if any, by doing a longest-prefix match against trigger identifiers. Once a match is found, the packet is routed to the receiver address or addresses via IP, or another DHT lookup occurs (if the trigger contains an ID stack, not an address).

Senders cache the IP address of the i3 node responsible for each identifier. Hence, in the common case, routing an i3 packet requires two physical network traversals: one to get from the sender to the i3 node, and then another from the i3 node to the receiver.

Example Usage

i3 provides direct support for simple multicast; anycast is supported using inexact identifier matching, and appropriate choices for the inexact suffix of the identifier. Host mobility can be supported by simply updating a receiver’s trigger. Multicast trees can be implemented by constructing routing trees in i3, using the identifier stack feature (triggers that “invoke” triggers can essentially be used to construct an arbitrary graph, and to do recursive queries over graphs).


In the paper’s experiments, the latency to route the first packet to an identifier is about 4 times that of raw IP routing, because a Chord lookup must be done (after applying some standard Chord tricks to try to pick the “closest” Chord node among the alternatives as the ring is traversed). To reduce the latency required to route subsequent packets, i3 employs two techniques:

  • Caching the address of the i3 node that owns a given identifier, as discussed above.
  • A receiver generates k random trigger identifiers, and then chooses the identifier that it can reach via IP with the lowest latency.

By having each receiver generate 15-30 random triggers with this technique, the authors are able to reduce the i3 routing latency for subsequent packets send to an identifier to between 2 and 1.5 times more than raw IP.


Overall, I liked this paper: it is an elegant idea that is explained well. A cynical view of this paper is that it isn’t all that innovative or “deep”: it is just a thin logic layer on top of a DHT. i3 is basically a generalization of the name-to-value lookup service provided by the DHT.

Both performance and security are serious issues with this design, and the paper didn’t really convince me that their approach addresses these problems.

I would have liked to see a performance comparison between IP-level multicast and anycast with implementations built on i3.

It would be interesting to compare i3 with group communication systems (GCS) that provide totally-ordered broadcast messages. What features could you add to i3 to make it easier to write reliable distributed systems?

Leave a comment

Filed under Paper Summaries

“DNS Performance and the Effectiveness of Caching”

DNS Performance and the Effectiveness of Caching” follows the networking tradition of collecting a great data set, and then using that data to draw some interesting and non-obvious conclusions about network behavior (see the BGP misconfiguration paper for another example of this approach).


This work is based on three separate network traces (two collected at MIT, one at KAIST in Korea). The authors recorded all outgoing DNS queries and received DNS responses, as well as all TCP connection start (SYN) and end (FIN, RST) packets for TCP flows originated inside the studied network. Only accounting for outgoing TCP flows means that they missed the effect of network services that perform DNS lookups in response to client activity (e.g. spam detection that does DNS resolution), but they found this only accounted for 10% of all DNS lookups.

In addition to their trace data, the authors performed some simulations based on the traces to measure the impact of changing various parameters (changes to average TTL, and the effects of shared caches).


The authors argue that decreasing the TTL of DNS address (“A”) records wouldn’t have much impact on the overall effectiveness of caching, due to the Zipfian distribution of host name lookups, and the fact that many cache hits for A records are tightly clustered together in time. If the inter-reference interval for a record is greater than the record’s TTL, caching will be ineffective for that record. They found that popular host names are accessed frequently enough that even a small TTL is sufficient, and that unpopular host names are accessed so infrequently that even with a large TTL, caching is ineffective. TTL values of a few minutes are sufficient to obtain most of the benefits of caching. Using a low TTL can be useful, because it allows changes to be propagated quickly; perhaps more importantly, it also enables new functionality, like DNS-based server selection and DNS support for mobile computing.

For similar reasons, the authors found that sharing a DNS cache among a large body of clients has only marginal utility once the number of clients increases above 10 or 20 — that is, the hit rate of a shared cache for ~1300 users was only very slightly higher than the hit rate for a cache shared among 20 clients.

In contrast, the authors found that caching for name server (“NS”) records was important, because NS records change rarely, and effective NS caching reduces traffic on the root name servers.


The paper argues that:

It is likely that DNS behavior is closely linked to Web traffic patterns, since most wide-area traffic is Web-related and Web connections are usually preceded by DNS lookups.

Given streaming video and peer-to-peer file sharing services like BitTorrent, it is probably no longer true that, by bandwidth, “most wide-area traffic is Web-related.” (This may still be true of flows, however.)

I thought this paper was interesting enough, but compared to other papers we’ve covered this semester, I didn’t get that much out of it: the interesting conclusions about the effectiveness of TTLs could be summarized without reading the entire paper. That said, this paper does present a good case study of how to do research that is driven by empirical data.


Filed under Paper Summaries

“The Development of the Domain Name System”

The Development of the Domain Name System” describes the basic design of the DNS system, the motivations and design principles employed, and the lessons learned by the designers of the system.

DNS Design

DNS is a distributed, hierarchical database that maps domain names to values. The values associated with a domain name are a collection of resource records (RRs), each with a well-known type (e.g. host IP address). RRs also have a “class,” which specifies the “protocol family” of the RR (e.g. Internet). Given the dominance of IP, I’d expect classes to not be used in the modern DNS system.

DNS names are arranged into a variable-depth tree. The name of each node is given by concatenating the labels of the path from a node to the root, and separating the labels with periods. The tree is divided into “zones,” which are each controlled by a different organization. Each zone is a contiguous region of the tree, although a zone is typically a simple subtree. Administrative authority flows from the root of the tree to the leaves: the root zone controls the set of top-level domains (e.g. COM, NET), which in turn control the contents of their subtree. This allows individual organizations to administer portions of the tree autonomously.

DNS is composed of name servers, which host DNS data, and resolvers, which query the system to resolve domain names. Resolvers can either be individual client machines (e.g. part of libc), or a separate organization-wide resolver service (that might be combined with the organization’s name server). Resolvers can cache DNS lookups, to reduce resolution traffic (particularly on the servers that host the higher levels of the tree). Each DNS record has a “TTL” value that specifies how long it can be cached for. DNS resolvers typically use UDP, rather than TCP, which avoids the need for a TCP handshake before an address can be resolved.


Modern DNS is insecure, arcane, and overly complex. Why? In practice, it seems that the focus on “leanness” described in the paper hasn’t produced a simple and minimalistic system.

I wonder if the focus on extensibility in the DNS design is justified. In practice, DNS is a system for mapping host names to IP addresses, with some secondary functionality like maintaining various (often-inaccurate) administrative information about host names and the MX feature. While extensibility may have been important when the intended use of the system was unclear, it doesn’t seem to be a big win in the modern Internet — a simpler and more constrained design might be a better fit for modern DNS usage. Foregoing extensibility and constraining the core DNS functionality to be a key-value lookup service might have helped to simplify the system.

The lack of any consideration of security is notable, given DNS’s shabby security record.

The paper explains that DNS could have represented multiple resource records of the same type with a single multi-valued record. Their argument for using multiple records is:

The space efficiency of the single RR with multiple values was attractive, but the multiple RR option cut down the maximum RR size. This appeared to promise simpler dynamic update protocols, and also seemed suited to use in a limited-size datagram environment.

I think this is a backwards way to design systems, because it confuses physical layout decisions (e.g. space efficiency of storage) with logical data format decisions (e.g. whether to use multiple “rows” of the same type, or a single row with an array value).

The paper notes that in 1983, root name servers typically processed one query per second (although clients still observed poor response times, 500 milliseconds to 5 seconds in some cases). I couldn’t find much data on the query rates of modern root servers, but the website for the K root server has a graph of query rates which suggests the average load is 15,000-20,000 queries per second. Multiplied by 13 root servers, that is a significant query load (although note that the 13 “root servers” are hosted by far more than 13 physical machines, using techniques like anycast).

Related Reading

Paul Vixie’s article “What DNS Is Not” decries “innovators” who “misuse” the DNS system. He particularly dislikes using IP-based geolocation to use DNS to implement CDNs, for instance.

Leave a comment

Filed under Paper Summaries

Berkeley DB Lunch: Sailesh Krishnamurthy, Truviso

The Berkeley DB group holds a weekly lunch lecture series in the fall semester. This week’s speaker is Sailesh Krishnamurthy, one of the co-founders of Truviso (and a 2006 alum of the Berkeley DB group). I had the good fortune to work with Sailesh at Truviso, so I’m sure this will be a terrific talk. The abstract is below—this work on intelligently handling out-of-order data, and generalizing that to “corrections” of all kinds—was only beginning when I was finishing up at Truviso, but it seems really interesting.

Time: 1-2PM, November 6, 2009 (lunch starts at 12:30, the talk starts at 1)

Location: 606 Soda Hall, UC Berkeley

Title: ACDC – Analytics over Continuous and DisContinuous Streams


Streaming continuous analytics systems have emerged as key solutions for dealing with massive data volumes and demands for low latency. These systems have been heavily influenced by an assumption that data streams can be viewed as sequences of ordered data. The reality, however, is that streams are not continuous and disruptions of various sorts in the form of either big chunks of late arriving data or arbitrary failures are endemic. We argue, therefore, that stream processing needs a fundamental rethink and advocate a unified approach providing Analytics over Continuous and DisContinuous (ACDC) streams of data. Our approach is based on a simple insight – partially process independent runs of data and defer the consolidation of the associated partial results to when the results are actually used on an on demand basis. Not only does our approach provide the first real solution to the problem of data that arrives arbitrarily late, it also lets us solve a host of hard problems such as parallelism, recovery, transactional consistency and high availability that have been neglected by streaming systems. In this talk we describe the Truviso ACDC approach and outline some of the key technical arguments and insights behind it.

Speaker: Sailesh Krishnamurthy, Vice President of Technology and Founder, Truviso, Inc.

Dr. Sailesh Krishnamurthy, PhD is responsible for setting and driving the overall technical strategy and direction for the Truviso product and solution portfolio. In addition, he works in close collaboration with marketing, sales and engineering teams in managing the product and solution roadmap, performance engineering, and technology evangelism. Previously, he built and managed the initial engineering, services and support teams at Truviso. Sailesh is a leading authority in the field of enterprise data management with over a dozen published academic papers and several U.S. patents. Sailesh investigated the technical ideas at the heart of Truviso’s products as part of his doctoral research on stream query processing, earning a PhD. in Computer Science from UC Berkeley in 2006. Prior to graduate work at Berkeley, he worked at the Database Technology Institute at IBM Corporation where he designed and developed advanced features in IBM database products. Earlier, he worked on a Java virtual machine implementation at Netscape Communications. Sailesh has a Master’s degree in computer Science from Purdue University and a Bachelor’s degree in Electrical Engineering from the Birla Institute of Technology and Science in Pilani, India.

Thanks to Daisy Zhe Wang for organizing the DB seminar this semester.

Leave a comment

Filed under Uncategorized

“Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications”

This paper describes the design of Chord, one of the earlier and most popular academic DHT systems (according to one measure, the Chord paper is the most-cited CS paper from 2001, along with a raft of other DHT papers). Chord’s design focuses on simplicity and offering provable performance and correctness properties. Chord provides a single service: given a key, it returns the address of the node storing the key. Other practical issues, like replication, caching, naming, and cryptographic authentication should be provided by an application level that runs on top of Chord.

Basic Protocol

In Chord, nodes and keys are arranged in a circle (the “identifier circle“). Each node maintains the address of the successor of that node; maintaining the correctness of these successor pointers is the critical correctness property of the protocol. Each key k is owned by the first node whose identifier (e.g. hash of IP address) follows k in the identifier circle. Therefore, keys can be located by simple sequential search; the search terminates when it reaches the node that ought to own the key.

To improve search performance, Chord also maintains a finger table. The finger table is essentially a skip list: if there are N nodes in the DHT, each finger table has O(log N) entries, holding the address of a node 1/2, 1/4, 1/8, … of the way “forward” from the current node in the identifier circle. This essentially allows the search procedure to perform a binary search of the identifier circle, to quickly approach the “neighorhood” of the target key. Note that this means that each node maintains more information about its local neighborhood than about distant nodes, which makes sense.

Joining A Chord Group

To join a Chord group, a joining node begins by computing its predecessor and finger tables. It then joins the group by notifying its successor node, which includes the new node as its predecessor. Each node runs a “stabilization” protocol to maintain the correctness of the successor pointers: each node n periodically asks whether the predecessor of its successor node is n; if not, a new node has joined the group, so n sets its successor to the new intermediate node, and notifies that node to make n the node’s predecessor. Once the successor pointers are updated, the new node has joined the Chord group, and can be accessed by subsequent lookup queries. In the background, keys that are now owned by n will be transferred from n‘s successor, and finger tables will be updated lazily: updating fingers promptly is not actually important for good performance, because fingers are used only to quickly reach the right “neighborhood.”

Handling Node Failure

If a node fails, the data on the node may be lost. This is not Chord’s concern, because it delegates responsibility for replication to the application. Instead, Chord must ensure that the Chord ring is not “broken”: lookup queries can continue, despite the fact that the predecessor of the failed node has an invalid successor node.

Chord achieves that by having each node maintain a list of successors, rather than a single one. If a node notices that its successor has failed, it simply tries the next node in the list. The paper argues that since the failure probabilities of elements of the successor list can be considered to be independent, using a sufficiently large successor list provides arbitrary protection against node failures. The stabilization protocol above is extended to apply to the entire list of successors, not just the first one.


Overall, this is a great paper, and it deserves its reputation as such. It also shows the benefit of combining good systems work with a top-flight theoretician (Karger).

Despite the paper’s formal tone, I was confused by their talk of “correctness”: they never actually define what their correctness criteria are. Their notion of “correctness” is apparently that the Chord ring remains live and can continue answering queries, even if the query answers are incorrect or undetected ring partitions occur. For example, they observe that when a join occurs and before stabilization finishes,

the nodes in the affected region [may] have incorrect successor pointers, or keys may not yet have migrated to newly joined nodes, and the lookup may fail. The higher-layer software using Chord will notice that the desired data was not found, and has the option of retrying the lookup after a pause.

That is quite disconcerting, and would be an impediment to using Chord-like systems as a backend for a reliable storage service. Importantly, many applications would like to distinguish between “lookup failed because of stabilization problems” and “lookup failed because the key was definitely not in the Chord ring.” As presented, the protocol does not appear to allow this distinction.

Another troubling aspect of the paper is the author’s attitude toward network partitions, which they regard as a “pathological case.” I thought this was surprising, because (short-lived) network partitions are not that rare. In fact, one of the explicit benefits of loose-consistency systems like DHTs is their ability to tolerate network partitions gracefully. They sketch a few ideas on how to detect and heal partitions, but don’t address this problem in any real depth.

The paper doesn’t talk about an obvious optimization: making “distance” in the Chord ring correlate with distance within the underlying physical network, to reduce communication costs. This is explored in depth by subsequent work.

1 Comment

Filed under Paper Summaries