Tag Archives: design principles

“The Development of the Domain Name System”

The Development of the Domain Name System” describes the basic design of the DNS system, the motivations and design principles employed, and the lessons learned by the designers of the system.

DNS Design

DNS is a distributed, hierarchical database that maps domain names to values. The values associated with a domain name are a collection of resource records (RRs), each with a well-known type (e.g. host IP address). RRs also have a “class,” which specifies the “protocol family” of the RR (e.g. Internet). Given the dominance of IP, I’d expect classes to not be used in the modern DNS system.

DNS names are arranged into a variable-depth tree. The name of each node is given by concatenating the labels of the path from a node to the root, and separating the labels with periods. The tree is divided into “zones,” which are each controlled by a different organization. Each zone is a contiguous region of the tree, although a zone is typically a simple subtree. Administrative authority flows from the root of the tree to the leaves: the root zone controls the set of top-level domains (e.g. COM, NET), which in turn control the contents of their subtree. This allows individual organizations to administer portions of the tree autonomously.

DNS is composed of name servers, which host DNS data, and resolvers, which query the system to resolve domain names. Resolvers can either be individual client machines (e.g. part of libc), or a separate organization-wide resolver service (that might be combined with the organization’s name server). Resolvers can cache DNS lookups, to reduce resolution traffic (particularly on the servers that host the higher levels of the tree). Each DNS record has a “TTL” value that specifies how long it can be cached for. DNS resolvers typically use UDP, rather than TCP, which avoids the need for a TCP handshake before an address can be resolved.


Modern DNS is insecure, arcane, and overly complex. Why? In practice, it seems that the focus on “leanness” described in the paper hasn’t produced a simple and minimalistic system.

I wonder if the focus on extensibility in the DNS design is justified. In practice, DNS is a system for mapping host names to IP addresses, with some secondary functionality like maintaining various (often-inaccurate) administrative information about host names and the MX feature. While extensibility may have been important when the intended use of the system was unclear, it doesn’t seem to be a big win in the modern Internet — a simpler and more constrained design might be a better fit for modern DNS usage. Foregoing extensibility and constraining the core DNS functionality to be a key-value lookup service might have helped to simplify the system.

The lack of any consideration of security is notable, given DNS’s shabby security record.

The paper explains that DNS could have represented multiple resource records of the same type with a single multi-valued record. Their argument for using multiple records is:

The space efficiency of the single RR with multiple values was attractive, but the multiple RR option cut down the maximum RR size. This appeared to promise simpler dynamic update protocols, and also seemed suited to use in a limited-size datagram environment.

I think this is a backwards way to design systems, because it confuses physical layout decisions (e.g. space efficiency of storage) with logical data format decisions (e.g. whether to use multiple “rows” of the same type, or a single row with an array value).

The paper notes that in 1983, root name servers typically processed one query per second (although clients still observed poor response times, 500 milliseconds to 5 seconds in some cases). I couldn’t find much data on the query rates of modern root servers, but the website for the K root server has a graph of query rates which suggests the average load is 15,000-20,000 queries per second. Multiplied by 13 root servers, that is a significant query load (although note that the 13 “root servers” are hosted by far more than 13 physical machines, using techniques like anycast).

Related Reading

Paul Vixie’s article “What DNS Is Not” decries “innovators” who “misuse” the DNS system. He particularly dislikes using IP-based geolocation to use DNS to implement CDNs, for instance.

Leave a comment

Filed under Paper Summaries

Numbers Everyone Should Know

When you’re designing a performance-sensitive computer system, it is important to have an intuition for the relative costs of different operations. How much does a network I/O cost, compared to a disk I/O, a load from DRAM, or an L2 cache hit? How much computation does it make sense to trade for a reduction in I/O? What is the relative cost of random vs. sequential I/O? For a given workload, what is the bottleneck resource?

When designing a system, you rarely have enough time to completely build two alternative designs to compare their performance. This makes two skills useful:

  1. Back-of-the-envelope analysis. This essentially means developing an intuition for the performance of different alternate designs, so that you can reject possible designs out-of-hand, or choose which alternatives to consider more carefully.
  2. Microbenchmarking. If you can identify the bottleneck operation for a given resource, then you can construct a micro-benchmark that compares the performance of different implementations of that operation. This works in tandem with your intuition: the more microbenchmarking you do, the better your intuition for system performance becomes.

Jeff Dean makes similar points in his LADIS 2009 keynote (which I unfortunately wasn’t able to attend). In particular, he gives a useful table of “Numbers Everyone Should Know” — that is, the cost of some fundamental operations:

Operation Time (nsec)
L1 cache reference 0.5
Branch mispredict 5
L2 cache reference 7
Mutex lock/unlock 25
Main memory reference 100
Compress 1KB bytes with Zippy 3,000
Send 2K bytes over 1 Gbps network 20,000
Read 1MB sequentially from memory 250,000
Roundtrip within same datacenter 500,000
Disk seek 10,000,000
Read 1MB sequentially from disk 20,000,000
Send packet CA -> Netherlands -> CA 150,000,000

Some useful figures that aren’t in Dean’s data can be found in this article comparing NetBSD 2.0 and FreeBSD 5.3 from 2005. Approximating those figures, we get:

Operation Time (nsec)
System call overhead 400
Context switch between processes 3000
fork() (statically-linked binary) 70,000
fork() (dynamically-linked binary) 160,000

Update: This recent blog post examines the question of system call and context switch overhead in more detail. His figures suggest the best-case system call overhead is now only ~60 nsec (for Linux on Nehelem), and that context switches cost about 30 microseconds (30,000 nsec) — when you account for the cost of flushing CPU caches, that is probably pretty reasonable.

In comparison, John Ousterhout’s RAMCloud project aims to provide end-to-end roundtrips for a key-value store in the same datacenter within “5-10 microseconds,” which would represent about a 100x improvement over the 500 microsecond latency suggested above.

The keynote slides are worth a glance: Dean talks about the design of “Spanner”, a next-generation version of BigTable he is building at Google. See also James Hamilton’s notes on the keynote, and Greg Linden’s commentary.


Filed under Uncategorized

“Design Philosophy of the DARPA Internet Protocols”

This paper describes the priorities that motivated the design of the TCP/IP protocol stack, and how those priorities were reflected by the actual protocol specifications. The top priority was resilience to faults: “communication must continue despite loss of networks or gateways.” The second and third priorities were support for different networking services (e.g. TCP vs. UDP) and different types of underlying networks, respectively. In contrast, cost effectiveness and resource accounting were considered by the designers, but were deliberately of less importance. These priorities led to a design in which IP is the basic lingua franca, and provides a datagram-oriented unreliable delivery service. More sophisticated protocols can be layered on top of IP (such as reliable delivery and stream-oriented connections), perhaps the canonical example of the end-to-end argument. Similarly, because IP only require best-effort datagram delivery, it can be implemented on top of a wide variety of network technologies (Ethernet, wireless, satellite, etc.)

Using an unreliable datagram-oriented protocol like IP also simplifies fault tolerance, because it means that intermediate network hops do not need to store connection state. When a hop fails, another route to the destination can be chosen and the connection can be maintained, without needing to recover any state from the failed node. That is, all the state about a connection is maintained by the connection endpoints. The paper coins the term “fate-sharing” to describe this approach: if a connection endpoint fails, it is acceptable to lose the state associated with the connection itself. (Note that this design does not mean that the network core does not need to maintain any state at all — it just doesn’t need to track per-connection state.)

Maintaining connection state only at the endpoints does have some disadvantages. The paper discusses several:

  1. Intermediate network hops only see individual datagrams, not flows/connections. This makes it difficult for those nodes to do resource management and accounting: rather than making decisions about each connection as a whole, they must make resource management decisions about each packet in isolation.
  2. Pushing complexity to the network endpoints increases the difficulty of writing new implementations of the protocol stack, because it encourages “host-resident” algorithms. The end-to-end principle suggests that much of this complexity is difficult to avoid anyway, because both endpoints must be involved to implement important functionality.
  3. Host-resident algorithms might also damage the robustness of the network if an individual host misbehaves (intentionally or not).

Virtual circuits and ATM represent an interesting contrast with the design of IP. The use of a connection-oriented virtual circuit protocol like ATM for the public phone service is instructive:

  1. Because the protocol is connection-oriented, it is easier to implement resource management and accounting.
  2. Individual packets can be smaller: because each virtual circuit uses the same route, individual packets don’t need to be annotated with addressing/routing information.
  3. Network hops can be more efficient, because network routes are calculated on a per-connection basis, rather than routing each packet individually.

I thought the paper’s speculation about replacing datagrams with “flows” for a next-generation network architecture was interesting:

[a flow would] identify a sequence of packets traveling from the source to the destination, without assuming any particular type of service with that service … It would be necessary for the gateways to have flow state in order to remember the nature of the flows which are passing through them, but the state information would not be critical in maintaining the desired type of service associated with the flow.

I was confused by how this would be different than simply building routers which track per-flow state as well as they are able (AFAIK this is done today), but without changing the network protocol itself. It would be good to clarify this proposal in class.

The paper observes that by encouraging host-resident algorithms, a malicious or misconfigured host can harm the robustness of the network as a whole. This is a symptom of a larger problem: the Internet design generally assumes that system administrators can be trusted (witness the constant stream of TCP security vulnerabilities and DOS attacks). By pushing complexity to the communication endpoints and implementing a “dumb network”, the Internet design makes network policy and resource management decisions more difficult.


Filed under Paper Summaries

“End-to-End Arguments in System Design”

Aside: I’m taking CS268 this semester, a graduate class in computer networking at UC Berkeley. We need to post biweekly paper summaries of the papers we read on our blogs, the first of which is below.

The “end-to-end argument” is a classic design principle for building distributed computer systems. Given a piece of functionality (e.g. reliable transmission), how should the system designer decide whether to implement that functionality in the application, or in the lower-level layers of the system? The “end-to-end principle” argues that to correctly implement most application requirements, both communication end-points need to be involved. Since the application must be involved to allow the functionality to be implemented correctly, implementing a redundant version of the functionality in the communications subsystem is not required for correctness. Therefore, implementing functionality in a lower level of the system can only be justified if:

  1. All network clients require the functionality, and it can be correctly implemented at that layer, OR
  2. A low-level implementation offers better performance, and that is judged to be worth the overall performance and complexity costs

For a particular problem, the performance improvement offered by a low-level implementation might be justified. As such, the E2E argument doesn’t condemn any particular design — it just provides a set of tradeoffs that system designers should consider when deciding how to divide a system’s functionality into layers.

The authors observe that implementing functionality in the communications subsystem can be actively harmful, because doing so can impose a performance or complexity cost on other applications that don’t require the functionality (applications that may not even be conceived of when the low-level network is deployed). For example, the ordering/reliable delivery guarantees offered by TCP can delay packet delivery substantially, making it a poor fit for real-time voice and audio applications. This is reminiscent of the C++ design philosophy, “if you don’t use it, you don’t need to pay for it.”

This paper is important because it addresses a fundamental question in the design of layered systems, and because its argument is largely reflected by the Internet architecture. Following the E2E argument leads to systems of many layers; the lowest layers should provide simple, general-purpose communication primitives that can be shared by all network clients. The higher levels provide more convenient features for certain classes of applications. The designer of a new application is free to choose to use high-level protocols if they match his requirements and offer acceptable performance, or to build upon low-level interfaces to implement the necessary functionality himself.

Another perspective on the E2E argument is that by pushing complex functionality into the communications subsystem, the communications subsystem is essentially being optimized for a single application. The E2E principle suggests that this is typically unwise, if the communications subsystem is intended to be general-purpose and might be used with a wide range of applications, including some that aren’t even known during the initial design phase. In contrast, the E2E argument is less relevant if the communications subsystem is being designed only for use by a closed set of applications. When building a system that is specialized for a particular application, pushing high-level functionality into low-level system components is easier to justify, because it can often offer enormous performance improvements.

For example, consider Kickfire, a startup company which builds specialized hardware “appliances” for analytic DB workloads. By pushing predicate evaluation down to specialized hardware components that run close to the I/O devices, a massive improvement in scan performance can be achieved — but this essentially specializes the I/O layer for a particular database application. This isn’t necessarily a violation of the E2E argument, because the system isn’t intended to be general-purpose and because the performance advantage it offers is presumably justified.

Related papers

A more recent comment (1998) by the authors discusses the application of the E2E argument to active networks.

An interesting application of the end-to-end argument can be found in “Understanding the Limitations of Causally and Totally Ordered Communication” by Cheriton and Skeen. They essentially argue that group communication systems violate the end-to-end argument. Ken Birman’s response to the Cheriton and Skeen paper is also worth reading.


Filed under Paper Summaries