Tag Archives: system design

“The Development of the Domain Name System”

The Development of the Domain Name System” describes the basic design of the DNS system, the motivations and design principles employed, and the lessons learned by the designers of the system.

DNS Design

DNS is a distributed, hierarchical database that maps domain names to values. The values associated with a domain name are a collection of resource records (RRs), each with a well-known type (e.g. host IP address). RRs also have a “class,” which specifies the “protocol family” of the RR (e.g. Internet). Given the dominance of IP, I’d expect classes to not be used in the modern DNS system.

DNS names are arranged into a variable-depth tree. The name of each node is given by concatenating the labels of the path from a node to the root, and separating the labels with periods. The tree is divided into “zones,” which are each controlled by a different organization. Each zone is a contiguous region of the tree, although a zone is typically a simple subtree. Administrative authority flows from the root of the tree to the leaves: the root zone controls the set of top-level domains (e.g. COM, NET), which in turn control the contents of their subtree. This allows individual organizations to administer portions of the tree autonomously.

DNS is composed of name servers, which host DNS data, and resolvers, which query the system to resolve domain names. Resolvers can either be individual client machines (e.g. part of libc), or a separate organization-wide resolver service (that might be combined with the organization’s name server). Resolvers can cache DNS lookups, to reduce resolution traffic (particularly on the servers that host the higher levels of the tree). Each DNS record has a “TTL” value that specifies how long it can be cached for. DNS resolvers typically use UDP, rather than TCP, which avoids the need for a TCP handshake before an address can be resolved.


Modern DNS is insecure, arcane, and overly complex. Why? In practice, it seems that the focus on “leanness” described in the paper hasn’t produced a simple and minimalistic system.

I wonder if the focus on extensibility in the DNS design is justified. In practice, DNS is a system for mapping host names to IP addresses, with some secondary functionality like maintaining various (often-inaccurate) administrative information about host names and the MX feature. While extensibility may have been important when the intended use of the system was unclear, it doesn’t seem to be a big win in the modern Internet — a simpler and more constrained design might be a better fit for modern DNS usage. Foregoing extensibility and constraining the core DNS functionality to be a key-value lookup service might have helped to simplify the system.

The lack of any consideration of security is notable, given DNS’s shabby security record.

The paper explains that DNS could have represented multiple resource records of the same type with a single multi-valued record. Their argument for using multiple records is:

The space efficiency of the single RR with multiple values was attractive, but the multiple RR option cut down the maximum RR size. This appeared to promise simpler dynamic update protocols, and also seemed suited to use in a limited-size datagram environment.

I think this is a backwards way to design systems, because it confuses physical layout decisions (e.g. space efficiency of storage) with logical data format decisions (e.g. whether to use multiple “rows” of the same type, or a single row with an array value).

The paper notes that in 1983, root name servers typically processed one query per second (although clients still observed poor response times, 500 milliseconds to 5 seconds in some cases). I couldn’t find much data on the query rates of modern root servers, but the website for the K root server has a graph of query rates which suggests the average load is 15,000-20,000 queries per second. Multiplied by 13 root servers, that is a significant query load (although note that the 13 “root servers” are hosted by far more than 13 physical machines, using techniques like anycast).

Related Reading

Paul Vixie’s article “What DNS Is Not” decries “innovators” who “misuse” the DNS system. He particularly dislikes using IP-based geolocation to use DNS to implement CDNs, for instance.

Leave a comment

Filed under Paper Summaries

Numbers Everyone Should Know

When you’re designing a performance-sensitive computer system, it is important to have an intuition for the relative costs of different operations. How much does a network I/O cost, compared to a disk I/O, a load from DRAM, or an L2 cache hit? How much computation does it make sense to trade for a reduction in I/O? What is the relative cost of random vs. sequential I/O? For a given workload, what is the bottleneck resource?

When designing a system, you rarely have enough time to completely build two alternative designs to compare their performance. This makes two skills useful:

  1. Back-of-the-envelope analysis. This essentially means developing an intuition for the performance of different alternate designs, so that you can reject possible designs out-of-hand, or choose which alternatives to consider more carefully.
  2. Microbenchmarking. If you can identify the bottleneck operation for a given resource, then you can construct a micro-benchmark that compares the performance of different implementations of that operation. This works in tandem with your intuition: the more microbenchmarking you do, the better your intuition for system performance becomes.

Jeff Dean makes similar points in his LADIS 2009 keynote (which I unfortunately wasn’t able to attend). In particular, he gives a useful table of “Numbers Everyone Should Know” — that is, the cost of some fundamental operations:

Operation Time (nsec)
L1 cache reference 0.5
Branch mispredict 5
L2 cache reference 7
Mutex lock/unlock 25
Main memory reference 100
Compress 1KB bytes with Zippy 3,000
Send 2K bytes over 1 Gbps network 20,000
Read 1MB sequentially from memory 250,000
Roundtrip within same datacenter 500,000
Disk seek 10,000,000
Read 1MB sequentially from disk 20,000,000
Send packet CA -> Netherlands -> CA 150,000,000

Some useful figures that aren’t in Dean’s data can be found in this article comparing NetBSD 2.0 and FreeBSD 5.3 from 2005. Approximating those figures, we get:

Operation Time (nsec)
System call overhead 400
Context switch between processes 3000
fork() (statically-linked binary) 70,000
fork() (dynamically-linked binary) 160,000

Update: This recent blog post examines the question of system call and context switch overhead in more detail. His figures suggest the best-case system call overhead is now only ~60 nsec (for Linux on Nehelem), and that context switches cost about 30 microseconds (30,000 nsec) — when you account for the cost of flushing CPU caches, that is probably pretty reasonable.

In comparison, John Ousterhout’s RAMCloud project aims to provide end-to-end roundtrips for a key-value store in the same datacenter within “5-10 microseconds,” which would represent about a 100x improvement over the 500 microsecond latency suggested above.

The keynote slides are worth a glance: Dean talks about the design of “Spanner”, a next-generation version of BigTable he is building at Google. See also James Hamilton’s notes on the keynote, and Greg Linden’s commentary.


Filed under Uncategorized