When you’re designing a performance-sensitive computer system, it is important to have an intuition for the relative costs of different operations. How much does a network I/O cost, compared to a disk I/O, a load from DRAM, or an L2 cache hit? How much computation does it make sense to trade for a reduction in I/O? What is the relative cost of random vs. sequential I/O? For a given workload, what is the bottleneck resource?
When designing a system, you rarely have enough time to completely build two alternative designs to compare their performance. This makes two skills useful:
- Back-of-the-envelope analysis. This essentially means developing an intuition for the performance of different alternate designs, so that you can reject possible designs out-of-hand, or choose which alternatives to consider more carefully.
- Microbenchmarking. If you can identify the bottleneck operation for a given resource, then you can construct a micro-benchmark that compares the performance of different implementations of that operation. This works in tandem with your intuition: the more microbenchmarking you do, the better your intuition for system performance becomes.
Jeff Dean makes similar points in his LADIS 2009 keynote (which I unfortunately wasn’t able to attend). In particular, he gives a useful table of “Numbers Everyone Should Know” — that is, the cost of some fundamental operations:
| Operation | Time (nsec) |
| L1 cache reference | 0.5 |
| Branch mispredict | 5 |
| L2 cache reference | 7 |
| Mutex lock/unlock | 25 |
| Main memory reference | 100 |
| Compress 1KB bytes with Zippy | 3,000 |
| Send 2K bytes over 1 Gbps network | 20,000 |
| Read 1MB sequentially from memory | 250,000 |
| Roundtrip within same datacenter | 500,000 |
| Disk seek | 10,000,000 |
| Read 1MB sequentially from disk | 20,000,000 |
| Send packet CA -> Netherlands -> CA | 150,000,000 |
Some useful figures that aren’t in Dean’s data can be found in this article comparing NetBSD 2.0 and FreeBSD 5.3 from 2005. Approximating those figures, we get:
| Operation | Time (nsec) |
| System call overhead | 400 |
| Context switch between processes | 3000 |
| fork() (statically-linked binary) | 70,000 |
| fork() (dynamically-linked binary) | 160,000 |
Update: This recent blog post examines the question of system call and context switch overhead in more detail. His figures suggest the best-case system call overhead is now only ~60 nsec (for Linux on Nehelem), and that context switches cost about 30 microseconds (30,000 nsec) — when you account for the cost of flushing CPU caches, that is probably pretty reasonable.
In comparison, John Ousterhout’s RAMCloud project aims to provide end-to-end roundtrips for a key-value store in the same datacenter within “5-10 microseconds,” which would represent about a 100x improvement over the 500 microsecond latency suggested above.
The keynote slides are worth a glance: Dean talks about the design of “Spanner”, a next-generation version of BigTable he is building at Google. See also James Hamilton’s notes on the keynote, and Greg Linden’s commentary.
Pingback: jardenberg kommenterar – 2009-10-18 — jardenberg unedited
Pingback: Numbers Everyone Should Know | Utah Hadoop User Group
Reblogged this on Anand Jeyahar's blah..blah..radio static…..
Pingback: The Real Interview Questions- Hiring at Sumo Logic | Sumo Logic
Pingback: How to Rock a Systems Design Interview | yizhaoee