Scripts for writing papers

One of the nice things about LaTeX is that the definitive version of the document is stored as plain text. This simplifies version control and avoids the need to use a complicated editor with an opaque binary file format. It also means that writing scripts to check for writing mistakes is relatively easy. Of course, detecting grammatical errors in general is very challenging, but certain classes of mistakes are easy to check for. Matt Might has posted a few scripts which I’ve found quite useful.

I had cause to write another script to check for consistent use of hyphens. For example, both “non-deterministic” and “nondeterministic” are acceptable, but a single document should pick one variant and use it consistently. You can find the script here; suggestions for improvement (or better, patches) are welcome.
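
The core check is simple enough to sketch in a few lines of Python. This is a minimal illustration of the idea, not the linked script: normalize every word by deleting its hyphens, then report any normalized form that appears with more than one spelling.

    # Minimal sketch of a hyphen-consistency check (an illustration of
    # the idea, not the linked script).
    import re
    import sys
    from collections import defaultdict

    def check_hyphens(text):
        # Group each word under its hyphen-free form; a group with more
        # than one spelling indicates inconsistent hyphenation.
        variants = defaultdict(set)
        for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text.lower()):
            variants[word.replace("-", "")].add(word)
        return {k: v for k, v in variants.items() if len(v) > 1}

    if __name__ == "__main__":
        text = open(sys.argv[1]).read()
        for base, forms in sorted(check_hyphens(text).items()):
            print("%s: %s" % (base, ", ".join(sorted(forms))))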

What other common mistakes are amenable to a simple script? Checking for consistent use of the Oxford comma should be straightforward, for example. There are also scripts like chktex that look for mistakes in LaTeX usage (although I’ve personally found chktex to be too noisy to be useful).
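
For what it’s worth, the Oxford comma check nearly fits in two regular expressions. The patterns below are a heuristic sketch rather than a parser; a comma before “and” that joins two independent clauses will show up as a false positive.

    # Heuristic Oxford-comma consistency check (a sketch, not a
    # parser; clause-joining commas produce false positives).
    import re
    import sys

    OXFORD = re.compile(r"\b\w+,\s+\w+,\s+(?:and|or)\s+\w+")
    NO_OXFORD = re.compile(r"\b\w+,\s+\w+\s+(?:and|or)\s+\w+")

    text = open(sys.argv[1]).read()
    oxford, plain = OXFORD.findall(text), NO_OXFORD.findall(text)
    if oxford and plain:
        print("Inconsistent serial comma usage:")
        for match in oxford:
            print("  with Oxford comma:    %s" % match)
        for match in plain:
            print("  without Oxford comma: %s" % match)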

Naturally, there are limits on what you can do without parsing natural language. After The Deadline claims to provide some NLP-based grammar analysis, but I haven’t used it personally.

Resources

  • atdtool, a command-line interface to After The Deadline
  • diction, a GNU tool to check for common writing mistakes
  • LanguageTool, another open source style and grammar checker
  • TextLint, a writing style checker


Berkeley DB Lunch: Sailesh Krishnamurthy, Truviso

The Berkeley DB group holds a weekly lunch lecture series in the fall semester. This week’s speaker is Sailesh Krishnamurthy, one of the co-founders of Truviso (and a 2006 alum of the Berkeley DB group). I had the good fortune to work with Sailesh at Truviso, so I’m sure this will be a terrific talk. The abstract is below. This work on intelligently handling out-of-order data, and on generalizing that to “corrections” of all kinds, was only beginning when I was finishing up at Truviso, but it seems really interesting.

Time: 1-2PM, November 6, 2009 (lunch starts at 12:30, the talk starts at 1)

Location: 606 Soda Hall, UC Berkeley

Title: ACDC – Analytics over Continuous and DisContinuous Streams

ABSTRACT

Streaming continuous analytics systems have emerged as key solutions for dealing with massive data volumes and demands for low latency. These systems have been heavily influenced by an assumption that data streams can be viewed as sequences of ordered data. The reality, however, is that streams are not continuous, and disruptions of various sorts, in the form of either big chunks of late-arriving data or arbitrary failures, are endemic. We argue, therefore, that stream processing needs a fundamental rethink and advocate a unified approach providing Analytics over Continuous and DisContinuous (ACDC) streams of data. Our approach is based on a simple insight: partially process independent runs of data and defer the consolidation of the associated partial results to when the results are actually used, on an on-demand basis. Not only does our approach provide the first real solution to the problem of data that arrives arbitrarily late, it also lets us solve a host of hard problems such as parallelism, recovery, transactional consistency and high availability that have been neglected by streaming systems. In this talk we describe the Truviso ACDC approach and outline some of the key technical arguments and insights behind it.
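
To make the “partial results” idea concrete, here is a toy sketch of my reading of the abstract (not Truviso’s actual implementation): each independent run of data is reduced to a partial aggregate as it arrives, and consolidation across runs is deferred until a query actually asks for a result.

    # Toy sketch of "process runs independently, consolidate on
    # demand" (my reading of the abstract, not Truviso's design).
    # Late or out-of-order data simply becomes another run; no
    # existing partial result ever needs to be revised in place.
    from collections import defaultdict

    class PartialCounts:
        def __init__(self):
            self.partials = []             # one partial aggregate per run

        def process_run(self, events):
            # Reduce one independent run of (key, value) pairs.
            partial = defaultdict(int)
            for key, value in events:
                partial[key] += value
            self.partials.append(partial)  # defer merging to query time

        def query(self):
            # Consolidate all partial results on demand.
            merged = defaultdict(int)
            for partial in self.partials:
                for key, value in partial.items():
                    merged[key] += value
            return dict(merged)

    counts = PartialCounts()
    counts.process_run([("a", 1), ("b", 2)])  # live data
    counts.process_run([("a", 5)])            # late-arriving chunk
    print(counts.query())                     # {'a': 6, 'b': 2}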

Speaker: Sailesh Krishnamurthy, Vice President of Technology and Founder, Truviso, Inc.

Sailesh Krishnamurthy is responsible for setting and driving the overall technical strategy and direction for the Truviso product and solution portfolio. In addition, he works in close collaboration with the marketing, sales, and engineering teams in managing the product and solution roadmap, performance engineering, and technology evangelism. Previously, he built and managed the initial engineering, services, and support teams at Truviso. Sailesh is a leading authority in the field of enterprise data management, with over a dozen published academic papers and several U.S. patents. He investigated the technical ideas at the heart of Truviso’s products as part of his doctoral research on stream query processing, earning a PhD in Computer Science from UC Berkeley in 2006. Prior to his graduate work at Berkeley, he worked at the Database Technology Institute at IBM Corporation, where he designed and developed advanced features in IBM database products. Earlier, he worked on a Java virtual machine implementation at Netscape Communications. Sailesh holds a Master’s degree in Computer Science from Purdue University and a Bachelor’s degree in Electrical Engineering from the Birla Institute of Technology and Science in Pilani, India.

Thanks to Daisy Zhe Wang for organizing the DB seminar this semester.


Numbers Everyone Should Know

When you’re designing a performance-sensitive computer system, it is important to have an intuition for the relative costs of different operations. How much does a network I/O cost, compared to a disk I/O, a load from DRAM, or an L2 cache hit? How much computation does it make sense to trade for a reduction in I/O? What is the relative cost of random vs. sequential I/O? For a given workload, what is the bottleneck resource?

When designing a system, you rarely have enough time to completely build two alternative designs to compare their performance. This makes two skills useful:

  1. Back-of-the-envelope analysis. This means developing an intuition for the performance of alternative designs, so that you can reject some candidates out of hand and choose which ones deserve closer study.
  2. Microbenchmarking. If you can identify the bottleneck operation for a given resource, you can construct a microbenchmark that compares the performance of different implementations of that operation. This works in tandem with your intuition: the more microbenchmarking you do, the better your intuition for system performance becomes. (A sketch of a simple microbenchmark follows this list.)
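
As an example of the second skill, here is about the simplest useful microbenchmark harness I can write. The sequential-vs-random access comparison is my own choice of example operation, and since Python’s interpreter overhead dominates, only the relative difference between the two variants is meaningful.

    # A minimal microbenchmark harness (a sketch; Python interpreter
    # overhead dominates, so treat the results as relative numbers).
    import random
    import timeit

    N = 1000000
    data = list(range(N))
    seq_idx = list(range(N))
    rand_idx = list(range(N))
    random.shuffle(rand_idx)

    def sequential():
        total = 0
        for i in seq_idx:
            total += data[i]
        return total

    def random_access():
        total = 0
        for i in rand_idx:
            total += data[i]
        return total

    # Repeat each measurement and keep the minimum: the standard way
    # to reduce noise from other processes on the machine.
    for fn in (sequential, random_access):
        best = min(timeit.repeat(fn, number=1, repeat=5))
        print("%s: %.3f sec" % (fn.__name__, best))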

Jeff Dean makes similar points in his LADIS 2009 keynote (which I unfortunately wasn’t able to attend). In particular, he gives a useful table of “Numbers Everyone Should Know” — that is, the cost of some fundamental operations:

Operation                              Time (nsec)
L1 cache reference                             0.5
Branch mispredict                                5
L2 cache reference                               7
Mutex lock/unlock                               25
Main memory reference                          100
Compress 1K bytes with Zippy                 3,000
Send 2K bytes over 1 Gbps network           20,000
Read 1 MB sequentially from memory         250,000
Round trip within same datacenter          500,000
Disk seek                               10,000,000
Read 1 MB sequentially from disk        20,000,000
Send packet CA -> Netherlands -> CA    150,000,000
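
These numbers make back-of-the-envelope analysis mechanical. As a trivial worked example (my own arithmetic, using only the table above):

    # Back-of-the-envelope arithmetic from the table above: reading
    # 1 GB sequentially from memory vs. disk, and the random-seek
    # budget of a single disk.
    read_1mb_mem_ns = 250000      # read 1 MB sequentially from memory
    read_1mb_disk_ns = 20000000   # read 1 MB sequentially from disk
    disk_seek_ns = 10000000       # one disk seek

    mb_per_gb = 1024
    print("1 GB from memory: %.2f sec" % (mb_per_gb * read_1mb_mem_ns / 1e9))
    print("1 GB from disk:   %.1f sec" % (mb_per_gb * read_1mb_disk_ns / 1e9))
    print("random seeks/sec: %d" % (1e9 / disk_seek_ns))

The last line is the punchline: a single disk delivers only about 100 random I/Os per second, which is exactly the sort of figure that lets you reject a seek-heavy design out of hand.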

Some useful figures that aren’t in Dean’s data can be found in this article comparing NetBSD 2.0 and FreeBSD 5.3 from 2005. Approximating those figures, we get:

Operation                              Time (nsec)
System call overhead                           400
Context switch between processes             3,000
fork() (statically-linked binary)           70,000
fork() (dynamically-linked binary)         160,000

Update: This recent blog post examines system call and context switch overhead in more detail. Its figures suggest that the best-case system call overhead is now only ~60 nsec (for Linux on Nehalem), and that a context switch costs about 30 microseconds (30,000 nsec); once you account for the cost of flushing CPU caches, that figure is probably pretty reasonable.
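
Measuring the system call figure yourself makes a nice first microbenchmark. Here is a sketch that times os.getpid() against an empty loop; Python’s call overhead inflates the result, so it only bounds the true cost from above, and some libc versions cache getpid(), which would hide the system call entirely.

    # Sketch: bound system call overhead from above by timing
    # os.getpid() against an empty loop. Python call overhead inflates
    # the absolute number, and some libc versions cache getpid(),
    # which would hide the system call entirely.
    import os
    import time

    N = 1000000

    def measure(fn):
        start = time.perf_counter()
        for _ in range(N):
            fn()
        return time.perf_counter() - start

    def nop():
        pass

    delta = measure(os.getpid) - measure(nop)
    print("<= %.0f nsec per getpid()" % (delta / N * 1e9))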

In comparison, John Ousterhout’s RAMCloud project aims to provide end-to-end roundtrips for a key-value store in the same datacenter within “5-10 microseconds,” which would be roughly a 50-100x improvement over the 500-microsecond round trip suggested above.

The keynote slides are worth a glance: Dean talks about the design of “Spanner”, a next-generation version of BigTable he is building at Google. See also James Hamilton’s notes on the keynote, and Greg Linden’s commentary.


New Blog

Welcome to my new blog! I plan to talk about my research, interesting papers that I come across, and the data management and systems research communities in general.

I was previously blogging on Advogato (mirrored at PlanetPostgreSQL), mostly talking about my work in the Postgres community. Unfortunately, the demands of grad school have meant that I’m not able to do any meaningful work on PostgreSQL right now (or for the foreseeable future). Hence, a new blog, and hopefully a new audience.
