I was talking to a database researcher recently about why the artificial intelligence community and the database community haven’t historically seen eye-to-eye. The researcher’s opinion was that AI folks tend to regard databases as hopelessly limited in their expressive power, whereas DB folks tend to view AI data models as hopelessly difficult to implement efficiently. There is probably some truth to both views.
I was reminded of this when doing some reading about data management techniques for RDF (the proposed data model for the Semantic Web). Abadi et al.’s “Scalable Semantic Web Data Management Using Vertical Partitioning” is a nice paper from VLDB 2007, and appears to be one of a relatively small group of papers that approach the Semantic Web from a database systems perspective. The paper proposes a new model for storing RDF data, which essentially applies the column-store ideas from the C-Store and Vertica projects. Sam Madden and Daniel Abadi talk about their ideas more in a blog entry at The Database Column.
Planet PostgreSQL readers might be interested in this observation in the paper:
We chose Postgres as the row-store to experiment with because Beckmann et al. experimentally showed that it was by far more efficient dealing with sparse data than commercial database products. Postgres does not waste space storing NULL data: every tuple is preceded by a bit-string of cardinality equal to the number of attributes, with ‘1’s at positions of the non-NULL values in the tuple. NULL data is thus not stored; this is unlike commercial products that waste space on NULL data. Beckmann et al. show that Postgres queries over sparse data operate about eight times faster than commercial systems
(A minor nitpick: Postgres will omit the per-tuple NULL bitmap when none of the attributes of a tuple are NULL, so it is not quite true that “every tuple is preceded by a bit-string”.)
The cited Beckman et al. paper is “Extending RDBMSs To Support Sparse Datasets Using An Interpreted Attribute Storage Format“.
It’s interesting that none of the leading commercial systems seem to use exactly the same NULL bitmap approach that Postgres does. The tradeoff appears to be of storage against computation time: eliding the NULL values from the on-disk tuple reduces storage requirements, but makes it more expensive to find the offset within a tuple at which an attribute begins, if the attribute is preceded by one or more (elided) NULL values. If NULL values were stored in the on-disk tuple (and no variable-width attributes are used), the offset of an attribute can be found more efficiently.
In practice, Postgres implements another optimization that mitigates this problem to some extent: as tuples are passed around the executor and attributes are “extracted” from the on-disk tuple representation, they are effectively cached using the TupleTableSlot mechanism. This means that the computation to find the right offset for an attribute in the presence of NULLs is typically only done at most once per attribute of a tuple.