November 25, 2006

Database Architecture

Revolution In Database Architecture
is a nice short
paper that outlines Jim Gray’s thoughts on future
directions for database systems. A few samples:

Random access is a hundred times slower than sequential.
These changing ratios require new algorithms that
intelligently use multi-processors sharing a massive main
memory, and intelligently use precious disk bandwidth. The
database engines need to overhaul their algorithms to deal
with the fact that main memories are huge (billions of
pages, trillions of bytes). The era of main-memory
databases has finally arrived.

Cost-based static-plan optimizers continue to be the
mainstay for simple queries that run in seconds. But, for
complex queries, the query optimizer must adapt to current
workloads, must adapt to data skew and statistics, and must
plan in a much more dynamic way – changing plans as the
system load and data statistics change. For petabyte-scale
databases it seems the only solution is to run continuous
data scans and let queries piggyback on the scans. Teradata
pioneered that mechanism, and it is likely to become more
common in the future.
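The piggybacking idea can be sketched in a few lines: one continuous sequential scan delivers each row to every query currently attached, so N concurrent queries cost a single pass over the data. A minimal sketch, with an in-memory list standing in for the table and predicates standing in for queries (this is an illustration of the idea, not Teradata's actual mechanism):

```python
# Sketch: many queries piggyback on one shared sequential scan.
# Each query registers a predicate; one pass over the data serves all.

class SharedScanner:
    def __init__(self, table):
        self.table = table      # list of rows (dicts), standing in for disk pages
        self.queries = []       # (predicate, results) pairs attached to the scan

    def register(self, predicate):
        # Attach a query to the ongoing scan; it will see every row.
        results = []
        self.queries.append((predicate, results))
        return results

    def scan_once(self):
        # One sequential pass; every attached query is evaluated per row,
        # so adding queries does not add passes over the data.
        for row in self.table:
            for predicate, results in self.queries:
                if predicate(row):
                    results.append(row)

table = [{"id": i, "val": i * 10} for i in range(100)]
scanner = SharedScanner(table)
small = scanner.register(lambda r: r["id"] < 3)
big = scanner.register(lambda r: r["val"] >= 950)
scanner.scan_once()  # both queries answered with a single pass
```

The real mechanism is of course far more involved (queries arrive mid-scan and wrap around), but the bandwidth argument is the same: scan cost is amortized across all concurrent queries.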

The database community has found a very elegant way to
embrace and extend machine learning technology like
clustering, decision trees, Bayes nets, neural nets, time
series analysis, etc… The key idea is to create a
learning table T, telling the system to learn

columns x, y, z from attributes a, b, c
(or to cluster attributes a, b, c, or to treat
a as the time stamp for b.) Then one
inserts training data into the table T, and the
data mining algorithm builds a decision tree or Bayes net or
time series model for the data …. After the training
phase, the table T can be used to generate
synthetic data; given a key a,b,c it can return the
likely x,y,z values of that key along with the
probabilities. Equivalently, T can evaluate the
probability that some value is correct. The neat thing
about this is that the framework allows you to plug in
your own machine-learning algorithms. This gives the
machine-learning community a vehicle to make their
technology accessible to a broad user base.
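The learning-table interface described above can be mocked up in a few lines. A minimal sketch, where per-key frequency counts stand in for a real model (decision tree, Bayes net, etc.) and the class and method names are invented for illustration:

```python
from collections import Counter, defaultdict

# Sketch of a "learning table": insert training rows keyed on attributes,
# then ask for the likely value of a key along with a probability.
# Frequency counting stands in for a real learned model.

class LearningTable:
    def __init__(self):
        self.counts = defaultdict(Counter)  # key -> Counter of observed values

    def insert(self, key, x):
        # "Training phase": accumulate evidence for this key.
        self.counts[key][x] += 1

    def predict(self, key):
        # Return the most likely value and its estimated probability.
        c = self.counts[key]
        x, n = c.most_common(1)[0]
        return x, n / sum(c.values())

    def probability(self, key, x):
        # Evaluate how plausible a given value is for this key.
        c = self.counts[key]
        return c[x] / sum(c.values()) if c else 0.0

t = LearningTable()
for x in ["red", "red", "red", "blue"]:
    t.insert(("a1", "b1"), x)
```

Here `t.predict(("a1", "b1"))` returns `("red", 0.75)`. The appeal of the design is exactly that `LearningTable` could be swapped for any model exposing the same insert/predict interface.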

I’m not convinced that such a model is really the right way
to integrate data mining into the DBMS, but it’s an
interesting (and easily implemented) idea. You could build
such a system using Postgres’ extensibility features pretty
easily, although backend modifications would probably be
necessary for proper integration. PL/R would definitely be a
useful tool.

Real Life

Sorry for not posting the solution to the math problem I
blogged about a few months (!) ago — I’ll get around to
that shortly.


Filed under Advogato, PostgreSQL
