Trends in Database Systems Research: Column-Stores
Column-store databases are one of the more interesting areas of innovation in recent database systems research. While column-stores have been around for decades, research in the area has recently been revived by Mike Stonebraker and others as part of the C-Store project, and it is worth a closer look.
What is a column-store?
The basic premise of this work is simple. Databases that store data by column (with each attribute's values written contiguously on disk) can service read queries much faster than more traditional row-store databases (with each record written contiguously on disk). Attributes not referenced by a query are never read at all, rather than being read and then skipped over, and the data compresses well, because techniques such as run-length encoding work far more effectively over a single attribute (where neighboring entries are similar) than over a row (where they are distinct). Both features reduce the disk bandwidth required to execute a query, easing a potentially large bottleneck.
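To make the compression argument concrete, here is a minimal Python sketch. The table contents and the names rle_encode, rows, columns and row_layout are invented for illustration and are not taken from C-Store. It run-length encodes the same data laid out by column and by row.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# A tiny hypothetical table: (country, status, quantity).
rows = [
    ("UK", "shipped", 10),
    ("UK", "shipped", 12),
    ("UK", "pending", 7),
    ("US", "pending", 7),
]

# Column layout: each attribute's values are stored contiguously.
columns = {
    "country": [r[0] for r in rows],
    "status": [r[1] for r in rows],
    "quantity": [r[2] for r in rows],
}

# Row layout: each record's values are stored contiguously,
# so values of different attributes are interleaved.
row_layout = [value for r in rows for value in r]

country_runs = rle_encode(columns["country"])
row_runs = rle_encode(row_layout)

print(country_runs)   # [('UK', 3), ('US', 1)]
print(len(row_runs))  # 12: no two adjacent values match, so nothing collapses
```

A query that only needs quantity would read columns["quantity"] and never touch the other two lists, which is the other half of the bandwidth saving described above.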
Traditionally, the problem with this approach has been slow updates: the design that makes reading from the database extremely fast has the opposite effect on writing. C-Store solves this by creating two stores: a large read-optimized store and a smaller writeable store. Updates are sent to the smaller store, then moved in bulk to the larger one later. This works because C-Store is targeted at the data warehousing market, where workloads are read-mostly and updates are infrequent. Specialization is key.
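The two-store idea can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions, not C-Store's actual implementation: the class TwoTierStore and its methods insert, merge and scan are hypothetical names, and the sketch ignores sorting, compression, deletes and concurrency.

```python
class TwoTierStore:
    """Toy model: a columnar read store plus a small row-wise write store."""

    def __init__(self, column_names):
        self.column_names = list(column_names)
        # Read-optimized store: one contiguous list per attribute.
        self.read_store = {name: [] for name in self.column_names}
        # Writeable store: whole rows appended cheaply as they arrive.
        self.write_store = []

    def insert(self, row):
        """Accept an update by appending it to the small write store."""
        if len(row) != len(self.column_names):
            raise ValueError("row does not match the table's columns")
        self.write_store.append(tuple(row))

    def merge(self):
        """Bulk-move all buffered rows into the columnar read store."""
        for row in self.write_store:
            for name, value in zip(self.column_names, row):
                self.read_store[name].append(value)
        self.write_store.clear()

    def scan(self, name):
        """Read one attribute, covering both merged and buffered rows."""
        index = self.column_names.index(name)
        buffered = [row[index] for row in self.write_store]
        return self.read_store[name] + buffered


store = TwoTierStore(["country", "quantity"])
store.insert(("UK", 10))
store.insert(("US", 7))

before_merge = store.scan("quantity")  # answered partly from the write store
store.merge()
after_merge = store.scan("quantity")   # answered entirely from the read store
```

Queries see the same result before and after the merge. The only thing that changes is where the data lives, which is why the bulk move can be deferred to a quiet period.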
If you read one paper from the area, make it C-Store: A Column-oriented DBMS.
Commercial Rivalry
Not surprisingly, given the promise of this work, database vendors are taking note. C-Store itself has spawned a commercial version, Vertica.
Perhaps as a result we may see fewer academic papers on the subject, but thankfully a number of the parties involved have created blogs which provide useful insights into the current focus of their work.
The people behind Vertica (and thus C-Store) maintain a blog named The Database Column, which ostensibly promotes the benefits of column-stores but backs this up with a lot of substantive work and evaluation.
Daniel Abadi, yet another C-Store member, has recently created his own blog, which oddly seems to have a slightly more commercial slant than the Vertica blog mentioned above. Again, if you have any interest in this area, his posts are worth reading.
More generally, Curt Monash’s DBMS2 blog provides an interesting account of the latest happenings in the commercial database world.