Thinking About Cassandra


Jonathan Ellis

This week on the Technometria podcast, Scott and I interview Jonathan Ellis about the Cassandra Project. Cassandra is an open source distributed database management system used by Facebook, Twitter, and other sites. Cassandra is a distributed database that is designed for extreme scalability.

Cassandra is one of the so-called "NOSQL" databases. That's something of a misnomer, because its not specifically SQL that they're lacking--although they are that--but relations. Like Amazon's Dynamo and Google's Bigtable (from which it draws its founding ideas), Cassandra is designed to solve the problems that many modern Web applications have for storing data. That's not to say that relational databases aren't useful, but they come with certain problems like the fact that they're difficult to scale. If all you need is key-value storage, then you're suffering that pain for functionality you're not even using.

At Kynetx, for example, our core engine needs to be able to grab persistant data about a ruleset evalation regardless of which machine it's running on to avoid the need to pin sessions to particular machines--something made more difficult by the fact that interactions may stretch over days. We don't need a relational store--we can get by just fine with key-value storage for this application.

Right now we're using memcachedb (note that that's not memcached). Memcachedb is very fast and uses the memcache API, so it's easy to talk to from just about any language. But the fact that it's based on BerkeleyDB means that it doesn't scale particularly well--essentially the same was as relational databases with a master-slave architecture.

I'm hoping to switch out what we're doing now with Cassandra sometime soon to get the distribution and scalability. There's still some research to do, but I'm looking forward to it.