mysql database-design cassandra rdbms schemaless

How does Cassandra compare to MySQL (or any other RDBMS) in a single node setup?

Having studied relational databases, document stores, graph databases, and column-oriented databases, I concluded that something like Cassandra best fits my needs. In particular, the ability to add columns on the fly, with no requirement for a strict schema, seals the deal for me. This seems to nicely bridge the gap between a rather novel graph DB and a time-tested RDBMS.

But I am concerned about running Cassandra on a single node. Like many others, I can start with only a small amount of data, so more than one node is just not practical at the outset. Based on another excellent SO question, Why don't you start off with a "single & small" Cassandra server as you usually do it with MySQL?, I concluded that Cassandra can indeed be run just fine as a single node, as long as one is willing to give up benefits like availability, which are derived from a multi-node setup.

There also seem to be ways to add fields dynamically in an RDBMS, for instance as discussed here on SO: How to design a database for User Defined Fields? This would, to some extent, mimic schemalessness.
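To make the user-defined-fields idea concrete, here is a minimal sketch of one common approach, a key/value side table (often called entity-attribute-value). It uses SQLite from the Python standard library rather than MySQL so that it runs anywhere. The table names, column names, and helper functions are hypothetical, chosen only for illustration, and this is not necessarily the design recommended in the linked question.

```python
import sqlite3

# Hypothetical schema: a fixed "users" table plus a key/value side table
# that holds user-defined fields (the entity-attribute-value pattern).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE user_fields (
        user_id INTEGER NOT NULL REFERENCES users(id),
        field   TEXT NOT NULL,
        value   TEXT,
        PRIMARY KEY (user_id, field)
    );
""")


def set_field(user_id, field, value):
    """Insert or overwrite one custom field; no ALTER TABLE is needed."""
    conn.execute(
        "INSERT OR REPLACE INTO user_fields (user_id, field, value) "
        "VALUES (?, ?, ?)",
        (user_id, field, value),
    )


def get_fields(user_id):
    """Return all custom fields of one user as a dict."""
    rows = conn.execute(
        "SELECT field, value FROM user_fields WHERE user_id = ?", (user_id,)
    )
    return dict(rows.fetchall())


conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
set_field(1, "twitter", "@alice")
set_field(1, "shoe_size", "38")
fields = get_fields(1)
```

The trade-off is that every custom value is stored as text and lives outside the main table, so type checking and queries that filter on several custom fields at once become more awkward than with ordinary columns.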

So I would now like to understand how Cassandra and MySQL compare, with regard to features and performance, on a single-node setup. What would you advise someone in my situation: start with a simple RDBMS with the intent to switch to Cassandra later on, or start with Cassandra?

Solution

In a single-node setup, many of Cassandra's advantages are lost, so the main reason to run it that way would be that you intend to expand to multiple nodes in the future. On a single node, performance would tend to favor an RDBMS in most applications, since an RDBMS is designed for that environment and can assume all data is local.

The strengths of Cassandra are scalability and availability. You can add nodes to increase capacity, and having multiple nodes means you can deal with hardware failures without downtime. These strengths come at the cost of more difficult schema design, since access is based primarily on consistent hashing of the row key, so tables have to be designed around the queries you intend to run. It also means you don't have full SQL available and often must rely on denormalization techniques to support fast access to data. Cassandra is also weak for ACID transactions, since it is inherently difficult to coordinate atomic actions on multiple nodes.
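To illustrate what "access based on consistent hashing" means, here is a minimal sketch of a hash ring in Python. It is not Cassandra's actual implementation: the `HashRing` class, the node names, and the use of MD5 as the token function are assumptions made for the example (Cassandra's default partitioner uses a different hash function). The point is only that a key's hash decides which node owns it, and that adding a node moves only the keys that fall into the new node's ranges.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring; each node owns several tokens (vnodes)."""

    def __init__(self, nodes, vnodes=8):
        self._ring = []  # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node, vnodes)

    @staticmethod
    def _token(key):
        # MD5 is a stand-in here; any well-distributed hash would do.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add_node(self, node, vnodes=8):
        for i in range(vnodes):
            bisect.insort(self._ring, (self._token(f"{node}#{i}"), node))

    def node_for(self, partition_key):
        # The owner is the first token at or after the key's token,
        # wrapping around to the start of the ring if necessary.
        token = self._token(partition_key)
        idx = bisect.bisect_left(self._ring, (token, ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{n}" for n in range(1000)]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

# Keys that changed owner; each of them now lives on the new node.
moved = [k for k in keys if before[k] != after[k]]

# With a single node, every key trivially maps to that node.
solo = HashRing(["only-node"])
```

Because a row can only be found efficiently through the hash of its key, a lookup by any other column has no node to go to directly, which is why the data is usually denormalized into one table per query pattern.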

An RDBMS, by contrast, is a more mature technology. ACID transactions are no problem. Schema design is much simpler, since you can add efficient indexes to any column to optimize queries, and you have joins available so that redundant data can be largely eliminated. Eliminating redundant data makes it much easier to keep your data consistent, since there are not multiple copies of data that need to be updated when, for example, someone changes their address. But you run the risk of running out of space on a single machine to store all your data. And if you get a disk crash, you will have downtime and need backups to restore the data, while Cassandra can often easily repair the data on a node that is out of sync. There is also no easy way to scale an RDBMS to handle higher transaction rates other than buying a faster machine.
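The address example can be sketched in a few lines. This uses SQLite from the Python standard library as a stand-in for MySQL, and the `customers` and `orders` tables are hypothetical. It shows a normalized schema in which the address is stored once and reached through a join, and a transaction that is rolled back as a unit when something fails midway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        item        TEXT
    );
    -- An index on the join column keeps per-customer lookups fast.
    CREATE INDEX idx_orders_customer ON orders(customer_id);
    INSERT INTO customers VALUES (1, 'alice', '1 Old Street');
    INSERT INTO orders VALUES (10, 1, 'book'), (11, 1, 'lamp');
""")

# The address is stored in exactly one row, so a single UPDATE is enough.
with conn:  # commits on success, rolls back on an exception
    conn.execute(
        "UPDATE customers SET address = ? WHERE id = ?", ("2 New Road", 1)
    )

# Every order sees the new address through the join.
shipping = conn.execute("""
    SELECT o.item, c.address
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    ORDER BY o.id
""").fetchall()

# Atomicity: a failure inside the transaction undoes the DELETE as well.
try:
    with conn:
        conn.execute("DELETE FROM orders WHERE customer_id = 1")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

remaining = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

In a denormalized Cassandra-style model the address would typically be copied into each order row, and the application would be responsible for updating every copy.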

There are a lot of other differences, but those are the major ones. Neither one is better than the other in general, but each may be better suited to certain applications. Which one is the better fit really depends on the requirements of your use case.