Is Cassandra suitable for Aggregate Queries?

I have read that Columnar databases are apt for Aggregate Queries and Cassandra is a columnar database. I am trying to use count( values 'between' or '>=' for a specific partition) in Cassandra. Is this performance intensive?

Solution

It's a common misconception that Cassandra is a columnar database. I think it comes from the old terminology "column family" for tables. Data is stored in rows containing columns of key-value pairs which is why the tables used to be called column families.

A major difference compared to traditional relational databases is that Cassandra tables can be 2-dimensional (each record contains exactly one row) or multi-dimensional (each record can contain ONE OR MORE rows).

On the other hand, columnar databases flips a 2-dimensional table such that data is stored in columns instead of rows, specifically optimised for analytics-type queries such as aggregations -- this is NOT Cassandra.

Going back to your question, counting the rows within a single partition is ok to do for most data models. The key is to restrict the query to just one partition like:

    SELECT COUNT(some_column) FROM table_name
        WHERE pk = ?

It's also OK to count the rows in a range query as long as they're restricted to one partition like:

    SELECT COUNT(some_column) FROM table_name
        WHERE pk = ?
        AND clustering_col >= ?
        AND clustering_col <= ?

If you don't restrict the query to a single partition, it might work for (a) very small datasets and (b) clusters with a very low number of nodes but it doesn't scale as (c) the dataset grows, and (d) the number of nodes increases. I've explained why performing aggregates such as COUNT() is bad in Cassandra in this post -- https://community.datastax.com/questions/6897/.

This is not to say that Cassandra isn't a good fit. Cassandra is a good choice if your primary use case is for storing real-time data for OLTP workloads. For analytics queries, you just need to use other software like Apache Spark since the spark-cassandra-connector will optimise the queries to Cassandra. Cheers!