Search code examples
cassandradata-modelingcassandra-2.0cql3nosql

Cassandra data model with multiple conditions


I'm new to Cassandra, so I read a dozen articles about it and thus I know the basics. All the tutorials show efficient data retrieval by 1 or 2 columns and a time range. What I could not find was how to correctly model your data if you have more conditions.

I have a big events normalised database, with quite a few columns, say:

  • Event type
  • time
  • email
  • User_age
  • user_country
  • user_language
  • and so on.

I would need to be able to query by all columns. So in RDBMS I would query:

SELECT email FROM table WHERE time > X AND user_age BETWEEN X AND X AND user_language = 'nl' etc..

I know I can make a separate table for each column, but then I would still need to combine the results. Maybe this is not a bad approach, but I doubt it since there are no subqueries.

My question is obviously, how can I model this kind of data correctly in Cassandra?

Thanks a lot!


Solution

  • I would need to be able to query by all columns.

    Let me stop you right there. In Cassandra, you create your tables based on your anticipated query patterns, and usually a table supports a single query. In your case, you have "quite a few" columns and you will need to duplicate that data into a table designed to support each possible query. That is going to get big and ungainly, very quickly.

    Could we just add the rest as secondary indexes? there could potentially still be millions of rows in the eventtype table + merchant_id + time selection.

    Secondary indexes are intended to be used on middle-of-the-road cardinality columns. So both, extremely low and extremely high cardinality columns are bad for secondary indexes. The problem, is that Cassandra will have to pick one of your nodes as a coordinator, scan the index on each node (incurring lots of network time), and then build and return the result set. It's a prescription for poor performance, that flies in-the-face of the best practices for working with a distributed database.

    In short, Cassandra is not a good solution for use cases like this. It sounds like you want to be able to do OLAP-type queries, and for that you should use a tool that is better-suited for that purpose.