Search code examples
cassandranosqlcassandra-3.0database-partitioningscylla

Cassandra changing Primary Key vs Firing multiple select queries


I have a table that stores list products that a user has. The table looks like this.

create table my_keyspace.userproducts{
  userid,
  username,
  productid,
  productname,
  producttype,
Primary Key(userid)
}

All users belong to a group, there could be min 1 to max 100 users in a group

userid|groupid|groupname|
1     |g1     | grp1  
2     |g2     | grp2  
3     |g3     | grp3  

We have new requirement to display all products for all users in a single group.

So do i change my userproducts so that my Partition Key is now groupid and make userid as my cluster key, so that i get all my results in one single query.

Or do I keep my table design as it is and fire multiple select queries by selecting all users in a group from second table and then fire one select query for each user, consolidate data in my code and then return it to the users

Thanks.


Solution

  • Even before getting to your question, your data modelling as you presented it has a problem: You say that you want to store "a list products that a user has". But this is not what the table you presented has - your table has a single product for each userid. The "userid" is the key of your table, and each entry in the table, i.e, each unique userid, has one combination of the other fields.

    If you really want each user to have a list of products, you need the primary key to be (userid, productid). This means that each record is indexed by both a userid and a productid, or in other words - a userid has a list of records each with its own productid. Cassandra allows you to efficiently fetch all the productid records for a single userid because it implements the first part of the key as a "partition key" but the second part is a "clustering key".

    Regarding your actual question, you indeed have two options: Either do multiple queries on your original tables, or do so-called denormalization, i.e., create a second table with exactly what you want searchable immediately. For the second option you can either do it manually (update both tables every time you have new data), or let Cassandra update the second table for you automatically, using a feature called Materialized Views.

    Which of the two options - multiple queries or multiple updates - to use really depends on your workload. If it has many updates and rare queries, it is better to leave updates quick and make queries slower. If, on the other hand, it has few updates but many queries, it is better to make updates slower (when each update needs to update both tables) but make queries faster. Another important issue is how much query latency is important for you - the multiple queries option not only increases the load on the cluster (which you can solve by throwing more hardware at the problem) but also increases the latency - a problem which does not go away with more hardware and for some use cases may become a problem.

    You can also achieve a similar goal in Cassandra by using the Secondary Index feature, which has its own performance characteristics (in some respects it is similar to the "multiple queries" solution).