
Should I denormalize a Cassandra table with 1B rows according to the queries being used?


I know that in a Cassandra table, an insert with the same primary key as an existing row overwrites the previous value. So, if we insert 10 records with the same primary key, only the 10th value ends up stored. Right?
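
For example (a minimal sketch using a hypothetical table demo_tb, just to illustrate the overwrite behaviour):

CREATE TABLE demo_tb (
    parkey text,
    ck1 text,
    val text,
    PRIMARY KEY ((parkey), ck1)
);

// both INSERTs use the same full primary key ('p1', 'c1'), so the second one overwrites the first
INSERT INTO demo_tb (parkey, ck1, val) VALUES ('p1', 'c1', 'first');
INSERT INTO demo_tb (parkey, ck1, val) VALUES ('p1', 'c1', 'second');

// returns a single row with val = 'second'
SELECT * FROM demo_tb WHERE parkey = 'p1' AND ck1 = 'c1';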

So, I have the table below in my Cassandra database, which holds ~1 billion rows spread across ~4,800 partition keys:

CREATE TABLE tb (
    parkey varchar,  // this is a UUID converted to a String
    pk1 text,
    pk2 float,
    pk3 float,
    pk4 float,
    pk5 text,
    pk6 text,
    pk7 text,
    PRIMARY KEY ((parkey), pk1, pk2, pk3, pk4, pk5, pk6, pk7)
);

This means I have ~1 billion primary keys! The primary key is so wide because a record is unique only when all of those values are taken together. However, I have a feeling this might not be the best table schema: it takes Spark about 5 minutes to query all of this data, and it then hangs for another 10 minutes just before unpersisting the table from memory, and I do not know why!

Should I break down and denormalize the table somehow according to the queries being used? Will that improve the query times? My thought is that even if I break down the table, I will still have ~1 billion primary keys in each denormalized table that gets created. Would that be efficient? Won't it again take 15 minutes to query the newly created tables?

Edit 1

I always use a single query that selects by partition key (a sketch of it is shown after the table), hence one table. Would this improve query times?

CREATE TABLE tb (
    parkey varchar,  // this is a UUID converted to a String
    pk1 varchar,     // also a UUID, but completely unique for every record
    c1 text,
    c2 float,
    c3 float,
    c4 float,
    c5 text,
    c6 text,
    c7 text,
    PRIMARY KEY ((parkey), pk1)
);
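
For reference, the single query is essentially a partition-key lookup of this form (a sketch; the exact column list is an assumption, only the WHERE clause matters here):

// restricted to a single partition key, so only that partition is read
SELECT c1, c2, c3, c4, c5, c6, c7
FROM tb
WHERE parkey = '<some-uuid-string>';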

Solution

  • The quick answer is YES, you should denormalise the data and always start with the app queries. Those who come from a relational DB background tend to focus on how the data is stored (table schema) instead of listing all the app queries first.

    By focusing on the app queries first and THEN designing a table for each query, each table is optimised for reads. If you try to adapt an app query to an existing table, the table will never be optimised and the queries will almost always be slow.

    As a side note, the long answer is that 1B rows != 1B partitions in the schema you posted. The table definition does not have a 1:1 mapping between rows and partitions: each of the partitions in your table can have ONE OR MORE rows (see the sketch below). Cheers!
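
    To make that concrete with the numbers from the question: ~1 billion rows spread over ~4,800 partitions means each partition holds roughly 200,000 rows on average, and a single partition-key lookup returns all of them (a sketch; the partition-key value is a placeholder):

    // one partition, many rows: this returns every row stored under that partition key
    SELECT * FROM tb WHERE parkey = '<some-uuid-string>';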