database cassandra datastax datastax-enterprise datastax-startup

Social media's like and unlike data model in Cassandra

Imagine there is a social network and here is a table for storing the like (favorite) action and unlike that is deleting from this table:

CREATE TABLE IF NOT EXISTS post_likes(
  post_id timeuuid,
  liker_id uuid, //liker user_id
    like_time timestamp,
    PRIMARY KEY ((post_id) ,liker_id, like_time)
) WITH CLUSTERING ORDER BY (like_time DESC);

The above table has problem in Cassandra because when liker_id is the first clustering_key, we can't sort by the second clustering key which is like_time.

We need to sort our tables data by like_time, we use it when a user wants to see who liked this post and we show list of people who liked that post that sorted by time (like_time DESC)

and we also need to delete (unlike) and we again need to have post_id and liker_id

What is your suggestion? How we can sort this table by like_time?

Solution

After more researches, I found out this solution: Picking the right data model is the hardest part of using Cassandra and here is the solution we found for likes tables in Cassandra, first of all, I have to say Cassandra's read and write path is amazingly fast and you don't need to be worry about writing on your Cassandra's tables, you need to model around your queries and remember, data duplication is okay. Many of your tables may repeat the same data. and do not forget to spread data evenly around the cluster and minimize the number of partitions read

Since we are using Cassandra which is NoSQL, we know one of the rules in NoSQLs is denormalization and we have to denormalize data and just think about the queries you want to have; Here for the like table data modeling we will have two tables, these tables have mainly focused on the easy read or easier to say we have focused on queries we want to have:

CREATE TABLE IF NOT EXISTS post_likes(
    post_id timeuuid,
    liker_id uuid, //liker user_id
    like_time timestamp,
    PRIMARY KEY ((post_id) ,liker_id)
);

CREATE TABLE IF NOT EXISTS post_likes_by_time(
    post_id timeuuid,
    liker_id uuid, //liker user_id
    like_time timestamp,
    PRIMARY KEY ((post_id), like_time, liker_id)
) WITH CLUSTERING ORDER BY (like_time DESC);

When a user like a post, we just insert into both above tables.

why do we have post_likes_by_time table?

In a social network, you should show list of users who liked a post, it is common that you have to sort likes by the like_time DESC and since you are going to sort likes by like_time you need to have like_time as clustering key to be able to sort likes by time.

Then why do we have post_likes table too?

In the post_likes_by_time, our clustering key is like_time, we also need to remove one like! We can't do that when we sorted data in our table when clustering key is like_time. That is the reason we also have post_likes table

Why you could not only have one table and do both actions, sorting and removing on it?

To delete one like from post_likes table we need to provide user_id (here liker_id) and post_id (together) and in post_likes_by_time we have like_time as clustering key and we need to sort table by like_time, then it should be the first clustering key and the second clustering key could be liker_id, and here is the point! like_time is the first clustering key then for selecting or deleting by liker_id you also need to provide like_time, but you do not have like_time most of the times.