Search code examples
cassandracassandra-2.0cassandra-cli

What do you do change in the such a data model Cassandra?


I have task to create a social feed(news feed). I think no need to explain the standard functionality - all are how as FB. I chose solution apache cassandra and designed a data column Posts for storing information about posts users:

CREATE TABLE Posts (
  post_id  uuid,
  post_at  timestamp,
  user_id  text,
  name     varchar,
  category set<text>,
  link     varchar,
  image    set<varchar>,
  video    set<varchar>,
  content  map<text, text>,
  private  boolean,

  PRIMARY KEY ((post_id, user_id), post_at)
)
WITH CLUSTERING ORDER BY (post_at DESC) COMPACT STORAGE;

The next table contains id user posts:

CREATE TABLE posts_user (
  post_id  bigint,
  post_at  timestamp,
  user_id  bigint,
  PRIMARY KEY ((post_id), post_at, user_id)
)

WITH CLUSTERING ORDER BY (post_at DESC) AND COMPACT STORAGE;

How do you think, is it good? What do you do change in the such a data model?


Solution

  • There are a couple of questions and a couple of improvements that jump out.

    1. COMPACT STORAGE is deprecated now (if you want to take advantage of CQL 3 features). I do not think that you can create your table Posts as you have defined above since it uses CQL 3 features (collections) with COMPACT STORAGE as well as declaring more than one column that is not part of the primary key.

    2. posts_user has completely different key types than Posts does. I am not clear on what the relationship between the two tables is, but I imagine that post_id is supposed to be consistent between them, whereas you have it as a uuid in one table and a bigint in the other. There are also discrepancies with the other fields.

    3. Assuming post_id is unique and represents the id of an individual post, it is strange to have it as the first part of a compound primary key in the Posts table since if you know the post_id then you can already uniquely access the record. Furthermore, as it is part of the partition key it also prevents you from doing wider selects of multiple posts and taking advantage of your post_at ordering.

    The common method to fix this is to create a dedicated index table to sort the data the way you want.

    E.g.

    CREATE TABLE posts (
      id       uuid,
      created  timestamp,
      user_id  uuid,
      name     text,
      ...
      PRIMARY KEY (id)
    );
    
    CREATE TABLE posts_by_user_index (
      user_id    uuid,
      post_id    uuid,
      post_at    timestamp,
      PRIMARY KEY (user_id,post_at,post_id)
      WITH CLUSTERING ORDER BY (post_at DESC)
    );
    

    Or more comprehensively:

    CREATE TABLE posts_by_user_sort_index (
      user_id    uuid,
      post_id    uuid,
      sort_field text,
      sort_value text,
      PRIMARY KEY ((user_id,sort_field),sort_value,post_id)
    );
    

    However, in your case if you only wish to select the data one way, then you can get away with using your posts table to do the sorting:

    CREATE TABLE posts (
      id       uuid,
      post_at  timestamp,
      user_id  uuid,
      name     text,
      ...
      PRIMARY KEY (user_id,post_at,id)
      WITH CLUSTERING ORDER BY (post_at DESC)
    );
    

    It will just make it more complicated if you wish to add additional indexes later since you will need to index each post not just by its post id, but by its user and post_at fields as well.