Search code examples
databasecassandranosqldata-modelingcql

Cassandra - Grouping By ID and Ordering by Date


A portion of my application consists of a discussion board: there are threads, posts, and categories. Threads are grouped by category, and posts are grouped by threads. I'm having a problem coming up with a model / query that will allow the selection threads by category with a descending ordering of their last post.

Category

CREATE TABLE keyspace.categories (
    id ascii PRIMARY KEY,
    description text,
    name text,
    ...
);

Thread

CREATE TABLE keyspace.threads (
    id ascii PRIMARY KEY,
    category_id ascii,
    content text,
    ...
);

Post

CREATE TABLE keyspace.posts (
    thread_id ascii,
    created_at timestamp,
    id ascii,
    content text,
    ...
    PRIMARY KEY (thread_id, created_at, id)
);

I initially thought about putting the last post's "created at" time as a clustering key on the thread table, but that's impossible as it changes with each post.

I then thought about creating an intermediate table that is written to every time a post is created. This solves the immutability issue with the first approach, but the problem is that it will contain multiple values per thread and I have not been able to figure out a partition / clustering order that would support grouping by thread and ordering by date.

For example, the following would allow me to group by thread, but not order by date:

CREATE TABLE last_post_for_category (
    category_id ascii,
    thread_id ascii,
    created_at timestamp,
    PRIMARY KEY ((category_id), thread_id, created_at)
) WITH CLUSTERING ORDER BY (thread_id DESC, created_at DESC);

SELECT thread_id FROM last_post_for_category WHERE category_id = 'category' GROUP BY thread_id, created_at;

And the following would allow me to order by date, but not group by thread:

CREATE TABLE keyspace.last_post_for_category (
    category_id ascii,
    thread_id ascii,
    created_at timestamp,
    PRIMARY KEY ((category_id), created_at, thread_id)
) WITH CLUSTERING ORDER BY (created_at DESC, thread_id DESC);

SELECT thread_id FROM last_post_for_category WHERE category_id = 'category' GROUP BY created_at, thread_id;

I'm not able to do a distinct on (category_id, thread_id) either as I know nothing about thread IDs at the point in which this query is executed.

Does anyone have any idea on how I can best represent this ordering?


Solution

  • First of all, I recommend that you use the datatype datetime rather than timestamp, since it will make it easy for you to modify it or set a default value. This is just a recommendation.

    Suggested solution:

    Add the attribute last_post to the table threads to save the time of last added post in each thread.
    When a thread is first created, the last_post value should equal a very old date (because no posts in that thread yet).

    After tha, create a trigger so that whenever a post is inserted in posts, the trigger updates the last_post value of the corresponding thread. The trigger can be added like this:

    CREATE TRIGGER triggerName ON posts
    FOR INSERT
    AS
    declare @post_time datetime;
    declare @thread_id int;
    select @post_time=i.created_at from inserted i;
    select @thread_id=i.thread_id from inserted i;
    
    update threads set lastpost = @post_time where id=@thread_id  
    GO
    

    Last step will be a direct query to select threads by category sorted by last_post, just like this:

    select * from threads where category_id = 'theCategoryYouWant' order by lastpost asc /*or desc as you like*/  
    

    Note: if you want created_at to be updated when the post is edited, you will need to add a similar trigger to update the last_post attribute of the corresponding thread.