Search code examples
cassandranosqlcql

Design schema to query select beetwen two separate dates


I have a table for messages, like this:

messages (
    user_id uuid,
    id uuid,
    message blob,
    created_at timestamp,
    updated_at timestamp,
    PRIMARY KEY (user_id, id)
)

I can create a MV for sort and select messages by updated_at. But when I need to select last updated messages from last time when client sync (e.g. select where updated_at > 1555602962006 and created_at < 1555602962006) - only way is to select all messages by updated_at and filter rows in code? Is this a normal practice in production?

Maybe in this case, it's possible to create some token to sort by concatenating created_at and updated_at or something?


Solution

  • So the tricky part about this, is that Cassandra will not allow a range query on multiple, different columns. Secondly, range queries only work within a partition, so you'll need to come up with a time "bucketing" strategy.

    One way that I have modeled around this in the past, was to have one column for a timestamp, but different rows for the type of time. I'll put together an example here, using your table above. For a time bucket, I'll use month; essentially, I know I'll only ever query for messages created/updated in the last month (that may or may not work for you).

    CREATE TABLE messages_by_month (
      month_bucket int,
      event_time timestamp,
      user_id uuid,
      id uuid,
      message text,
      event text,
      PRIMARY KEY (month_bucket, event_time, id));
    

    After INSERTing some data, I can now query for a range of created and updated event times, like this:

    SELECT id,event_time,event,message
    FROM messages_by_month 
    WHERE month_bucket=201905
      AND event_time > 1558619000000
      AND event_time < 1558624900000;
    
     id                                   | event_time                      | event   | message
    --------------------------------------+---------------------------------+---------+---------
     a4d60c29-ad4e-4023-b869-edf1ea6207e2 | 2019-05-23 14:00:00.000000+0000 | CREATED |     hi!
     66e78a1e-dbcb-4f64-a0aa-6d5b0e64d0ed | 2019-05-23 14:20:00.000000+0000 | CREATED |     hi!
     f1c59bf4-1351-4527-a24b-80bb6e3a2a5c | 2019-05-23 15:00:00.000000+0000 | UPDATED |    hi2!
     a4d60c29-ad4e-4023-b869-edf1ea6207e2 | 2019-05-23 15:20:00.000000+0000 | UPDATED |    hi3!
    
    (4 rows)
    

    While this exact example may not work for you, the idea is to think about organizing your data differently to support your query.

    Note: With this example, you'll need to watch your partition sizes. You may end up having to add another partition key if this table gets more than 50k-100k messages in a month.