Search code examples
cassandra

Guaranteeing uniqueness in sorted data in Cassandra


I'm brand new to Cassandra and I'm having trouble trying to figure out how to order my data. I'm trying to record messages from twitch.tv for each streamer. I want something like the following

CREATE TABLE IF NOT EXISTS chat_data.twitch_chat_by_broadcaster_and_timestamp (
    broadcaster_id int,
    timestamp int,
    message text,
    PRIMARY KEY (broadcaster_id, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

While not an issue for a hobby project, how do you solve the problem of two messages being sent at the exact same time. I'm aware of TimeUUID, but I want the messages sorted by the timestamp of when they were sent, and it doesn't look like there are any official ways to create a TimeUUID other than now().

I envision using this database for getting all messages in a given time range for a given stream. I realize I could add another UUID field to guarantee uniqueness, but then that would screw up my query since I have to specify an exact time.

What's the right way to order this data to allow efficient querying?


Solution

  • I realize I could add another UUID field to guarantee uniqueness, but then that would screw up my query since I have to specify an exact time.

    Not if you added it as the last clustering column:

    CREATE TABLE IF NOT EXISTS chat_data.twitch_chat_by_broadcaster_and_timestamp (
        broadcaster_id int,
        timestamp int,
        message_id uuid,
        message text,
        PRIMARY KEY (broadcaster_id, timestamp, message_id)
    ) WITH CLUSTERING ORDER BY (timestamp DESC, message_id ASC);
    

    This way, you could simply generate a new UUID at write-time. As long as you queried by the partition key of broadcaster_id, the data would still be sorted by (both) timestamp, with message_id added to ensure uniqueness.

    As far as I can see it, the query for messages by broadcaster_id doesn't change:

    > SELECT * FROM twitch_chat_by_broadcaster_and_timestamp WHERE broadcaster_id=68;
    
     broadcaster_id | timestamp     | message_id                           | message
    ----------------+---------------+--------------------------------------+---------------------------
                 68 | 1705328231991 | ac74c7ba-1813-4b75-b873-8817c6f89a55 |            End of stream.
                 68 | 1705328230499 | 0e2116d5-887c-4433-b9cc-e5c13534902d |  Next stream coming soon.
                 68 | 1705327984482 | 01c52a3a-c1c3-4db4-b8a8-6f9fe3111911 |        Incoming messages!
                 68 | 1705327983581 | 7a3c2444-5746-4928-9e6a-0ebaac7fbec5 | Next stream starting soon