cassandra, cql

How to properly create a table for user messages?


I have a table of user channels in Apache Cassandra, along with queries that fetch a user's list of channels and fetch data about a single channel. One of the queries requires ALLOW FILTERING. We think the best option is to duplicate the table, giving the copy the primary key ((channel_id, bucket), user_id). That raised a question: does Apache Cassandra have something like links, so that the application does not have to constantly write and update data in both tables? If you know a solution that avoids this duplication, I would be very grateful.

create table channels
(
    user_id     bigint,
    bucket      int,
    channel_id  bigint,
    flags       int,
    permissions bigint,
    type        int,
    primary key ((user_id, bucket), channel_id)
);

SELECT * FROM channels WHERE user_id = :user_id AND bucket = :bucket;
SELECT * FROM channels WHERE channel_id = :channel_id ALLOW FILTERING;

Solution

  • The suggestion is to create two tables and keep them in sync at the application level. This is called denormalization; let me explain.

    Regarding ALLOW FILTERING:

    SELECT * FROM channels WHERE channel_id = :channel_id ALLOW FILTERING
    

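    This query scans the entire dataset because it does not restrict the partition key (user_id, bucket). channel_id is only a clustering column, so Cassandra cannot tell which partitions hold the matching rows. Used this way, ALLOW FILTERING is a Cassandra anti-pattern: it leads to cluster-wide reads, timeouts, and high resource consumption.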

    Regarding the suggestion of duplicating the table:

    Duplicating the table is the right approach, but you seem to be looking for "links" (like foreign keys) that automatically synchronize both tables. Cassandra does not support foreign keys because it is not a relational database; it embraces denormalization instead.

    Regarding Cassandra itself:

    Cassandra is write-optimized and scales horizontally. Duplicating data across tables is encouraged for different query patterns because writes are cheap, while reads can be expensive. Imagine having hundreds of nodes in your cluster: you want each query to involve the minimum number of nodes, both when you write and when you read, so that you can scale horizontally when needed. A Cassandra schema should be designed around your queries. If you have two queries (one by user_id and one by channel_id), you need two tables with appropriate primary keys to serve both efficiently.

    Proposed solution:

    Create two tables:

    CREATE TABLE channels_by_user (
        user_id     bigint,
        bucket      int,
        channel_id  bigint,
        flags       int,
        permissions bigint,
        type        int,
        PRIMARY KEY ((user_id, bucket), channel_id)
    );
    

    and

    CREATE TABLE channels_by_channel_id (
        channel_id  bigint,
        bucket      int,
        user_id     bigint,
        flags       int,
        permissions bigint,
        type        int,
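        -- user_id must be part of the key; otherwise rows for different users
        -- with the same channel_id and bucket would overwrite each other
        PRIMARY KEY (channel_id, bucket, user_id)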
    );
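
    With these two tables, each query from the question restricts the full partition key of the table it reads, so neither needs ALLOW FILTERING. A short sketch, reusing the bind markers from the question:

    ```cql
    -- List the channels of one user (partition key: user_id, bucket)
    SELECT * FROM channels_by_user
    WHERE user_id = :user_id AND bucket = :bucket;

    -- Fetch the rows stored for one channel (partition key: channel_id)
    SELECT * FROM channels_by_channel_id
    WHERE channel_id = :channel_id;
    ```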
    

    Write to both tables from the application whenever a row is inserted, updated, or deleted. Since writes in Cassandra are fast, this duplication will not significantly affect performance.
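
    One way to issue both writes together is a logged batch. This is a sketch under assumptions, not the only option. If any statement in a logged batch is applied, Cassandra guarantees that the others eventually are too, which helps keep the two tables from drifting apart after a partial failure. The bind markers are placeholders, and a logged batch that spans partitions costs more coordination than two independent writes, so it is worth measuring before adopting it:

    ```cql
    -- BEGIN BATCH is a logged batch by default
    BEGIN BATCH
        INSERT INTO channels_by_user (user_id, bucket, channel_id, flags, permissions, type)
        VALUES (:user_id, :bucket, :channel_id, :flags, :permissions, :type);

        INSERT INTO channels_by_channel_id (channel_id, bucket, user_id, flags, permissions, type)
        VALUES (:channel_id, :bucket, :user_id, :flags, :permissions, :type);
    APPLY BATCH;
    ```

    The batch gives atomicity but not isolation: a reader may briefly see the row in one table before it appears in the other.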