Search code examples
cassandradatamodel

Internal network application data model with Cassandra


I'm working on designing an application which will enable users to send requests to connect with each other, see their sent or received requests, make notes during their interactions for later reference if connected, and remove users from their contact lists.

In a RDBMS, the schema would be:

table User with column

  • uid (a unique string for each user)

table Request with columns:

  • from - user id

  • to - user id Primary Key (from, to)

  • created - timestamp

  • message - string

  • expiry - timestamp

table Connection with columns:

  • from - user id

  • to - user id

Primary Key (from, to)

  • notes - String

  • created - timestamp

  • modified - timestamp

  • isFavourite - to is a favourite of from user, value 0 or 1

  • isActive - soft delete, value 0 or 1

  • pairedConnection - shows whether the connection between to and from was deactivated (the to user removed the from user from its contact list), value 0 or 1

The queries I anticipate to be needed are:

  • find the sent requests for a user

  • find the received requests for a user

  • find all the active contacts of a given user

  • find all the favourites of a user

  • find all the users who deleted the given from user from their lists

  • update the notes taken by a user when meeting another user he is connected with

  • update user as favourite

  • mark connection for soft deletion

I'm trying to model this in Cassandra, but feel confused about the keys to choose for max efficiency.

So far, I have the following ideas, and would welcome feedback from more experienced Cassandra users:

create table users(
uid text PRIMARY KEY
); 

create table requestsByFrom(
from text,
to text,
message text,
created timestamp,
expiry timestamp,
PRIMARY KEY (from,to)

create table requestsByTo(
from text,
to text,
message text,
created timestamp,
expiry timestamp,
PRIMARY KEY (to,from)
);

create table connections(
from text,
to text,
notes text,
created timestamp,
modified timestamp,
isFavourite boolean,
isActive boolean,
pairedConnection boolean,
PRIMARY KEY (from,to)
);

create table activeConnections(
from text,
to text,
isActive boolean,
PRIMARY KEY (from,isActive)
);

create table favouriteConnections(
from text,
to text,
isFavourite boolean,
PRIMARY KEY (from, isFavourite)
);

create table pairedConnection(
from text,
to text,
pairedConnection boolean,
PRIMARY KEY ((from,to), pairedConnection)
);

Solution

  • Cassandra has a different paradigm to RDBMS, and this is more evident with the way that the data modeling has to be done. You need to keep in mind that denormalization is preferred, and that you'll have repeated data.

    The tables definition should be based on the queries to retrieve the data, this is partially stated in the definition of the problem, for instance:

    find the sent requests for a user

    Taking the initial design of the table requestsByFrom, an alternative will be

    CREATE TABLE IF NOT EXISTS requests_sent_by_user(
        requester_email TEXT,
        recipient_email TEXT,
        recipient_name TEXT,
        message TEXT,
        created TIMESTAMP
    PRIMARY KEY (requester_email, recipient_email)
    ) WITH default_time_to_live = 864000;
    

    Note that from is a restricted keyword, the expiry information can be set with the definition of the default_time_to_live clause (TTL) which will remove the record after the time defined; this value is the amount of seconds after the record is inserted, and the example is 10 days (864,000 seconds).

    The primary key is suggested to be the email address, but it can also be an UUID, name is not recommended as there can be multiple persons sharing the same name (like James Smith) or the same person can have multiple ways to write the name (following the example Jim Smith, J. Smith and j smith may refer to the same person).

    The name recipient_name is also added as it is most likely that you'll want to display it; any other information that will be displayed/used with the query should be added.

    find the received requests for a user

    CREATE TABLE IF NOT EXISTS requests_received_by_user(
        recipient_email TEXT,
        requester_email TEXT,
        requester_name TEXT,
        message TEXT,
        created TIMESTAMP
    PRIMARY KEY (recipient_email, requester_email)
    ) WITH default_time_to_live = 864000;
    

    It will be preferred to add records to requests_sent_by_user and requests_received_by_user at the same time using a batch, which will ensure consistency in the information between both tables, also the TTL (expiration of the data) will be the same.

    storing contacts

    In the question there are 4 tables of connections: connections, active_connections, favourite_connections, paired_connections, what will be the difference between them? are they going to have different rules/use cases? if that is the case, it makes sense to have them as different tables:

    CREATE TABLE IF NOT EXISTS connections(
        requester_email TEXT,
        recipient_email TEXT,
        recipient_name TEXT,
        notes TEXT,
        created TIMESTAMP,
        last_update TIMESTAMP,
        is_favourite BOOLEAN,
        is_active BOOLEAN,
        is_paired BOOLEAN,
        PRIMARY KEY (requester_email, recipient_email)
     );
    
    CREATE TABLE IF NOT EXISTS active_connections(
        requester_email TEXT,
        recipient_email TEXT,
        recipient_name TEXT,
        last_update TIMESTAMP,
        PRIMARY KEY (requester_email, recipient_email)
    );
    
    CREATE TABLE IF NOT EXISTS favourite_connections(
        requester_email TEXT,
        recipient_email TEXT,
        recipient_name TEXT,
        last_update TIMESTAMP,
        PRIMARY KEY (requester_email, recipient_email)
    );
    
    CREATE TABLE IF NOT EXISTS paired_connections(
        requester_email TEXT,
        recipient_email TEXT,
        recipient_name TEXT,
        last_update TIMESTAMP,
        PRIMARY KEY (requester_email, recipient_email)
    );
    

    Note that the boolean flag is removed, the logic is that if the record exists in active_connections, it will be assumed that it is an active connection.

    When a new connection is created, it may have several records in different tables; to bundle all those inserts or updates, it is preferred to use batch

    find all the active contacts of a given user

    Based on the proposed tables, if the requester's email is test@email.com:

    SELECT * FROM active_connections WHERE requester_email = 'test@email.com'
    

    update user as favourite

    It will be a batch updating the record in connections and adding the new record to favourite_connections:

    BEGIN BATCH
    
    UPDATE connections 
    SET is_favourite = true, last_update = dateof(now())
    WHERE requester_email ='test@email.com' 
      AND recipient_email = 'john.smith@test.com';
    
    INSERT INTO favourite_connections (
        requester_email, recipient_email, recipient_name, last_update
    ) VALUES (
        'test@email.com', 'john.smith@test.com', 'John Smith', dateof(now())
    );
    APPLY BATCH;
    

    mark connection for soft deletion

    The information of the connection can be kept in connections with all the flags disabled, as well as the records removed from active_connections, favourite_connections and paired_connections

    BEGIN BATCH
    
    UPDATE connections 
    SET is_active = false, is_favourite = false,
        is_paired = false, last_update = dateof(now())
    WHERE requester_email ='test@email.com' 
      AND recipient_email = 'john.smith@test.com';
    
    DELETE FROM active_connections 
    WHERE requester_email = 'test@email.com' 
      AND recipient_email = 'john.smith@test.com';
    
    DELETE FROM favourite_connections 
    WHERE requester_email = 'test@email.com' 
      AND recipient_email = 'john.smith@test.com';
    
    DELETE FROM paired_connections 
    WHERE requester_email = 'test@email.com' 
      AND recipient_email = 'john.smith@test.com';
    
    APPLY BATCH;