Search code examples
databasecassandranosqlcassandra-3.0scylla

Keeping Data In Denormalized NoSql Databases (Cassandra/ScyllaDb) Tables In Sync?


I am new to NoSql databases like Cassandra/ScyllaDb and trying to wrap my head around NoSql database tables design and how to keep duplicated data in multiple tables in sync. I just read this post about NoSql Data Modeling and here are the important screenshots from the post.

Application workflow

enter image description here

Chebotko Diagram

enter image description here

I have some few questions regarding the tables design and keeping duplicated data in multiple tables in sync.

  1. There are tables for users and videos. Why isn't any table for comments since it is also an entity.

  2. If a user is adding a comment to a video there has to be two inserts statements I think. One insert for the comments_by_user table and another insert for the comments_by_video table and the commentId value must be the same for the two tables. How do you make sure the commentId in the two tables is the same? Do you for instance assign the timeuuid to a variable and then use the variable in the commentId slot for the two insert statements?

    Example:

    var myuuid = timeuuid();
    
    insert into comments_by_user(...)values(..., myuuid); 
    
    insert into comments_by_video(...) values(...,myuuid);   
    

    The reason am asking this is if a user uses for instance a timeuuid() function in both insert statements then the timeuuid will not be the same for the commentid in both tables because timeuuid() will generate different values.

  3. Should in case there should be a comments table then if a user wants to add comments then there should be three inserts statements. The first insert statement to the comments table and the the comment id that is returned will be used to insert into comments_by_user and comments_by_video.


Solution

  • Disclaimer: I have not worked with Cassandra only with ScyllaDB, but since the latter is designed as a stand-in replacement for the former, the same designing principles should be applied.

    Why isn't any table for comments since it is also an entity.

    You don't design the schema/table structure against entities, you design it against queries which will be executed. It is called query-first data model (in ScyllaDB docs). It allows you to optimize the data structure against actual queries which will be executed against your database.

    Since there is no requirement (and correspondingly - query) to return "any" comments, but only comments per video and per user (and since the comment itself is usually expected to be relatively small) then you don't need to have a dedicated comments table.

    How do you make sure the commentId in the two tables is the same? Do you for instance assign the timeuuid to a variable and then use the variable in the commentId slot for the two insert statements?

    Basically yes, using such datatypes like timeuuid/Guid/at least some versions of uuid allows you to generate them on the client side with very-very-very small (so small that you can ignore it) chance of collisions (from the docs: "timeuuid- Version 1 UUID, generally used as a “conflict-free” timestamp. See Working with UUIDs for details")

    Should in case there should be a comments table then if a user wants to add comments then there should be three inserts statements.

    It depends. First of all as discussed previously - in the described scenario there is no need for such table. But if you will need to query comments (not by user/video) then it actually depends. If you still can consider comment being small - then yes, you can have comment placed in three tables, if for some reason comments are considered "big" then the cost of data duplication can become too big and you might want to store it only in the "entity" table (then you will need 2 queries instead of one to fetch the needed data).

    As for the keeping the data in sync, there are multiple approaches you can try (in general):

    • using batches which allows to group several DML operations which:

      By default, all operations in the batch are performed as logged, to ensure all mutations eventually complete (or none will).

    • handle DML on the application level (for example via retryable queue or some kind of transactional outbox/saga/CDC approach)

    • use materialized views