Search code examples
databasedatabase-designcassandradenormalization

Cassandra: denormalization and paging


I'm trying to understand and to get familiar with cassandra data models. This article explains some basic modeling rules:

https://www.ebayinc.com/stories/blogs/tech/cassandra-data-modeling-best-practices-part-1/

Option 3 shows an denormalized data model:

enter image description here

Am I getting the things right, that the "user_by_item" table has the following structure?

CREATE TABLE "user_by_item" (
    item_id int,
    users list<User>
    PRIMARY KEY (item_id)
)

If yes: it is clear that I can get all users by item_id with one query. But there is no possibility to page through the user list then.

Did I understand the table structure right and how are lists of items managed then, especially if they can get very large?


Solution

  • First of all, that article is 6 years old. For its time, it was a great article, but Cassandra has changed significantly since then. Case in point, collections didn't exist in Cassandra 1.1, which I think was the most-recent version at the time of this writing.

    Am I getting the things right, that the "user_by_item" table has the following structure?

    Yes, I think you are understanding it. Using item_id as a single PRIMARY KEY on users_by_item, while storing users as a collection is one way that you could do this. But, it limits your query flexibility to pulling all of the users back at once.

    Probably the most query-friendly way to build that query table, is with a clustering key on user_id:

    CREATE TABLE user_by_item (
      item_id int,
      user_id int,
      email text,
      name text,
      PRIMARY KEY ((item_id),user_id)
    );
    

    This way, I can query for all users tied to item 111:

    aploetz@cqlsh:stackoverflow> SELECT * FROM user_by_item WHERE item_id=111;
    
     item_id | user_id | email   | name
    ---------+---------+---------+------
         111 |     123 | jp@ebay |  Jay
         111 |     456 | jd@ebay | John
    
    (2 rows)
    

    And I can also query just Jay, if I know his user_id:

    aploetz@cqlsh:stackoverflow> SELECT * FROM user_by_item WHERE item_id=111
                                   AND user_id=123;
    
     item_id | user_id | email   | name
    ---------+---------+---------+------
         111 |     123 | jp@ebay |  Jay
    
    (1 rows)
    

    This gives me a bit more query flexibility, while also storing all of the user data by item_id.

    Pro tips:

    • Don't encapsulate your table name in double quotes unless you have to. It forces Cassandra to maintain its case, but can make data retrieval a frustrating experience later on.
    • When modeling for Cassandra, it is a common practice to use natural keys like name = "Jay." The whole point of a surrogate key like _id, was that something could be referenced from a main table without risking it being misspelled every time it was needed/stored. In Cassandra we don't have things like foreign keys, so natural keys help you cut out some unnecessary columns.
    • Primary keys in Cassandra cannot change. So the exception to the above rule, is if a primary key value is predicted to change (Jay legally changes his name, for instance) then using a surrogate key becomes a good idea.