database database-design cassandra denormalization

Cassandra: denormalization and paging

I'm trying to understand and to get familiar with cassandra data models. This article explains some basic modeling rules:

https://www.ebayinc.com/stories/blogs/tech/cassandra-data-modeling-best-practices-part-1/

Option 3 shows an denormalized data model:

Am I getting the things right, that the "user_by_item" table has the following structure?

CREATE TABLE "user_by_item" (
    item_id int,
    users list<User>
    PRIMARY KEY (item_id)
)

If yes: it is clear that I can get all users by item_id with one query. But there is no possibility to page through the user list then.

Did I understand the table structure right and how are lists of items managed then, especially if they can get very large?

Solution

First of all, that article is 6 years old. For its time, it was a great article, but Cassandra has changed significantly since then. Case in point, collections didn't exist in Cassandra 1.1, which I think was the most-recent version at the time of this writing.

Am I getting the things right, that the "user_by_item" table has the following structure?

Yes, I think you are understanding it. Using item_id as a single PRIMARY KEY on users_by_item, while storing users as a collection is one way that you could do this. But, it limits your query flexibility to pulling all of the users back at once.

Probably the most query-friendly way to build that query table, is with a clustering key on user_id:

CREATE TABLE user_by_item (
  item_id int,
  user_id int,
  email text,
  name text,
  PRIMARY KEY ((item_id),user_id)
);

This way, I can query for all users tied to item 111:

aploetz@cqlsh:stackoverflow> SELECT * FROM user_by_item WHERE item_id=111;

 item_id | user_id | email   | name
---------+---------+---------+------
     111 |     123 | jp@ebay |  Jay
     111 |     456 | jd@ebay | John

(2 rows)

And I can also query just Jay, if I know his user_id:

aploetz@cqlsh:stackoverflow> SELECT * FROM user_by_item WHERE item_id=111
                               AND user_id=123;

 item_id | user_id | email   | name
---------+---------+---------+------
     111 |     123 | jp@ebay |  Jay

(1 rows)

This gives me a bit more query flexibility, while also storing all of the user data by item_id.

Pro tips:

Don't encapsulate your table name in double quotes unless you have to. It forces Cassandra to maintain its case, but can make data retrieval a frustrating experience later on.
When modeling for Cassandra, it is a common practice to use natural keys like name = "Jay." The whole point of a surrogate key like _id, was that something could be referenced from a main table without risking it being misspelled every time it was needed/stored. In Cassandra we don't have things like foreign keys, so natural keys help you cut out some unnecessary columns.
Primary keys in Cassandra cannot change. So the exception to the above rule, is if a primary key value is predicted to change (Jay legally changes his name, for instance) then using a surrogate key becomes a good idea.