I'm trying to understand and to get familiar with cassandra data models. This article explains some basic modeling rules:
https://www.ebayinc.com/stories/blogs/tech/cassandra-data-modeling-best-practices-part-1/
Option 3 shows an denormalized data model:
Am I getting the things right, that the "user_by_item" table has the following structure?
CREATE TABLE "user_by_item" (
item_id int,
users list<User>
PRIMARY KEY (item_id)
)
If yes: it is clear that I can get all users by item_id with one query. But there is no possibility to page through the user list then.
Did I understand the table structure right and how are lists of items managed then, especially if they can get very large?
First of all, that article is 6 years old. For its time, it was a great article, but Cassandra has changed significantly since then. Case in point, collections didn't exist in Cassandra 1.1, which I think was the most-recent version at the time of this writing.
Am I getting the things right, that the "user_by_item" table has the following structure?
Yes, I think you are understanding it. Using item_id as a single PRIMARY KEY
on users_by_item, while storing users as a collection is one way that you could do this. But, it limits your query flexibility to pulling all of the users back at once.
Probably the most query-friendly way to build that query table, is with a clustering key on user_id
:
CREATE TABLE user_by_item (
item_id int,
user_id int,
email text,
name text,
PRIMARY KEY ((item_id),user_id)
);
This way, I can query for all users tied to item 111:
aploetz@cqlsh:stackoverflow> SELECT * FROM user_by_item WHERE item_id=111;
item_id | user_id | email | name
---------+---------+---------+------
111 | 123 | jp@ebay | Jay
111 | 456 | jd@ebay | John
(2 rows)
And I can also query just Jay, if I know his user_id
:
aploetz@cqlsh:stackoverflow> SELECT * FROM user_by_item WHERE item_id=111
AND user_id=123;
item_id | user_id | email | name
---------+---------+---------+------
111 | 123 | jp@ebay | Jay
(1 rows)
This gives me a bit more query flexibility, while also storing all of the user data by item_id
.
Pro tips:
name
= "Jay." The whole point of a surrogate key like _id
, was that something could be referenced from a main table without risking it being misspelled every time it was needed/stored. In Cassandra we don't have things like foreign keys, so natural keys help you cut out some unnecessary columns.name
, for instance) then using a surrogate key becomes a good idea.