Trying to define the right schema / table for our scenario:
We have a few hundred eCommerce sites, each of which has a unique siteId.
Each site has its own end-users, up to 10M unique users per month. Each user has a unique userId.
Each end-user interacts with the site: they view products, add products to the cart and purchase products (we call these user events). I want to store the activities of the last 30 days (or 180 days if possible).
Things to consider:
How to define the key here?
+--------+--------+------------+-----------+-----------+-----------+
| siteId | userId | timestamp  | eventType | productId | other ... |
+--------+--------+------------+-----------+-----------+-----------+
| 1      | 12345  | 1501234567 | view      | abc       |           |
| 1      | 12345  | 1501234568 | purchase  | abc       |           |
+--------+--------+------------+-----------+-----------+-----------+
My query is: get all events (and their metadata) of a specific user; as assumed above, that is around 100 events.
Edit 2: I guess it wasn't clear, but the uniqueness of users is per site; two different users might have the same id if they are on different sites.
If you want to query by userid, then the userid should be the first part of your compound primary key (this is the partition key). Use a compound primary key to create columns that you can query to return sorted results. I would suggest the following schema:
CREATE TABLE user_events (
    userid bigint,       -- CQL has no "long" type; bigint is the 64-bit integer type
    timestamp timestamp,
    event_type text,
    site_id bigint,
    product_id text,     -- your sample data ("abc") suggests text rather than a numeric id
    PRIMARY KEY (userid, site_id, timestamp, product_id)
);
That should make queries like
SELECT * FROM user_events WHERE userid = 123 AND site_id = 456;
quite performant. By adding the timestamp to the primary key you can also easily LIMIT your queries to get only the latest 1000 events (or however many you need) without running into performance issues caused by highly active users (or bots) with a very long history.
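If you mostly want the newest events first, a variant of the schema above could declare a descending clustering order on the timestamp, so that a plain LIMIT returns the latest events without needing ORDER BY at query time (a sketch, same columns as above):

CREATE TABLE user_events (
    userid bigint,
    timestamp timestamp,
    event_type text,
    site_id bigint,
    product_id text,
    PRIMARY KEY (userid, site_id, timestamp, product_id)
) WITH CLUSTERING ORDER BY (site_id ASC, timestamp DESC, product_id ASC);

-- latest 1000 events of user 123 on site 456; rows are already stored newest-first
SELECT * FROM user_events WHERE userid = 123 AND site_id = 456 LIMIT 1000;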
One thing to keep in mind: I would recommend having the user_id, or a combination of user_id and site_id, as the partition key (the first part of the primary key). That will prevent your rows from becoming too big.
So an alternative design would look like this:
CREATE TABLE user_events (
    userid bigint,
    timestamp timestamp,
    event_type text,
    site_id bigint,
    product_id text,
    PRIMARY KEY ((userid, site_id), timestamp, product_id)
);
The "downside" of this approach is that you always have to provide user and site-id. But I guess that is something that you have to do anyways, right?
To point out one thing: the partition key (also called the row key) identifies a row, and a row stays on a specific node. For this reason it is a good idea to have rows of more or less the same size. A row with a couple of thousand or even tens of thousands of columns is not really a problem; you will run into problems if some rows have millions of columns while others have only 10-20. That will leave the cluster unbalanced, and it also makes the row caches less effective. In your example I would suggest avoiding site_id alone as the partition key (row key), since a few hundred sites with up to 10M users each would produce exactly such huge rows.
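One more thing regarding the 30-day (or 180-day) retention you mentioned: Cassandra can expire the events automatically via TTL, so you never have to delete old rows yourself. A sketch for 30 days (2592000 seconds):

-- table-level default; expired cells are purged during compaction
ALTER TABLE user_events WITH default_time_to_live = 2592000;

-- or per insert:
INSERT INTO user_events (userid, site_id, timestamp, event_type, product_id)
VALUES (123, 456, toTimestamp(now()), 'view', 'abc')
USING TTL 2592000;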
Does that make sense to you? Maybe the excellent answer to this post will give you some more insight: difference between partition-key, composite-key and clustering-key. Furthermore, a closer look at this part of the DataStax documentation offers some more details.
Hope that helps.