Tags: database, cassandra, cql, database-normalization, scylla

Cassandra table definition / partitioning / modeling


Trying to define the right schema / table for our scenario: We have a few hundred eCommerce sites, each of them with a unique siteId.

Each site has its own end-users, up to 10M unique users per month. Each user has a unique userId.

Each end-user interacts with the site: they view products, add products to the cart and purchase products (we call these user events). I want to store the activities of the last 30 days (or 180 days if possible).

Things to consider:

  • Site sizes differ! We have some "heavy" sites with 10M end users, but we also have "light" sites with a few hundred/thousand users.
  • Events don't have unique ids.
  • Users can have more than one event at a time, for example they can view a page with more than one product (but we could live without that restriction to simplify).
  • Rough estimation: 100 Customers x 10M EndUsers x 100 Interactions = 100,000,000,000 rows (per month)
  • Writes are done in real time (when the event arrives at the server). Reads are done much less often (about 1% of the events).
  • Events have some more metadata, and different event types (view/purchase/...) have different metadata.
  • Should we use a keyspace per site and manage a table per site, or keep all customers in one table?
  • How to define the key here?

    +--------+---------+------------+-----------+-----------+-----------+
    | siteId | userId  | timestamp  | eventType | productId | other ... |
    +--------+---------+------------+-----------+-----------+-----------+
    |      1 | user-42 | 1501234567 | view      | abc       |           |
    |      1 | user-42 | 1501234568 | purchase  | abc       |           |
    +--------+---------+------------+-----------+-----------+-----------+
    

My query is: get all events (and their metadata) of a specific user. As estimated above, around 100 events.

Edit 2: I guess it wasn't clear, but the uniqueness of users is per site; two different users might have the same id if they are on different sites.


Solution

  • If you want to query by user_id, then user_id should be the first part of your compound primary key (this is the partition key). Use a compound primary key to create columns that you can query to return sorted results. I would suggest the following schema:

    CREATE TABLE user_events (
        user_id    bigint,
        timestamp  timestamp,
        event_type text,
        site_id    bigint,
        product_id bigint,
        PRIMARY KEY (user_id, site_id, timestamp, product_id)
    );
    

    That should make queries like

    SELECT * FROM user_events WHERE user_id = 123 and site_id = 456;
    

    quite performant. By adding the timestamp to the PK you can also easily LIMIT your queries to get only the top (latest) 1000 (or however many you need) events, without running into performance issues caused by highly active users (or bots) with a very long history.
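
    For illustration, a minimal sketch of that idea (my variant, not part of the original suggestion): declaring a descending clustering order on timestamp lets a plain LIMIT return the newest events first. The table name user_events_desc is hypothetical.

    -- same columns as above, but newest events sort first within a partition
    CREATE TABLE user_events_desc (
        user_id    bigint,
        timestamp  timestamp,
        event_type text,
        site_id    bigint,
        product_id bigint,
        PRIMARY KEY (user_id, site_id, timestamp, product_id)
    ) WITH CLUSTERING ORDER BY (site_id ASC, timestamp DESC, product_id ASC);

    -- the latest 100 events of this user on this site
    SELECT * FROM user_events_desc WHERE user_id = 123 AND site_id = 456 LIMIT 100;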

    One thing to keep in mind: I would recommend having the user_id, or a combination of user_id and site_id, as the partition key (the first part of the primary key). That will prevent your rows from becoming too big.

    So an alternative design would look like this:

    CREATE TABLE user_events (
        user_id    bigint,
        timestamp  timestamp,
        event_type text,
        site_id    bigint,
        product_id bigint,
        PRIMARY KEY ((user_id, site_id), timestamp, product_id)
    );
    

    The "downside" of this approach is that you always have to provide user and site-id. But I guess that is something that you have to do anyways, right?

    To point out one thing: the partition key (also called the row key) identifies a row, and a row will stay on a specific node. For this reason it is a good idea to keep the rows more or less the same size. A row with a couple of thousand or tens of thousands of columns is not really a problem. You will get problems if some rows have millions of columns and other rows have only 10-20 columns. That will cause the cluster to become unbalanced. Furthermore, it makes the row caches less effective. In your example I would suggest avoiding site_id alone as the partition key (row key).
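
    If a single user+site history could still grow very large over 180 days, one common way to bound the row size (purely an optional extension, not part of the suggestion above) is to add a time bucket such as a month column to the partition key. A rough sketch with hypothetical names:

    -- the month bucket bounds each partition; reading 30 days of history
    -- then touches at most two partitions per user and site
    CREATE TABLE user_events_by_month (
        user_id    bigint,
        site_id    bigint,
        month      text,          -- e.g. '2017-07'
        timestamp  timestamp,
        event_type text,
        product_id bigint,
        PRIMARY KEY ((user_id, site_id, month), timestamp, product_id)
    );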

    Does that make sense to you? Maybe the excellent answer to this post gives you some more insight: difference between partition-key, composite-key and clustering-key. Furthermore, a closer look at this part of the DataStax documentation offers some more details.

    Hope that helps.