Search code examples
cassandracql

Partitioning data in a multi-tenant format efficiently.


I'm working on a database that will be storing products from multiple accounts. The number of products for a given account could vary significantly and I want to be able to easily query products for a specific account.

At the moment I have a table similar to this.

CREATE TABLE key.products (
    product_id UUID,
    account_id UUID,
    sku TEXT,
    other_details....,
    PRIMARY KEY (account_id, product_id, sku)
);

This table will let me use queries like this quite easily.

SELECT product_id,sku,other_details FROM key.products WHERE account_id=@@@@;

But if I get a handful of accounts that have significantly more products than other accounts it would offset the partitions in Cassandra; and I would no longer have a good and equal distribution of data among my nodes. The data would still be relatively easy to query by account_id but is this okay? At what point will I be shooting myself in the foot for not partitioning with something else? And how can I change my approach to still query products in an account efficiently and minimize data skew?

Would partitioning by product_id instead and have an alternate table to query by account be more efficient? Something like.

CREATE TABLE key.products (
    product_id UUID,
    sku TEXT,
    other_details....,
    PRIMARY KEY (product_id, sku)
);

CREATE TABLE key.products_by_account (
    account_id UUID,
    product_id UUID,
    PRIMARY KEY (account_id, product_id)
);

The data would still skew in the products_by_account table but the size of the data would be much smaller since it does not contain all the data in the main products table. Is this better?


Solution

  • All data modeling in Cassandra happens around queries - you need to think how queries will look like...

    In your case, the good distribution could be with composite partition key, like (account_id, product_id).

    Another approach, is to add some kind of bucketing to "big" accounts - for example, split account's data into N buckets, and use key like (account_id, X), where X is between 0 & N. In this case, if you ever need to have need to fetch all products for given account, you can issue N queries in parallel to fetch everything. Instead of number, you can use product's categories, or something like, that has fixed & known set of values.