I'm working on a database that will be storing products from multiple accounts. The number of products for a given account could vary significantly and I want to be able to easily query products for a specific account.
At the moment I have a table similar to this.
CREATE TABLE key.products (
product_id UUID,
account_id UUID,
sku TEXT,
other_details....,
PRIMARY KEY (account_id, product_id, sku)
);
This table will let me use queries like this quite easily.
SELECT product_id,sku,other_details FROM key.products WHERE account_id=@@@@;
But if I get a handful of accounts that have significantly more products than other accounts it would offset the partitions in Cassandra; and I would no longer have a good and equal distribution of data among my nodes. The data would still be relatively easy to query by account_id but is this okay? At what point will I be shooting myself in the foot for not partitioning with something else? And how can I change my approach to still query products in an account efficiently and minimize data skew?
Would partitioning by product_id instead and have an alternate table to query by account be more efficient? Something like.
CREATE TABLE key.products (
product_id UUID,
sku TEXT,
other_details....,
PRIMARY KEY (product_id, sku)
);
CREATE TABLE key.products_by_account (
account_id UUID,
product_id UUID,
PRIMARY KEY (account_id, product_id)
);
The data would still skew in the products_by_account table but the size of the data would be much smaller since it does not contain all the data in the main products table. Is this better?
All data modeling in Cassandra happens around queries - you need to think how queries will look like...
In your case, the good distribution could be with composite partition key, like (account_id, product_id)
.
Another approach, is to add some kind of bucketing to "big" accounts - for example, split account's data into N buckets, and use key like (account_id, X)
, where X is between 0 & N. In this case, if you ever need to have need to fetch all products for given account, you can issue N queries in parallel to fetch everything. Instead of number, you can use product's categories, or something like, that has fixed & known set of values.