Search code examples
design-patternscassandradatabase-partitioning

Suggestion required to design partitioning in cassandra


I have to design the database for customers having prices for millions of materials they acquire through multiple suppliers for the next 24 months. So the database will store prices on a daily basis for every material supplied by a specific supplier for the next 24 months. So we keep past data. Now lookups will happen on:

  1. Find the price for a material by the supplier as of the specific date by a customer.
  2. Find the price for a material by the supplier for a time period by a customer.

I can think for the primary key as:

  1. Partition key: (customer Id, material Id, supplier Id, date) -- will this end up in perf issues as it will make so many partitions in long run?
  2. Partition Key: (customer Id, material Id, supplier Id, monthbucket), clustering key: date -- monthbucket will store data for material on monthly basis on same partition and will have value like '202002' for Feb 2020 dates.

Another question is how can I make sure my data distributes evenly across nodes.

Note:

  1. combination of customer, material, supplier and date is unique.
  2. two customer can have similar material ids.

Key Points: 1. Some customers can have a very small data set while others can have huge data. How well we can distribute data evenly across partitions as Date is a constant field for all customers. Also, material Ids can be the same among different customers as that is there internal representation (maybe a numeric or alphanumeric)

  1. Number of Suppliers per customer and material id varies from 1 - 20 in number. do you have any suggestions or questions?

Thanks.


Solution

  • It depends on how many suppliers do you have per customer. Because you always have queries on the customer ID & material ID, then I suggest that at least that columns are making into partition key. If you have too many suppliers, you can move it into partition key as well. And I would avoid having monthly bucket - it will make querying hard.

    So you can go with following primary keys:

    1. ((customer, material, supplier), date)
    2. ((customer, material) supplier, date)

    both will allow to have both queries:

    1. select * from table where customer = ... and material = ... and supplier = ... and date = ...
    2. select * from table where customer = ... and material = ... and supplier = ... and date >= start and date <= end

    but I would recommend to go with 1st one the partitions won't be too big, and not too small.