I have to design the database for customers having prices for millions of materials they acquire through multiple suppliers for the next 24 months. So the database will store prices on a daily basis for every material supplied by a specific supplier for the next 24 months. So we keep past data. Now lookups will happen on:
I can think for the primary key as:
Another question is how can I make sure my data distributes evenly across nodes.
Note:
Key Points: 1. Some customers can have a very small data set while others can have huge data. How well we can distribute data evenly across partitions as Date is a constant field for all customers. Also, material Ids can be the same among different customers as that is there internal representation (maybe a numeric or alphanumeric)
Thanks.
It depends on how many suppliers do you have per customer. Because you always have queries on the customer ID & material ID, then I suggest that at least that columns are making into partition key. If you have too many suppliers, you can move it into partition key as well. And I would avoid having monthly bucket - it will make querying hard.
So you can go with following primary keys:
((customer, material, supplier), date)
((customer, material) supplier, date)
both will allow to have both queries:
select * from table where customer = ... and material = ... and supplier = ... and date = ...
select * from table where customer = ... and material = ... and supplier = ... and date >= start and date <= end
but I would recommend to go with 1st one the partitions won't be too big, and not too small.