I am trying to understand how to choose the shard key in Clickhosue ? and how clickhosue chooses which shard? for example, i have a table with 3 columns : user_id, timestamp, city_id.
should i shard by user_id or by City?
i use murmurHash3_64 function.
murmurHash3_64(city_id = 1) return :
┌──murmurHash3_64(1)─┐
│ 956517343494314387 │
└────────────────────┘
what is the mechanism in Clickhouse to choose the shard id when i have 3 shards or 2 shards?
https://clickhouse.tech/docs/en/engines/table-engines/special/distributed/
First of all you need to understand why do you need sharding by some meaningful column. Why you could not use rand()
?
Usually it's not a question what to use, because the sharding naturally follows the business requirements. If you don't have such requirements then you should use rand()
.
CH uses modulo operation + weight . It's very simple. If you have 6 shards then 956517343494314387 % 6 = 5 === shard number 5. So the rows with the same city_id will be placed on the same shard.
So if you chose city_id as shard key and the distribution by the city usually unequal so the shading will be unequal too. All rows for the big cities like New York or Mexico will in the one shard.
So user_id looks more appropriate as sharding key.