Tags: cassandra, partitioning, sharding

Sharding large partitions in Cassandra: how to keep a fixed shard size?


I read this post on how to deal with large partitions and partitioning hotspots. Their solution is to add a sharding key as part of the partition key and keep each shard at a fixed size, say 1000 rows. The fixed shard size even helps with pagination.

But my question is: how can we keep the shard size fixed? In my understanding, the common approach to the hotspot issue is to add a sharding key (e.g. random_number % n) to the partition key to split up hotspots, but that does not bound the shard size, so I'm wondering how their approach achieves it.
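To make the question concrete, here is a minimal Python sketch (all names are illustrative, not from the post) of the common modulo-sharding approach: a shard component spreads the rows of a hot partition across n buckets, but nothing caps how large each bucket grows as the partition grows.

```python
import random
from collections import Counter

NUM_SHARDS = 10  # fixed shard count chosen up front (illustrative)

def shard_key(partition_key: str) -> tuple:
    # Append a random shard number to the partition key so that writes
    # to a hot logical partition spread across NUM_SHARDS physical ones.
    shard = random.randrange(NUM_SHARDS)  # i.e. random_number % n
    return (partition_key, shard)

# Simulate 100,000 writes to one hot logical partition.
sizes = Counter(shard_key("hot-user") for _ in range(100_000))

# Writes are spread roughly evenly, but every shard keeps growing with
# the partition: nothing limits a shard to, say, 1000 rows.
print(min(sizes.values()), max(sizes.values()))
```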


Solution

  • The post details a number of approaches to the sharding problem. The first just uses a fixed number of shards, with no way to track how many shards you have stored for a given partition.

    The second solution adds a static count column which provides that information, but it falls foul of read-before-write and race conditions when inserting into the shard, especially when data is inserted in parallel. If you get around that, however, and you know the size of a row is relatively static, then you can use the counter to control the shard size (approximately). If row sizes vary considerably, it's a rough guess at best.

    The third solution is the same as your % n approach, in that a fixed number of shards, or 'buckets', is assumed to exist.

    I would start the process, though, by calculating what you expect each partition to contain and work from there, rather than prematurely optimising.
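    The second approach above can be sketched as follows, assuming a single writer (names are illustrative): the current row count for the logical partition, which in Cassandra would be read from a static column before each insert, determines the shard, so shards fill up to an exact fixed size. The read-then-write step is precisely where concurrent writers would race, which is the caveat noted above.

    ```python
    SHARD_SIZE = 1000  # target fixed shard size

    # In Cassandra this count would live in a static column and require
    # a read before each write; here a dict stands in for that read.
    row_counts: dict = {}

    def next_shard(partition_key: str) -> int:
        # Read-before-write: fetch the current count, derive the shard.
        count = row_counts.get(partition_key, 0)
        row_counts[partition_key] = count + 1
        # Rows 0..999 land in shard 0, rows 1000..1999 in shard 1, etc.
        return count // SHARD_SIZE

    shards = [next_shard("hot-user") for _ in range(2500)]
    # Shards 0 and 1 hold exactly 1000 rows each; shard 2 holds the rest.
    ```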