cassandra, nosql, partitioning, temporal-workflow

How to decide the number of partitions when setting up Temporal cluster?


In non-RDBMS databases, where increasing the number of partitions can speed up reads and writes through parallelism, what is the downside of having too many partitions?

Let's say that in Cassandra the partition key for every row is unique. What's the downside of this? On the other hand, what if you decide on a very small, fixed number of partitions?

For example, in Temporal (which uses Cassandra), you need to specify a fixed number of partitions while setting up the Temporal cluster. If you have too few partitions, your read/write performance is low at higher loads. But if you increase the number of partitions to a very high number, resource consumption increases. Why does resource consumption increase with a higher number of partitions? (In a general sense, not limited to Temporal.)

Edit: relation to memtable flushing. Will a larger number of partitions cause more frequent flushing of the memtable into SSTables, and hence trigger far more compaction and increase resource usage? If so, why is the number of partitions linked to the frequency of memtable flushing?


Solution

  • Now let's preface this by saying that "partition" has many different definitions, and that is where it can easily get confusing, especially if Temporal's partitions are layered on top of Cassandra partitions. But let us assume we mean partitions as in Cassandra partitions. Data is partitioned by the partition key (the first part of the primary key), and the clustering columns order the rows within each partition (a schema sketch in code follows at the end of this answer).

    Cassandra is designed to scale well with the number of partitions, but not as well with the size of individual partitions. That is, if you have a low number of partitions and keep adding rows to each of them, you are eventually going to run into issues.

    The most common issue here is that Cassandra reads an entire partition into memory when it is queried, so the larger the partition, the more GC pressure you get and the longer the latency on that read.

    Flushing is directly related only to the number of writes; whether you have fewer or more partitions shouldn't really matter in this case.

    However, all of this gets a lot more complicated when deletes/tombstones are involved but that's a whole other can of worms.

    I am not all that familiar with Temporal's usage of Cassandra, but my guess is that it uses 'buckets', spreading one 'workflow or queue' over multiple Cassandra partitions as a way to keep each partition a reasonable size (sketched in code below).

    If your partitions are too large, then read performance will be impacted, because each read has to seek within a large partition. On the other side of the spectrum, if you split your 'workflow or queue' into too many partitions, you will need to read all of those partitions to fetch your queue, thereby consuming more server resources (see the scatter-gather sketch below).
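
To make the partition key / clustering column distinction concrete, here is a minimal sketch in Python using the DataStax cassandra-driver. The keyspace, table, and column names (demo, workflow_events, workflow_id, event_id, payload) are made up for illustration and are not Temporal's actual schema.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (assumes one is running on 127.0.0.1).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# workflow_id is the partition key: it decides which node(s) own the data.
# event_id is a clustering column: it orders rows *within* a partition.
# Every event for one workflow_id lands in the same partition, so a very
# busy workflow grows a single partition without bound -- the "wide
# partition" problem described above.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.workflow_events (
        workflow_id text,
        event_id    bigint,
        payload     blob,
        PRIMARY KEY ((workflow_id), event_id)
    )
""")
```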
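And here is a sketch of the bucketing idea: spreading one logical queue over a fixed number of Cassandra partitions, then reading every bucket back to reassemble it. It reuses the hypothetical demo keyspace from the sketch above; the bucket count, table name, and hashing scheme are illustrative guesses, not how Temporal actually shards its data.

```python
from cassandra.cluster import Cluster

NUM_BUCKETS = 16  # fixed at setup time; changing it later means re-bucketing existing data

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("demo")

# The partition key is (queue_name, bucket), so one queue is spread
# across NUM_BUCKETS partitions instead of one ever-growing partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS queue_tasks (
        queue_name text,
        bucket     int,
        task_id    bigint,
        payload    blob,
        PRIMARY KEY ((queue_name, bucket), task_id)
    )
""")

def bucket_for(task_id: int) -> int:
    # Deterministically spread tasks over the fixed set of buckets.
    return task_id % NUM_BUCKETS

insert = session.prepare(
    "INSERT INTO queue_tasks (queue_name, bucket, task_id, payload) VALUES (?, ?, ?, ?)"
)
session.execute(insert, ("transfers", bucket_for(42), 42, b"task-payload"))

# The cost of many buckets: fetching the whole queue means querying every
# bucket (a scatter-gather), which is the extra read/resource overhead
# mentioned in the answer.
select = session.prepare(
    "SELECT task_id, payload FROM queue_tasks WHERE queue_name = ? AND bucket = ?"
)
tasks = []
for b in range(NUM_BUCKETS):
    tasks.extend(session.execute(select, ("transfers", b)))
```

The trade-off is visible directly in the read loop: more buckets cap the size of any single partition, but every extra bucket is one more partition the server has to touch to answer a "fetch the queue" request.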