Tags: cassandra, cassandra-2.1

Will diskspace increase if I increase number of nodes in Cassandra Cluster?


I have a situation in a Cassandra cluster (deployed on EC2 instances) where each node in the cluster is about to run out of disk space. If I add more instances to the Cassandra cluster, will that increase the available disk space?

What I mean is: whenever we are running out of space, can we add more instances to the Cassandra cluster to increase the overall disk space?

And if so, is that the right way to do it?


Solution

  • What I mean is: whenever we are running out of space, can we add more instances to the Cassandra cluster to increase the overall disk space?

    Yes, and yes.

    Consider a 4-node cluster with a replication factor (RF) of 3 and 100 GB of storage per node. Assume that a complete copy of the data initially has a footprint of 60 GB. With 4 nodes and an RF of 3, each node is responsible for 3/4 of the data, or 45 GB.

    Address      Load      Owns      Total
    10.0.0.1     45.0 GB   75.0%     100 GB
    10.0.0.2     45.0 GB   75.0%     100 GB
    10.0.0.3     45.0 GB   75.0%     100 GB
    10.0.0.4     45.0 GB   75.0%     100 GB
    

    With size-tiered compaction (the default), you want to keep each node under 50% of total disk usage. This setup allows for that.
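    The per-node numbers above follow from a simple relationship: with RF replicas spread evenly across N nodes, each node holds roughly RF/N of one full copy of the data. A minimal sketch of that arithmetic (the function name is illustrative, and it assumes perfectly balanced token ownership):

    ```python
    def per_node_load(full_copy_gb, replication_factor, num_nodes):
        """Approximate data stored per node in a balanced cluster."""
        return full_copy_gb * replication_factor / num_nodes

    # 60 GB full copy, RF=3, 4 nodes -> 45.0 GB per node
    print(per_node_load(60, 3, 4))
    ```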

    However, let's say the app team runs a big load overnight. We come in tomorrow morning, and find this:

    Address      Load      Owns      Total
    10.0.0.1     70.0 GB   75.0%     100 GB
    10.0.0.2     70.0 GB   75.0%     100 GB
    10.0.0.3     70.0 GB   75.0%     100 GB
    10.0.0.4     70.0 GB   75.0%     100 GB
    

    Essentially, a complete copy of the data has grown to 93.3 GB (70 GB per node × 4 nodes ÷ RF of 3). To bring the amount of data per disk back down below 50%, we will have to add more nodes.

    But how many?

    If we add a single node (maintaining an RF of 3), each node becomes responsible for 3/5 (60%) of the data, which is 55.98 GB. Close, but not quite there.

    If we add two nodes, that brings us to a total of 6, which means each node is responsible for 3/6 (50%) of the data, which is 46.65 GB. That brings us back under 50% per node, so we should add at least two nodes.
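    The node-count choice can be checked by inverting that estimate: infer the full-copy size from the current per-node load, then grow the cluster until the projected per-node load drops under the 50% target. A hedged sketch under the same balanced-token assumption (the names are illustrative):

    ```python
    def min_nodes(full_copy_gb, replication_factor, disk_gb, target=0.5):
        """Smallest cluster size keeping per-node load under target * disk."""
        n = replication_factor  # can't have fewer nodes than replicas
        while full_copy_gb * replication_factor / n > target * disk_gb:
            n += 1
        return n

    # 70 GB per node across 4 nodes at RF=3 implies a ~93.3 GB full copy
    full_copy = 70.0 * 4 / 3
    print(min_nodes(full_copy, 3, 100))  # 6 nodes needed
    ```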

    After doing so, the cluster should look like this:

    Address      Load       Owns      Total
    10.0.0.1     46.65 GB   50.0%     100 GB
    10.0.0.2     46.65 GB   50.0%     100 GB
    10.0.0.3     46.65 GB   50.0%     100 GB
    10.0.0.4     46.65 GB   50.0%     100 GB
    10.0.0.5     46.65 GB   50.0%     100 GB
    10.0.0.6     46.65 GB   50.0%     100 GB
    

    Note that simply bootstrapping new nodes only streams data to those nodes; it does not remove that data from the existing nodes. For that, you should run nodetool cleanup on each pre-existing node.