Tags: cassandra, cassandra-2.1

Should Data directories of cassandra cluster nodes be identical?


Let's say I have a 2-node cluster where every node has identical data_file_directories (with, say, 3 folders) configured in cassandra.yaml. For example:

data_file_directories:
    - E:/Cassandra/data/var/lib/cassandra/data
    - K:/Cassandra/data/var/lib/cassandra/data
    - F:/Cassandra/data/var/lib/cassandra/data

Now let's say I add a 3rd node to the cluster with different data_file_directories (with, say, 1 folder):

data_file_directories:
    - B:/Cassandra/data/var/lib/cassandra/data

Is it incorrect to do so? When the data is rebalanced, will data from the 3 directories of the existing nodes flow into the 1 directory of the new node?


Solution

  • Nate McCall (the current Apache Cassandra Project Chair) answered a similar question here: How does cassandra split keyspace data when multiple directories are configured?

    In short, this should be fine. Cassandra spreads data evenly across the entries in data_file_directories, regardless of how many there are. Additionally, the number of tokens a node is responsible for is independent of this setting, so you shouldn't see any hot spots or imbalance (at least not due to this).
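    To build intuition for why the directory count doesn't cause imbalance, here is a minimal sketch. It is an assumption-laden simplification, not Cassandra's actual allocation code: it models a "write each new SSTable to the directory with the most free space" policy, which over time spreads data roughly evenly across however many directories a node has.

    ```python
    # Simplified model (an assumption, NOT Cassandra's real implementation):
    # each new SSTable lands in whichever data directory has the most free space.

    def pick_data_directory(free_bytes):
        """Return the index of the directory with the most free space."""
        return max(range(len(free_bytes)), key=lambda i: free_bytes[i])

    def write_sstables(free_bytes, sstable_sizes):
        """Assign each SSTable to a directory; return bytes used per directory."""
        used = [0] * len(free_bytes)
        free = list(free_bytes)
        for size in sstable_sizes:
            i = pick_data_directory(free)
            used[i] += size
            free[i] -= size
        return used

    # Three equally sized disks receiving 30 equally sized SSTables:
    print(write_sstables([1000, 1000, 1000], [10] * 30))  # [100, 100, 100]

    # A single-directory node simply receives everything in that one directory:
    print(write_sstables([3000], [10] * 30))  # [300]
    ```

    The point of the sketch: whether a node has 3 directories or 1, the node as a whole still holds the same share of the cluster's data; the directory list only decides how that share is laid out on local disks.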

    That being said, I'll add the following points:

    • Specifying multiple data directories can help if they are different physical mount points. That way, if one disk fills up or fails unexpectedly, the node can keep running.
    • If I were planning to add a node and keep all of them for the long term, I would make the new node's config match the original nodes as closely as possible. This especially helps in a large environment where you are responsible for multiple nodes and clusters: you won't have to remember how or why one particular node in a cluster is different when you need to troubleshoot.
    • The exception to the last point would be if I had decided to move to a single data directory going forward. But then I'd also have a plan to decommission the existing nodes and replace them with nodes that share that configuration.

    Pro-tip: If you can, try to use an automated deployment tool like Chef or Spinnaker. That way the config of a new node is essentially a "cookie-cutter" of all the other nodes in your cluster.
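    Even without a full deployment tool, you can catch this kind of drift with a simple check. The sketch below is illustrative only (the node names and config mapping are hypothetical, reusing the directory paths from the question): it flags any node whose data_file_directories differ from a chosen baseline.

    ```python
    # Hedged sketch: detect config drift by comparing each node's
    # data_file_directories against a baseline. Node names are made up.

    baseline = [
        "E:/Cassandra/data/var/lib/cassandra/data",
        "K:/Cassandra/data/var/lib/cassandra/data",
        "F:/Cassandra/data/var/lib/cassandra/data",
    ]

    node_configs = {
        "node1": baseline,
        "node2": baseline,
        "node3": ["B:/Cassandra/data/var/lib/cassandra/data"],
    }

    def find_drift(configs, baseline):
        """Return node names whose directory list differs from the baseline."""
        return sorted(n for n, dirs in configs.items() if dirs != baseline)

    print(find_drift(node_configs, baseline))  # ['node3']
    ```

    Running a check like this before (and after) adding a node makes the "cookie-cutter" property easy to verify rather than something you have to remember.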