Let's say I have a 2-node cluster where all the nodes have identical data_file_directories (say, 3 folders) configured in cassandra.yaml. For example:
data_file_directories:
- E:/Cassandra/data/var/lib/cassandra/data
- K:/Cassandra/data/var/lib/cassandra/data
- F:/Cassandra/data/var/lib/cassandra/data
Now let's say I add a 3rd node to the cluster with a different data_file_directories setting (say, 1 folder):
data_file_directories:
- B:/Cassandra/data/var/lib/cassandra/data
Is it incorrect to do so? During rebalancing, will data from the 3 directories on the existing nodes flow to the 1 directory on the new node?
Nate McCall (the current Apache Cassandra Project Chair) answered a similar question here: How does cassandra split keyspace data when multiple directories are configured?
In short, this should be fine. Cassandra spreads the data evenly across the entries in data_file_directories, regardless of how many there are. Additionally, the number of tokens that a node is responsible for is independent of this setting, so you shouldn't see any hot-spots or unbalancing (at least not due to this).
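To make that concrete, here is a rough toy model in Python. It is only a sketch, not Cassandra's actual placement logic: the node names, the MD5-based hashing, the 32-bit token space and the per-key directory choice are all assumptions made for illustration (real Cassandra uses its own partitioner, and how it picks a directory depends on the version). The only point it demonstrates is that which node owns a key is decided by tokens alone, and the directory list only matters afterwards, locally on that node.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2**32  # toy token space (assumption), not Cassandra's real token range

# Hypothetical node names; directory lists taken from the question.
NODES = {
    "node1": ["E:/Cassandra/data/var/lib/cassandra/data",
              "K:/Cassandra/data/var/lib/cassandra/data",
              "F:/Cassandra/data/var/lib/cassandra/data"],
    "node2": ["E:/Cassandra/data/var/lib/cassandra/data",
              "K:/Cassandra/data/var/lib/cassandra/data",
              "F:/Cassandra/data/var/lib/cassandra/data"],
    "node3": ["B:/Cassandra/data/var/lib/cassandra/data"],
}


def toy_hash(text):
    """Stable stand-in hash (MD5); not the partitioner Cassandra uses."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


def build_ring(nodes, num_tokens=256):
    """Give every node the same number of tokens, ignoring its directory count."""
    return sorted(
        (toy_hash(f"{node}-{i}") % RING_SIZE, node)
        for node in nodes
        for i in range(num_tokens)
    )


def simulate(num_keys=30000):
    """Place keys on the ring, then spread each node's keys over its own directories."""
    ring = build_ring(NODES)
    tokens = [token for token, _ in ring]
    per_node, per_dir = Counter(), Counter()
    for i in range(num_keys):
        key = f"key-{i}"
        # Step 1: the owner is the node holding the first token >= hash(key),
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(tokens, toy_hash(key) % RING_SIZE) % len(ring)
        node = ring[idx][1]
        # Step 2: only now do the directories matter, and only on that node.
        dirs = NODES[node]
        directory = dirs[toy_hash("dir:" + key) % len(dirs)]
        per_node[node] += 1
        per_dir[(node, directory)] += 1
    return per_node, per_dir


if __name__ == "__main__":
    per_node, per_dir = simulate()
    for node, count in sorted(per_node.items()):
        print(node, count)
    for (node, directory), count in sorted(per_dir.items()):
        print(node, directory, count)
```

In this model each node ends up with roughly a third of the keys; node3 simply keeps its whole share in its single directory, while node1 and node2 split theirs three ways.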
That being said, I'll add the following points:
Pro-tip: If you can, try to use an automated deployment tool like Chef or Spinnaker. That way the config of a new node is essentially a "cookie-cutter" of all the other nodes in your cluster.