We have always wondered why one of our clusters shows that an Analytics node owns data. I have edited the IPs, tokens, and host IDs for readability.
% nodetool status
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 172.32.x.x 46.83 GB 18.5% someguid 0 rack1
UN 172.32.x.x 60.26 GB 33.3% anotherguid ranbignumber rack1
UN 172.32.x.x 63.51 GB 14.8% anotherguid ranbignumber rack1
Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 172.32.x.x 28.91 GB 0.0% someguid 100 rack1
UN 172.32.x.a 30.41 GB 33.3% someguid ranbignumber rack1
UN 172.32.x.x 17.46 GB 0.0% someguid ranbignumber rack1
So does the Analytics node with IP 172.32.x.a actually own data? If so, do we need to back it up? Also, would decommissioning the node move the data back onto the appropriate nodes?
This is the node I am referring to from the nodetool status output above, in the Analytics datacenter:
UN 172.32.x.a 30.41 GB 33.3% someguid ranbignumber rack1
Again, the questions above (updated with the answers provided below).
Here is the updated output for:
% nodetool status our_important_keyspace
Datacenter: Cassandra
=====================
Status Address Load Owns (effective)
UN 2 63.16 GB 81.5%
UN 1 47.21 GB 33.3%
UN 3 59.87 GB 85.2%
Datacenter: Analytics
=====================
Status Address Load Owns (effective)
UN 3 17.74 GB 33.3%
UN 2 30.62 GB 33.3%
UN 1 29.21 GB 33.3%
Backing up Analytics today - awesome answer, and probably saved us a TON of pain.
The first thing you need to do is run nodetool status or dsetool ring with the keyspace that your data is stored in. This will show you the ownership as dictated by the replication strategy of that keyspace. What you are looking at now is most likely the ownership as set by the raw token values. If your keyspace were named "important_data", you would run "nodetool status important_data".
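For example, using the keyspace name from your output above (our_important_keyspace), and assuming cqlsh can connect with its defaults, you could check both the per-keyspace ownership and the replication settings behind it:

# Ownership as dictated by the keyspace's replication strategy
% nodetool status our_important_keyspace

# Confirm which strategy and per-datacenter replica counts the keyspace uses
% cqlsh -e "DESCRIBE KEYSPACE our_important_keyspace;"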
The replication strategy on your keyspace is key to determining which nodes are responsible for data in your cluster. In any case, a multi-DC cluster should be using a NetworkTopologyStrategy, which lets you specify how many replicas of your data should live in each datacenter. For example, if you wanted to make sure the data was replicated twice in the Cassandra datacenter but only once in the Analytics datacenter, you would use a replication map like {'Cassandra': 2, 'Analytics': 1}. This would mean that every piece of data is replicated 3 times cluster-wide. If you really did not want the data copied to the Analytics nodes (which would be detrimental to analytics performance), you could set 'Analytics': 0 or omit that entry altogether.
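As a sketch, using the "important_data" keyspace name from the example above, the strategy is set or changed in CQL; if you change replication on an existing keyspace, run a repair afterwards so existing data gets streamed to its new replicas:

# Set per-datacenter replica counts on the keyspace
% cqlsh -e "ALTER KEYSPACE important_data WITH replication = {'class': 'NetworkTopologyStrategy', 'Cassandra': 2, 'Analytics': 1};"

# Then, on each node, stream existing data to any new replicas
% nodetool repair important_data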
Your backup strategy should always back up at least one full replica of the data, but it is most likely easiest to just back up every node, or at least every node in one datacenter (since you could bootstrap the others off of it).
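A minimal sketch of a per-node backup using snapshots, assuming the keyspace name from your output and an arbitrary tag name:

# Run on each node being backed up; flush first so the snapshot captures
# everything currently in memtables
% nodetool flush our_important_keyspace
% nodetool snapshot -t analytics_backup our_important_keyspace

# Snapshots are hard links under each table's snapshots/ directory; copy them
# off the node with your usual tooling, then clean up
% nodetool clearsnapshot -t analytics_backup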
The node will only have data if you want it to via the replication strategy, and in this case you will need to decommission it when removing the node, just as you would with any other node in the cluster. Most users do find it useful to have replicas in their Analytics datacenter, because this allows for faster access when using various analytics tools.
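If you do later remove the Analytics node, the decommission is run on the node itself; it streams that node's ranges to the remaining replicas before it leaves the ring. A sketch, assuming you run it locally on the node being removed (e.g. 172.32.x.a):

# Run ON the node being removed; it streams its data to the nodes taking over
# its token ranges, then leaves the ring
% nodetool decommission

# Watch the streaming progress from any node
% nodetool netstats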