We have always wondered why one of our clusters shows that an Analytics node owns data. I have edited the IPs, tokens, and host IDs for readability.
% nodetool status
Datacenter: Cassandra
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 172.32.x.x 46.83 GB 18.5% someguid 0 rack1
UN 172.32.x.x 60.26 GB 33.3% anotherguid ranbignumber rack1
UN 172.32.x.x 63.51 GB 14.8% anotherguid ranbignumber rack1
Datacenter: Analytics
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Owns Host ID Token Rack
UN 172.32.x.x 28.91 GB 0.0% someguid 100 rack1
UN 172.32.x.a 30.41 GB 33.3% someguid ranbignumber rack1
UN 172.32.x.x 17.46 GB 0.0% someguid ranbignumber rack1
So does the Analytics node with IP 172.32.x.a actually own data? If so, do we need to back it up? Also, would decommissioning the node move the data back onto the appropriate nodes?
This is the node I am referring to from the nodetool status output above, in the Analytics datacenter:
UN 172.32.x.a 30.41 GB 33.3% someguid ranbignumber rack1
Again, the questions above (updated with the answers provided below).
Here is the updated output for:
% nodetool status our_important_keyspace
Datacenter: Cassandra
=====================
Status Address Load Owns (effective)
UN 2 63.16 GB 81.5%
UN 1 47.21 GB 33.3%
UN 3 59.87 GB 85.2%
Datacenter: Analytics
=====================
Status Address Load Owns (effective)
UN 3 17.74 GB 33.3%
UN 2 30.62 GB 33.3%
UN 1 29.21 GB 33.3%
Backing up Analytics today - awesome answer, and probably saved us a TON of pain.
The first thing you need to do is run nodetool status or dsetool ring with the keyspace that your data is stored in. This will show you the ownership as dictated by the replication strategy of that keyspace. What you are looking at now is most likely the ownership as set by the raw token values. If your keyspace were named "important_data", you would run "nodetool status important_data".
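For example, using the keyspace name from your output above (our_important_keyspace), and assuming cqlsh can connect with its defaults, you could check both the per-keyspace ownership and the replication settings behind it:

# Ownership as dictated by the keyspace's replication strategy
% nodetool status our_important_keyspace

# Confirm which strategy and per-datacenter replica counts the keyspace uses
% cqlsh -e "DESCRIBE KEYSPACE our_important_keyspace;"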
The replication strategy on your keyspace is key to determining which nodes are responsible for data in your cluster. In any case, a multi-DC cluster should be using a NetworkTopologyStrategy, which lets you specify how many replicas of your data should live in each datacenter. For example, if you wanted to make sure the data was replicated twice in the Cassandra datacenter but only once in the Analytics datacenter, you would use a replication map like {'Cassandra': 2, 'Analytics': 1}. This would mean that every piece of data is replicated 3 times cluster-wide. If you really did not want the data copied to the Analytics nodes (which would be detrimental to analytics performance), you could set 'Analytics': 0 or omit that entry altogether.
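As a sketch, using the "important_data" keyspace name from the example above, the strategy is set or changed in CQL; if you change replication on an existing keyspace, run a repair afterwards so existing data gets streamed to its new replicas:

# Set per-datacenter replica counts on the keyspace
% cqlsh -e "ALTER KEYSPACE important_data WITH replication = {'class': 'NetworkTopologyStrategy', 'Cassandra': 2, 'Analytics': 1};"

# Then, on each node, stream existing data to any new replicas
% nodetool repair important_data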
Your backup strategy should always back up at least one full replica of the data, but it is most likely easiest to just back up every node, or at least every node in one datacenter (since you could bootstrap the others off of it).
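A minimal sketch of a per-node backup using snapshots, assuming the keyspace name from your output and an arbitrary tag name:

# Run on each node being backed up; flush first so the snapshot captures
# everything currently in memtables
% nodetool flush our_important_keyspace
% nodetool snapshot -t analytics_backup our_important_keyspace

# Snapshots are hard links under each table's snapshots/ directory; copy them
# off the node with your usual tooling, then clean up
% nodetool clearsnapshot -t analytics_backup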
The node will only have data if you want it to via the replication strategy, and in this case you will need to decommission it when removing the node, just as you would with any other node in the cluster. Most users do find it useful to have replicas in their Analytics datacenter, because this allows for faster access when using various analytics tools.
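If you do later remove the Analytics node, the decommission is run on the node itself; it streams that node's ranges to the remaining replicas before it leaves the ring. A sketch, assuming you run it locally on the node being removed (e.g. 172.32.x.a):

# Run ON the node being removed; it streams its data to the nodes taking over
# its token ranges, then leaves the ring
% nodetool decommission

# Watch the streaming progress from any node
% nodetool netstats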