Search code examples
dockercassandrareplicationcassandra-3.0

Can I have some keyspaces replicated to some nodes?


I am trying to build multiple API for which I want to store the data with Cassandra. I am designing it as if I would have multiple hosts but, the hosts I envisioned would be of two types: trusted and non-trusted.

Because of that I have certain data which I don't want to end up replicated on a group of the hosts but the rest of the data to be replicated everywhere.

I considered simply making a node for public data and one for protected data but that would require the trusted hosts to run two nodes and it would also complicate the way the API interacts with the data.

I am building it in a docker container also, I expect that there will be frequent node creation/destruction both trusted and not trusted.

I want to know if it is possible to use keyspaces in order to achieve my required replication strategy.


Solution

  • You could have two Datacenters one having your public data and the other the private data. You can configure keyspace replication to only replicate that data to one (or both) DCs. See the docs on replication for NetworkTopologyStrategy

    However there are security concerns here since all the nodes need to be able to reach one another via the gossip protocol and also your client applications might need to contact both DCs for different reads and writes.

    I would suggest you look into configuring security perhaps SSL for starters and then perhaps internal authentication. Note Kerberos is also supported but this might be too complex for what you need at least now.

    You may also consider taking a look at the firewall docs to see what ports are used between nodes and from clients so you know which ones to lock down.

    Finally as the above poster mentions, the destruction / creation of nodes too often is not good practice. Cassandra is designed to be able to grow / shrink your cluster while running, but it can be a costly operation as it involves not only streaming data from / to the node being removed / added but also other nodes shuffling around token ranges to rebalance.

    You can run nodes in docker containers, however note you need to take care not to do things like several containers all accessing the same physical resources. Cassandra is quite sensitive to io latency for example, several containers sharing the same physical disk might render performance problems.