Tags: database, cassandra, rhel, cassandra-2.0

How Does the Cassandra Cluster Seed Provider Work?


I have a question about the Cassandra seed_provider setting. In my environment, 3 Cassandra nodes need to be set up as a cluster. How should I define the seeds in cassandra.yaml? I'm confused because most of the tutorials give different answers.

Example:

  • Host A - 192.168.1.1
  • Host B - 192.168.1.2
  • Host C - 192.168.1.3

The following is my current setup for Host A. Is it correct?

What about the configuration for Host B and Host C?

# any class that implements the SeedProvider interface and has a
# constructor that takes a Map<String, String> of parameters will do.
seed_provider:
    # Addresses of hosts that are deemed contact points. 
    # Cassandra nodes use this list of hosts to find each other and learn
    # the topology of the ring.  You must change this if you are running
    # multiple nodes!
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          # seeds is actually a comma-delimited list of addresses.
          # Ex: "<ip1>,<ip2>,<ip3>"
          - seeds: "192.168.1.1,192.168.1.2,192.168.1.3"

Solution

  • For starters, you should not need to change the class_name of the seed_provider. AFAIK, only one seed provider ships with Cassandra. The setting was made "pluggable" so that custom seed providers can be written.

    For seeds, I don't recommend putting every node in the seed list. If there are only 3 nodes, then just list 1 or 2. Seed nodes do not bootstrap data, and they require a repair to become consistent after being replaced. This can make adding nodes difficult.

    But as far as I can see, your current config will work. I would just limit the seed list to a maximum of 2 nodes.
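
    To make that concrete, here is a rough sketch of the seed_provider section with two seeds. It assumes Host A and Host B are picked as the seeds (that choice is just for illustration; any two of the three would do). The same section would normally go into cassandra.yaml on Host A, Host B and Host C alike, so that all nodes agree on the contact points:

    ```yaml
    seed_provider:
        # Keep the default provider that ships with Cassandra.
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              # Assumption for illustration: Host A and Host B act as seeds.
              # Host C is left out, so it bootstraps normally when it joins.
              - seeds: "192.168.1.1,192.168.1.2"
    ```

    With this layout, Host C is not a seed, so it streams its data when it joins instead of needing a repair.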

    Just remember that there are two main requirements for the seed_list:

    • If you are starting the first node in the cluster, its IP must be in the seed_list.
    • At least one of the nodes in the seed_list must be running.

    Would you mind explaining further what the impact is if I go ahead and add all 3 nodes to the seed list? Why would you add only 1 or 2 nodes to the seed list?

    Sure, it all goes back to this:

    Seed nodes do not bootstrap data

    Therefore, listing all 3 nodes in the seed_list on all 3 nodes opens the door to the following problems:

    • If node A is started and data is written to it before node B or C has joined the cluster, that data will not stream to node B or C.
    • If in the future, node A fails and is replaced, data will not stream to the replacement node.

    In these cases, a nodetool repair will need to be run to get the initial data onto the newly added node.