Tags: cassandra, datastax-enterprise, failover, opscenter

Cassandra improve failover time


We are using a 3-node Cassandra cluster (each node on a different VM) and are currently investigating failover times during write and read operations when one of the nodes dies. Failover times are pretty good when shutting down one node gracefully; however, when killing a node (by shutting down the VM), the latency during the tests is about 12 seconds. I guess this has something to do with a TCP timeout?

Is there any way to tweak this?

Edit: At the moment we are using Cassandra version 2.0.10 and the Java client driver, version 2.1.9.

To describe the situation in more detail: the write/read operations are performed at the QUORUM consistency level with a replication factor of 3. The cluster consists of 3 nodes (c1, c2, c3), each on a different host (VM). The client driver is connected to c1. During the tests I shut down the host for c2. From then on we observe that the client blocks for > 12 seconds, until the other nodes realize that c2 is gone. So I think this is not a client issue, since the client is connected to node c1, which is still running in this scenario.
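
For reference, a minimal sketch of how such a test could issue requests at QUORUM, assuming the DataStax Java driver 2.1 API (keyspace, table and column names are made up for illustration):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class QuorumFailoverTest {
        public static void main(String[] args) {
            // c1 is only the contact point; the driver discovers c2 and c3 from cluster metadata
            Cluster cluster = Cluster.builder().addContactPoint("c1").build();
            Session session = cluster.connect("test_ks"); // hypothetical keyspace

            // Write and read at QUORUM: with RF = 3, two of the three replicas must respond
            Statement write = new SimpleStatement("INSERT INTO kv (id, value) VALUES (1, 'x')") // hypothetical table
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            Statement read = new SimpleStatement("SELECT value FROM kv WHERE id = 1")
                    .setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(read);

            cluster.close();
        }
    }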

Edit: I don't believe that running Cassandra inside a VM affects the network stack. In fact, killing the VM has the effect that the TCP connections are not terminated. In this case, a remote host can notice the failure only through some timeout mechanism (either a timeout in the application-level protocol or a TCP timeout). If the process is killed at the OS level, the TCP stack of the OS will take care of terminating the TCP connection (IMHO with a TCP reset), enabling a remote host to be notified about the failure immediately. However, it might be important that even in situations where a host crashes due to a hardware failure, or where a host is disconnected due to an unplugged network cable (in both cases the TCP connection will not be terminated immediately), the failover time is low. I've tried to SIGKILL the Cassandra process inside the VM. In this case the failover time is about 600 ms, which is fine.
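
One application-level timeout worth checking on the client side is the driver's per-request read timeout (configured via SocketOptions in the Java driver 2.1, with a default of 12000 ms), since a request sent to a silently dead node will not fail faster than that timeout. A minimal sketch, assuming the DataStax Java driver 2.1 API; the values are illustrative, not recommendations:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.SocketOptions;

    public class DriverTimeouts {
        public static void main(String[] args) {
            SocketOptions socketOptions = new SocketOptions()
                    .setConnectTimeoutMillis(2000)  // illustrative value
                    .setReadTimeoutMillis(3000)     // lower than the 12000 ms default
                    .setKeepAlive(true);            // TCP keepalive on the driver's connections

            Cluster cluster = Cluster.builder()
                    .addContactPoint("c1")
                    .withSocketOptions(socketOptions)
                    .build();
            // ... run the test workload, then cluster.close()
        }
    }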

kind regards


Solution

  • Failover times are pretty good when shutting down one node gracefully; however, when killing a node (by shutting down the VM), the latency during the tests is about 12 seconds

    12 secs is a pretty huge value. Some questions before investigating further.

    Edit: for a node to be marked down, the gossip protocol uses the Phi Accrual Failure Detector. Instead of having a binary state (UP/DOWN), the algorithm adjusts a suspicion level, and if that value rises above a threshold, the node is considered down.

    This algorithm is necessary to avoid marking a node down because of a transient network issue.

    Look in the cassandra.yaml file for this config:

    # phi value that must be reached for a host to be marked down.
    # most users should never need to adjust this.
    # phi_convict_threshold: 8
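
    For intuition, here is a rough sketch of the phi calculation (not Cassandra's actual code), assuming heartbeat inter-arrival times are modelled with an exponential distribution:

    class PhiSketch {
        // Illustrative only: suspicion level for a peer, given the time since its last
        // gossip heartbeat and the mean interval between its recent heartbeats.
        static double phi(long nowMillis, long lastHeartbeatMillis, double meanIntervalMillis) {
            double elapsed = nowMillis - lastHeartbeatMillis;
            // Under the exponential model, P(silence lasts longer than elapsed) = e^(-elapsed/mean),
            // and phi is -log10 of that probability.
            return elapsed / (meanIntervalMillis * Math.log(10.0));
        }
        // A peer is convicted (marked DOWN) once phi exceeds phi_convict_threshold.
    }

    Under this simplified model, the time to conviction scales with phi_convict_threshold times the mean heartbeat interval, which is why lowering the threshold shortens detection time at the cost of more spurious DOWN markings.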
    

    Another question is: what load balancing policy are you using in the driver? And did you use a speculative retry policy?
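
    For completeness, a hedged sketch of what those two driver-side settings could look like with the Java driver 2.1 (speculative executions are available in the driver from version 2.1.6; the policy and delay values below are illustrative, not recommendations — note that "speculative retry" can also refer to the server-side speculative_retry table option):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;
    import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
    import com.datastax.driver.core.policies.TokenAwarePolicy;

    public class DriverPolicies {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder()
                    .addContactPoint("c1")
                    // Prefer replicas that own the requested partition, falling back to
                    // round-robin within the local datacenter.
                    .withLoadBalancingPolicy(new TokenAwarePolicy(new DCAwareRoundRobinPolicy()))
                    // If a node has not answered after 500 ms, send the request to the next
                    // node as well instead of waiting for the full read timeout.
                    .withSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(500, 1))
                    .build();
            // ... run the workload, then cluster.close()
        }
    }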