
How do the Cassandra gossip protocol and phi_threshold work?


Current setup: Cassandra 2.2.5, with the gossip interval at its default of 1 second and the phi threshold set to 8. The problem I am facing is spikes in hints. One of the reasons hints go up is that a node gets marked down (gossip has not heard from it for long enough to exceed the phi threshold).

I read an article saying that a phi threshold of 8 corresponds to about 18 seconds, give or take a few seconds. Now I need to understand the cause: what is blocking gossip from communicating for 18 seconds? What is the checklist that needs to be satisfied for gossip to communicate?


Solution

    • Re: "How do the Cassandra gossip protocol and phi_threshold work?": Phi is approximated as phi = (tnow - tLast) / mean, where tnow is the current time, tLast is the time the last heartbeat was received, and mean is the mean interval between heartbeats. A node is marked down when phi > phi_threshold / 0.434 (0.434 is approximately 1 / ln(10)). For your settings, and assuming a mean of 1 (that is, the node usually receives heartbeats 1 second apart), a node will be marked down if no heartbeat has been received from it for 8 / 0.434 ≈ 18.4 seconds. The paper documenting the algorithm can be found here.

    • Re: "What is the checklist that needs to be satisfied for gossip to communicate?": to me there are a few things:

      • the network: gossip messages are being dropped, or the gossip port (7000/7001) is blocked;
      • the nodes themselves: a node is busy or unresponsive (e.g. in a GC pause or running a heavy operation), so it sends few or no gossip messages.
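
To make the arithmetic in the first point concrete, here is a minimal Python sketch of the approximation described above. It is not Cassandra's actual code: the function names are hypothetical, and the mean is taken as a fixed input, whereas a real failure detector would presumably estimate it from recently observed heartbeat intervals.

```python
import math

# 1 / ln(10), roughly 0.434: the constant the answer divides by.
PHI_FACTOR = 1.0 / math.log(10.0)


def phi(now_s, last_heartbeat_s, mean_interval_s):
    """Approximate phi: time since the last heartbeat over the mean interval."""
    return (now_s - last_heartbeat_s) / mean_interval_s


def is_marked_down(phi_value, phi_threshold=8.0):
    """Equivalent to phi > phi_threshold / 0.434 from the answer."""
    return PHI_FACTOR * phi_value > phi_threshold


def seconds_until_down(phi_threshold=8.0, mean_interval_s=1.0):
    """Silence (in seconds) needed before a node is marked down."""
    return phi_threshold * mean_interval_s / PHI_FACTOR


if __name__ == "__main__":
    # Settings from the question: threshold 8, heartbeats about 1 s apart.
    print(round(seconds_until_down(8.0, 1.0), 2))  # 18.42
    print(is_marked_down(phi(18.0, 0.0, 1.0)))     # False
    print(is_marked_down(phi(19.0, 0.0, 1.0)))     # True
```

Under this approximation the tolerated silence scales linearly with the mean interval: if heartbeats normally arrive 2 seconds apart, the same threshold of 8 allows roughly 36.8 seconds before the node is marked down.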