
Redis sentinel failover not working


I've set up Redis Sentinel with three Redis servers on ports 7000, 7001, and 7002 (one master and two slaves) and three sentinels on ports 26379, 26380, and 26381, all on the same machine (an Ubuntu VM).

When I start them up, everything looks good in the logs, and the output of the INFO command against the sentinels looks healthy too. But when I take the master down (by stopping it with Ctrl+C, or by pausing it with DEBUG SLEEP via redis-cli), none of the slave instances is promoted to master; instead, the sentinels keep nominating and trying to connect to the already dead master instance! My configuration is as follows:

Master:

port 7000      
protected-mode no

Slave #1:

port 7001
slaveof 10.75.196.216 7000

Slave #2:

port 7002
slaveof 10.75.196.216 7000

Sentinel #1:

port 26379
protected-mode no

sentinel myid bdddadb6e825065398be0bae214891d7ccbd6e2a
sentinel monitor themaster 10.75.196.216 7000 2
sentinel down-after-milliseconds themaster 3000
sentinel failover-timeout themaster 5000
sentinel parallel-syncs themaster 2
sentinel config-epoch themaster 0

# Generated by CONFIG REWRITE
dir "/home/bob/app/sentinel-test/master"
sentinel leader-epoch themaster 322
sentinel known-slave themaster 10.75.196.216 7002
sentinel known-slave themaster 10.75.196.216 7001
sentinel known-sentinel themaster 10.75.196.216 26380 181fb84351d6b96e0120bfa68331738ef111c49f
sentinel known-sentinel themaster 10.75.196.216 26381 8497ee90c1e4525c0f957407fefa77427f427e0d
sentinel current-epoch 322

Sentinel #2:

port 26380
protected-mode no

sentinel myid 181fb84351d6b96e0120bfa68331738ef111c49f
sentinel monitor themaster 10.75.196.216 7000 2
sentinel down-after-milliseconds themaster 3000
sentinel failover-timeout themaster 5000
sentinel parallel-syncs themaster 2

# Generated by CONFIG REWRITE
dir "/home/bob/app/sentinel-test/slave1"
sentinel config-epoch themaster 0
sentinel leader-epoch themaster 322
sentinel known-slave themaster 10.75.196.216 7001
sentinel known-slave themaster 10.75.196.216 7002
sentinel known-sentinel themaster 10.75.196.216 26381 8497ee90c1e4525c0f957407fefa77427f427e0d
sentinel known-sentinel themaster 10.75.196.216 26379 bdddadb6e825065398be0bae214891d7ccbd6e2a
sentinel current-epoch 322

Sentinel #3:

port 26381
protected-mode no

sentinel myid 8497ee90c1e4525c0f957407fefa77427f427e0d
sentinel monitor themaster 10.75.196.216 7000 2
sentinel down-after-milliseconds themaster 3000
sentinel failover-timeout themaster 5000
sentinel parallel-syncs themaster 2

# Generated by CONFIG REWRITE
dir "/home/bob/app/sentinel-test/slave2"
sentinel config-epoch themaster 0
sentinel leader-epoch themaster 322
sentinel known-slave themaster 10.75.196.216 7001
sentinel known-slave themaster 10.75.196.216 7002
sentinel known-sentinel themaster 10.75.196.216 26379 bdddadb6e825065398be0bae214891d7ccbd6e2a
sentinel known-sentinel themaster 10.75.196.216 26380 181fb84351d6b96e0120bfa68331738ef111c49f
sentinel current-epoch 322
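
For reference, this is roughly how the instances are started and inspected (a sketch; the config file names are placeholders, not my actual ones):

# Start the master and the two slaves
redis-server master-7000.conf
redis-server slave-7001.conf
redis-server slave-7002.conf

# Start the three sentinels
redis-server sentinel-26379.conf --sentinel
redis-server sentinel-26380.conf --sentinel
redis-server sentinel-26381.conf --sentinel

# Ask the first sentinel what it currently knows
redis-cli -p 26379 SENTINEL master themaster
redis-cli -p 26379 SENTINEL slaves themaster
redis-cli -p 26379 SENTINEL get-master-addr-by-name themaster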

Master console log: (screenshot omitted)

Sentinel #1 console log: (screenshot omitted)

Sentinel #1 INFO command result: (screenshot omitted)

Sentinel #1 log after the master went down: (screenshot omitted)

What's wrong with my configuration?

Thanks in advance.


Solution

  • OK, if you look at the sentinel log, you'll notice that at startup, even before the master instance stops working, it reports the two slaves as down:

    redis error : +sdown slave

    This is most likely why none of the slaves is considered good enough to become the new master, and why we see the -failover-abort-no-good-slave error in the sentinel log after the master goes down.
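
    One way to confirm this (a quick check, assuming the ports above) is to ask a sentinel for the state of the slaves directly; a slave flagged s_down will never be considered for promotion:

    # The "flags" field in the reply shows "slave" for a healthy slave
    # and "s_down,slave" for one the sentinel considers subjectively down
    redis-cli -p 26379 SENTINEL slaves themaster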

    Then I remembered that I had been getting the following error when trying to set keys on the slave nodes via redis-cli:

    (error) READONLY You can't write against a read only slave

    So I decided to fix this READONLY error by adding the following line to both slave config files:

    slave-read-only no
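
    Each slave config then ends up looking like this (sketched from the configs above, shown for slave #1):

    port 7001
    slaveof 10.75.196.216 7000
    slave-read-only no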

    After making this change and restarting everything, the +sdown slave error no longer appeared in the sentinel log, and the main issue was fixed too: the sentinels are now able to promote a slave to master when the master goes down.
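
    To verify the failover (a sketch, using the ports from this setup):

    # Should report 10.75.196.216 7000 while the master is alive
    redis-cli -p 26379 SENTINEL get-master-addr-by-name themaster

    # Stop the master (same effect as Ctrl+C)
    redis-cli -p 7000 SHUTDOWN NOSAVE

    # After down-after-milliseconds (3000 ms here) plus the election,
    # the same query should report port 7001 or 7002 as the new master
    redis-cli -p 26379 SENTINEL get-master-addr-by-name themaster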

    I also saw someone on the internet with a similar +sdown issue, but in their case the problem turned out to be authentication.

    I appreciate everyone who shares their knowledge and experience. Hope this helps someone.