Tags: redis, nosql, quorum, split-brain

How does Redis with sentinels behave after a split brain?


The question is about Redis server and sentinel configuration.

There are two subnetworks and I want to have 4 Redis servers in total, 2 in each subnet. Since there might be connectivity issues, I would like to configure the sentinels to allow a split brain for high availability.

So when a connectivity issue happens, two Redis setups would appear and work independently for some time.

Now the question is what will happen after connectivity between the subnets is restored. Would the sentinels detect the split brain and the two masters? Would they then elect a single master and downgrade the second one to a slave? Would data from the surviving master be pushed to the downgraded master, which would have to drop all the data it gained during the connectivity issue?

Can I configure something so that the data is merged?


Solution

  • There are two ways to handle HA in Redis: Sentinel and Redis Cluster.

    Sentinel

    If a master is not working as expected, Sentinel can start a failover process where a slave is promoted to master, the other additional slaves are reconfigured to use the new master, and the applications using the Redis server are informed about the new address to use when connecting.
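
    As an illustration, here is a minimal sketch of how an application follows the master address through Sentinel, using redis-py; the sentinel addresses and the master name `mymaster` are hypothetical:

    ```python
    from redis.sentinel import Sentinel

    # Hypothetical sentinel endpoints - in a robust deployment these are
    # three machines that fail independently (see the docs quoted below).
    sentinel = Sentinel(
        [("10.0.0.1", 26379), ("10.0.0.2", 26379), ("10.0.0.3", 26379)],
        socket_timeout=0.5,
    )

    # The client asks the sentinels who the current master of "mymaster" is,
    # so after a failover it transparently reconnects to the new master.
    master = sentinel.master_for("mymaster", socket_timeout=0.5)
    master.set("key", "value")

    # Reads can be served by a replica discovered the same way.
    replica = sentinel.slave_for("mymaster", socket_timeout=0.5)
    print(replica.get("key"))
    ```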

    Q: Since there might be connectivity issues, I would like to configure the sentinels to allow a split brain for high availability

    This is an anti-pattern for Sentinel. Here's a similar example with an even number of nodes, explained in the docs:

    Example 1: just two Sentinels, DON'T DO THIS

    In the above configuration we created two masters (assuming S2 could failover without authorization) in a perfectly symmetrical way. Clients may write indefinitely to both sides, and there is no way to understand when the partition heals what configuration is the right one, in order to prevent a permanent split brain condition. So please deploy at least three Sentinels in three different boxes always.
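
    This is also why the 2 + 2 layout from the question cannot even produce the split brain it aims for: a sentinel may only start a failover after winning a leader election among a majority of all known sentinels, regardless of the quorum setting. A back-of-the-envelope sketch of that arithmetic:

    ```python
    # 4 sentinels in total, 2 per subnet, as described in the question.
    total_sentinels = 4
    reachable_in_one_subnet = 2

    # Failover authorization needs a strict majority of ALL known sentinels,
    # independently of the configured quorum.
    majority = total_sentinels // 2 + 1   # -> 3

    # During a symmetric partition each subnet sees only its own pair, so
    # neither side can elect a leader and no failover happens at all; the
    # subnet that lost the original master is simply down for writes.
    print(majority, reachable_in_one_subnet >= majority)   # 3 False
    ```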

    Q: Now the question is what will happen after connectivity between the subnets is restored. Would the sentinels detect the split brain and the two masters?

    This data will be lost forever since when the partition will heal, the master will be reconfigured as a slave of the new master, discarding its data set.

    Q: Would data from the surviving master be pushed to the downgraded master, which would have to drop all the data it gained during the connectivity issue?

    Yes

    Q: Can I configure something so that the data is merged?

    You can't; Redis will never merge anything.
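
    You cannot merge, but you can bound how many writes get thrown away. The Sentinel docs suggest `min-replicas-to-write` / `min-replicas-max-lag` (`min-slaves-*` on older Redis versions) so that an isolated master stops accepting writes once it loses contact with its replicas. A sketch applying the settings at runtime with redis-py (the same two directives can also live in redis.conf); the master address is hypothetical:

    ```python
    import redis

    # Hypothetical master address.
    r = redis.Redis(host="10.0.0.10", port=6379)

    # Refuse writes unless at least 1 replica acknowledged a ping within
    # the last 10 seconds. An isolated ex-master therefore starts rejecting
    # writes after ~10s instead of accumulating data that will be discarded
    # when the partition heals.
    r.config_set("min-replicas-to-write", "1")
    r.config_set("min-replicas-max-lag", "10")
    ```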

    Redis Cluster

    What is this beast for?

    The ability to automatically split your dataset among multiple nodes. The ability to continue operations when a subset of the nodes are experiencing failures or are unable to communicate with the rest of the cluster.

    So it's basically a multi-writer solution. But it doesn't support merge operations either:

    Redis Cluster design avoids conflicting versions of the same key-value pair in multiple nodes as in the case of the Redis data model this is not always desirable. Values in Redis are often very large; it is common to see lists or sorted sets with millions of elements. Also data types are semantically complex. Transferring and merging these kind of values can be a major bottleneck and/or may require the non-trivial involvement of application-side logic, additional memory to store meta-data, and so forth.

    Back to your scenario

    Quoting from the Sentinel docs (https://redis.io/topics/sentinel):

    Fundamental things to know about Sentinel before deploying

    You need at least three Sentinel instances for a robust deployment. The three Sentinel instances should be placed into computers or virtual machines that are believed to fail in an independent way. So for example different physical servers or Virtual Machines executed on different availability zones.

    Note that you can place sentinels on client machines too; this approach is used heavily in the Redis demos: https://redis.io/topics/sentinel

    You can also go with the cluster solution, but it's harder to configure, it has some limitations on multi-key operations (as the sketch below shows), and you'll still need a majority of the nodes to be reachable when one of the subnets goes down in order to keep some sort of HA.
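
    To make the multi-key limitation concrete, here is a small sketch using the cluster client from redis-py (version 4.0 or later); the node address and key names are hypothetical:

    ```python
    from redis.cluster import RedisCluster

    # Any reachable cluster node works as an entry point.
    rc = RedisCluster(host="127.0.0.1", port=7000, decode_responses=True)

    # Keys are sharded by hash slot, and the server rejects multi-key
    # commands whose keys live in different slots (a CROSSSLOT error).
    rc.set("user:1:name", "alice")
    rc.set("user:2:name", "bob")   # almost certainly a different slot

    # Hash tags ({...}) force related keys into the same slot - the usual
    # workaround for multi-key operations in a cluster.
    rc.set("{user:1}:name", "alice")
    rc.set("{user:1}:email", "alice@example.com")
    print(rc.mget("{user:1}:name", "{user:1}:email"))
    ```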