Stale data on secondary while adding a new member with majority read concern

Given a basic replica set composed of 1 primary + 1 secondary + 1 arbiter, running MongoDB 4.0 with majority read concern enabled, we ran into a strange issue that we cannot explain yet.

We wanted to add a new secondary to the replica set. So we started a new VM, installed and configured mongod, and then added the node to the replica set. The new member appeared with the status STARTUP2 and was synchronizing with the existing cluster. So far so good.

However, we noticed something: one of our applications that reads from secondaries (readPref: secondaryPreferred, readConcern: majority) was reading stale data (dating from the moment the synchronization started). Looking at rs.status(), the lastCommittedOpTime was indeed stuck in the past. As expected with this kind of behavior, the WiredTiger cache usage on the primary started to increase, reached the 15~20% zone, and began to slow down the primary.

We ended up solving the issue by declaring the member as hidden while it was synchronizing, but why did it happen?

My understanding is that data is not committed into the "main" zone until a majority of members have acknowledged it. But with 3 data members (primary + secondary + new secondary), the majority should have been met by the existing members. Why did the addition of the new member cause this behavior?

Solution

"Majority" means majority of the voting nodes.

In the original cluster the Primary, Secondary, and Arbiter each had 1 vote, so the majority was 2. A write was considered majority committed as soon as it was written to the primary and secondary.

Once a new node was added, there were 4 votes: Primary, original Secondary, new Secondary, and Arbiter.

This means that 2 was no longer a majority; it was only half. Therefore, in order to be considered majority committed, a write would have to be written to 3 voting nodes. Since the arbiter holds no data and cannot acknowledge writes, that means the primary and both secondaries.
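The arithmetic above can be written out as a minimal sketch. The function names are made up for illustration and are not part of any MongoDB API; the only assumptions are that every member has one vote and that an arbiter votes but never acknowledges a write.

```python
def majority(voting_members: int) -> int:
    """Smallest number of votes that is strictly more than half."""
    return voting_members // 2 + 1


def can_majority_commit(voting_members: int, caught_up_data_members: int) -> bool:
    """True if enough data-bearing voters can acknowledge a write.

    Arbiters count toward voting_members but never toward
    caught_up_data_members, because they hold no data.
    """
    return caught_up_data_members >= majority(voting_members)


# Original set: primary + secondary + arbiter = 3 votes, 2 data-bearing.
print(majority(3), can_majority_commit(3, 2))  # 2 True

# After adding a secondary that is still in initial sync:
# 4 votes, but still only 2 data-bearing members able to acknowledge.
print(majority(4), can_majority_commit(4, 2))  # 3 False
```

With 4 votes the required count rises to 3 while the number of members able to acknowledge stays at 2, which is why the commit stalls.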

Each replicated node keeps the operations log in the same order, and notes the most recent operation that is known to have been committed to a majority of the nodes. This is called the majority commit point. Any read that requests majority read concern will use a snapshot of the data as of the majority commit point.
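As a rough illustration of how the majority commit point behaves, here is a simplified model, not MongoDB's actual implementation. It assumes oplog positions can be represented as plain integers, that a member in initial sync reports position 0, and that the commit point never moves backward. The function name and the numbers are hypothetical.

```python
from typing import Sequence


def advance_commit_point(current: int, applied: Sequence[int], voting_members: int) -> int:
    """Return the new majority commit point in a simplified model.

    applied holds the last applied oplog position of each data-bearing
    voting member (arbiters are left out because they apply nothing).
    The candidate is the newest position that a majority of voters
    have reached; the commit point only ever moves forward.
    """
    needed = voting_members // 2 + 1
    if len(applied) < needed:
        return current  # not enough data-bearing voters at all
    candidate = sorted(applied, reverse=True)[needed - 1]
    return max(current, candidate)


# 3 voters (primary, secondary, arbiter): both data members are at 100.
point = advance_commit_point(0, [100, 100], voting_members=3)
print(point)  # 100

# New secondary added and still in initial sync (modeled as position 0):
# 4 voters, so 3 acknowledgements are needed and the point stays put.
point = advance_commit_point(point, [105, 105, 0], voting_members=4)
print(point)  # 100

# Once the new member has caught up to 108, the point moves again.
point = advance_commit_point(point, [110, 110, 108], voting_members=4)
print(point)  # 108
```

In this model a majority read keeps returning data as of position 100 for the whole duration of the initial sync, which matches the stuck lastCommittedOpTime described in the question.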

Once the new node was added, majority writes would not be able to complete until after it had finished initial sync and begun to apply the oplog. Until that point, all majority reads would be as of the most recent majority commit point, right before that node was added.

Simple fix: Remove the arbiter. This will return the cluster to 3 voting nodes, and majority writes will be able to complete with only the primary and one secondary acknowledging the write.
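In the mongo shell, removing a member is normally done with rs.remove("host:port"). To show what that changes in the replica set configuration document, here is a small sketch that works on a plain dictionary only and does not talk to a server. The helper names, the set name and the hostnames are hypothetical. The sketch also bumps the version field, on the assumption that a new configuration is expected to carry a higher version than the current one.

```python
import copy


def voting_count(config: dict) -> int:
    """Sum the votes in a config document (votes defaults to 1 per member)."""
    return sum(m.get("votes", 1) for m in config["members"])


def without_arbiters(config: dict) -> dict:
    """Return a copy of the config with arbiters removed and version bumped."""
    new_config = copy.deepcopy(config)
    new_config["members"] = [
        m for m in new_config["members"] if not m.get("arbiterOnly", False)
    ]
    new_config["version"] = config["version"] + 1
    return new_config


# Hypothetical config after the new secondary was added.
config = {
    "_id": "rs0",
    "version": 4,
    "members": [
        {"_id": 0, "host": "db1.example.net:27017"},
        {"_id": 1, "host": "db2.example.net:27017"},
        {"_id": 2, "host": "arb.example.net:27017", "arbiterOnly": True},
        {"_id": 3, "host": "db3.example.net:27017"},
    ],
}

new_config = without_arbiters(config)
print(voting_count(config), voting_count(new_config))  # 4 3
```

With 3 votes left, the majority is 2 again, so the primary and the already synced secondary are enough to advance the commit point while the new member finishes its initial sync.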