mongodb time replication database-replication

MongoDB SECONDARY becoming RECOVERING at nighttime

I am running a conventional MongoDB Replica Set consisting of 3 members (member1 in datacenter A, member2 and member3 in datacenter B). member1 is the current PRIMARY and I am adding members 2 and 3 via rs.add(). They are performing their initial sync and become SECONDARY very soon. Everything is fine all day long and the replication delay of both members is 0 seconds until 2 AM at nighttime.

Now: Every night at 2 AM both members shift into the RECOVERING state and stop replication at all, which leads to a replication delay of hours when I am having a look into rs.printSlaveReplicationInfo() in the morning hours. At around 2 AM there are no massive inserts or maintenance tasks known to me.

I get the following log entries on the PRIMARY:

2015-10-09T01:59:38.914+0200 [initandlisten] connection accepted from 192.168.227.209:59905 #11954 (37 connections now open)
2015-10-09T01:59:55.751+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.869+0200 [conn11111] warning: Collection dropped or state deleted during yield of CollectionScan
2015-10-09T01:59:55.870+0200 [conn11111] getmore local.oplog.rs cursorid:1155433944036 ntoreturn:0 keyUpdates:0 numYields:1 locks(micros) r:32168 nreturned:0 reslen:20 134ms
2015-10-09T01:59:55.872+0200 [conn11111] end connection 192.168.227.209:58972 (36 connections now open)

And, which is more interesting, I get the following log entries on both SECONDARYs:

2015-10-09T01:59:55.873+0200 [rsBackgroundSync] repl: old cursor isDead, will initiate a new one
2015-10-09T01:59:55.873+0200 [rsBackgroundSync] replSet syncing to: member1:27017
2015-10-09T01:59:56.065+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up, at least from member1:27017
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet our last optime : Oct  9 01:59:23 5617035b:17f
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet oldest at member1:27017 : Oct  9 01:59:23 5617035b:1af
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet error RS102 too stale to catch up
2015-10-09T01:59:56.066+0200 [rsBackgroundSync] replSet RECOVERING

Which is also striking - the start of the oplog "resets" itself every night at around 2 AM:

configured oplog size:   990MB
log length start to end: 19485secs (5.41hrs)
oplog first event time:  Fri Oct 09 2015 02:00:33 GMT+0200 (CEST)
oplog last event time:   Fri Oct 09 2015 07:25:18 GMT+0200 (CEST)
now:                     Fri Oct 09 2015 07:25:26 GMT+0200 (CEST)

I am not sure if this is somehow correlated to the issue. I am also wondering that such a small delay (Oct 9 01:59:23 5617035b:17f <-> Oct 9 01:59:23 5617035b:1af) lets the members become stale.

Could this also be a server (VM host) time issue or is it something completely different? (Why is the first oplog event being "resetted" every night and not "shifting" to a timestamp like NOW minus 24 hrs?) What can I do to investigate and to avoid?

Solution

Upping the oplog size should solve this (per our comments).

Some references for others who run into this issue

Workloads that Might Require a Larger Oplog Size
Error: replSet error RS102 too stale to catch up link1 & link2