MongoDB initial sync on a large database

We are running a MongoDB replica set with three nodes. The database is quite large: 2+ billion records occupying 700 GB on disk (WiredTiger storage engine). The workload is mostly inserts (several million per day), followed by reads and updates.

After we replaced a disk on a secondary member, its data folder was empty and an initial sync started. According to the logs, it took about 7 hours to copy the records and then 30 hours to build the indexes. That was far longer than the oplog could cover, given all the records that were inserted or updated in the meantime:

2016-11-16T23:32:03.503+0100 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2016-11-16T23:32:03.503+0100 I REPL     [rsBackgroundSync] our last optime : (term: 46, timestamp: Nov 15 10:03:15:8c)
2016-11-16T23:32:03.503+0100 I REPL     [rsBackgroundSync] oldest available is (term: 46, timestamp: Nov 15 17:37:57:30)
2016-11-16T23:32:03.503+0100 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
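
The two timestamps in these log lines give the size of the gap directly. A small sketch of the arithmetic, with the values copied from the log (the year is taken from the log line prefix):

```python
from datetime import datetime

# Timestamps copied from the rsBackgroundSync log lines (year from the log prefix).
last_applied = datetime(2016, 11, 15, 10, 3, 15)       # "our last optime"
oldest_available = datetime(2016, 11, 15, 17, 37, 57)  # "oldest available is"

# A member can only catch up while its last optime is still inside
# the sync source's oplog, i.e. not older than the oldest entry.
gap = oldest_available - last_applied
is_stale = last_applied < oldest_available

print(f"missing oplog range: {gap}, stale: {is_stale}")
# prints "missing oplog range: 7:34:42, stale: True"
```

So the sync source had already discarded about seven and a half hours of operations that this member still needed.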

First we restarted this member and a re-sync started:

2016-11-16T23:47:22.974+0100 I REPL     [rsSync] initial sync pending
2016-11-16T23:47:22.974+0100 I REPL     [ReplicationExecutor] syncing from: x3:27017
2016-11-16T23:47:23.219+0100 I REPL     [rsSync] initial sync drop all databases
2016-11-16T23:47:23.219+0100 I STORAGE  [rsSync] dropAllDatabasesExceptLocal 5
2016-11-16T23:53:09.014+0100 I REPL     [rsSync] initial sync clone all databases

Looking at the data folder, all the files were erased and then started to grow again. But after some 8 hours it had resynced barely 5% of the database.

What approach should we use for such large syncs?

We thought about increasing the oplog size, but that would require downtime for the entire replica set. What approaches can we use without downtime?
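
As a rough way to reason about how large the oplog would have to be: its window has to exceed the full duration of the initial sync, about 7 + 30 = 37 hours here. The sketch below assumes a constant write rate, so that the window scales linearly with the oplog size. The function name, the 50 GB figure and the safety factor are hypothetical. The 30-hour window is only an approximate reading of the log above (written on Nov 16 at 23:32, oldest entry Nov 15 at 17:37). Substitute the values that `rs.printReplicationInfo()` reports on the sync source.

```python
def required_oplog_gb(current_size_gb, current_window_hours, sync_hours, safety_factor=1.5):
    """Estimate the oplog size needed to cover an initial sync.

    Assumes a constant write rate, so the oplog window grows
    linearly with the oplog size. safety_factor adds headroom.
    """
    gb_per_hour = current_size_gb / current_window_hours
    return gb_per_hour * sync_hours * safety_factor


# 7 h of cloning plus 30 h of index builds, as reported above.
sync_hours = 7 + 30

# Hypothetical current oplog: 50 GB covering roughly 30 hours of writes.
estimate = required_oplog_gb(current_size_gb=50, current_window_hours=30,
                             sync_hours=sync_hours)
print(f"oplog should be at least ~{estimate:.1f} GB")
```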

Solution

The best solution is to use a file system snapshot, if possible.

You can snapshot a running mongod node directly, as long as the oplog files are on the same volume as the rest of the data files. There is no need to shut it down or take any other proactive steps.
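
One way to sanity-check the "same volume" condition before relying on a snapshot is to compare the device IDs of everything under the data directory. This is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes that a shared device ID (`st_dev`) means the files are covered by the same snapshot, which depends on how your volumes are set up.

```python
import os


def devices_under(path):
    """Return the set of device IDs holding `path` and everything below it.

    Symlinks are followed, so a directory linked to another volume shows up.
    """
    devices = {os.stat(path).st_dev}
    for root, dirs, files in os.walk(path, followlinks=True):
        for name in dirs + files:
            try:
                devices.add(os.stat(os.path.join(root, name)).st_dev)
            except OSError:
                # Broken symlink or a file removed while scanning: skip it.
                continue
    return devices


def on_single_volume(dbpath):
    """True if every file under dbpath lives on one device."""
    return len(devices_under(dbpath)) == 1


# Example (the path is an assumption, use your own dbPath):
# print(on_single_volume("/var/lib/mongodb"))
```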

How to restore from a snapshot:

Then you just copy those files to the new node's data directory and start mongod.
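
The copy step could look roughly like the sketch below. It assumes the snapshot is already mounted somewhere readable and that mongod is stopped on the target node. The function name and both paths are hypothetical. File ownership is not handled here, so you may need to fix it for the mongod user afterwards.

```python
import shutil
from pathlib import Path


def restore_from_snapshot(snapshot_dir, dbpath):
    """Copy a snapshot of a data directory into an empty dbpath.

    Refuses to overwrite an existing data set. Returns the copied
    files as paths relative to dbpath.
    """
    snapshot_dir, dbpath = Path(snapshot_dir), Path(dbpath)
    dbpath.mkdir(parents=True, exist_ok=True)
    if any(dbpath.iterdir()):
        raise RuntimeError(f"{dbpath} is not empty, refusing to overwrite")
    # dirs_exist_ok requires Python 3.8+; metadata is preserved via copy2.
    shutil.copytree(snapshot_dir, dbpath, dirs_exist_ok=True)
    return sorted(p.relative_to(dbpath).as_posix()
                  for p in dbpath.rglob("*") if p.is_file())


# Example (both paths are assumptions):
# restore_from_snapshot("/mnt/mongo-snapshot", "/var/lib/mongodb")
```

After the copy, start mongod on the new node with its usual replica set options. It then only has to replay the oplog entries written since the snapshot was taken.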

If a file system snapshot is not possible, you need some other way to copy the data directory of a working mongod, which is easy if you can afford downtime. If you cannot, you can add a few (two) arbiters and stop the other secondary for a moment to copy its data directory. Of course, during that time your replica set is basically a "one-node RS".
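
The arbiter suggestion comes down to vote counting: a primary stays primary only while it can reach a strict majority of the voting members. A small sketch of that arithmetic (the function name is hypothetical):

```python
def has_majority(voting_members, reachable_voters):
    """A primary needs a strict majority of all voting members."""
    return reachable_voters > voting_members // 2


# Without arbiters: 3 voting members. The resyncing node is down and the
# other secondary is stopped, so only the primary itself is reachable.
print(has_majority(3, 1))  # False: the primary steps down

# With two arbiters: 5 voting members. The primary plus both arbiters
# are reachable, which is 3 of 5.
print(has_majority(5, 3))  # True: the primary keeps accepting writes
```

Remember to remove the arbiters again once both secondaries are back, so the set returns to its normal three voting members.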