
CouchDB replication ignoring sporadic documents


I've got a CouchDB setup (CouchDB 2.1.1) for my app, which relies heavily on replication integrity. We are using the "one db per user" approach, with an additional layer of "role" databases that group users, as in the image below.

Recently, while increasing the number of beta testers, we discovered that some documents had not been replicated as they should have been. We are unable to see any pattern in document size, creation/update time, user, or anything else. The errors seem to happen sporadically, with 2-3 successfully replicated docs followed by 4-6 non-replicated docs.

The server responds with {"error":"not_found","reason":"missing"} on those docs.

Most (but not all) of the user documents have been replicated to the corresponding Role DB, but very few made it all the way to the Master DB. This never happened when testing with < 100 documents (now we're at 1000-1200 docs in the db).

I discovered a problem with the "max open files" setting mentioned in the Performance chapter of the docs and fixed it, but the documents that failed to replicate are still not replicating. If I open one of them and save it again, it does replicate.

This is my current theory:

  1. The replication process tried to copy new documents when the user went online
  2. The write process failed because Linux's "max open files" limit had been reached
  3. The master DB still thinks the replication was successful
  4. At a later replication, the master DB ignores those old documents and only tries to replicate new ones

Could this be correct? And can I somehow make the CouchDB server "double check" all documents and the integrity of previous replications?
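
One way to "double check" by hand would be to compare the `_all_docs` listing of a source database with that of its target. The following is only a minimal sketch: the function names are hypothetical, authentication is left out, and it assumes every document in the source (design documents included) is supposed to reach the target with the same winning revision, which will not hold if filters or conflicts are involved.

```python
import json
import urllib.request


def fetch_all_docs(db_url):
    """GET <db_url>/_all_docs and return the parsed JSON body.

    Authentication is omitted here; add it as your setup requires.
    """
    with urllib.request.urlopen(db_url.rstrip("/") + "/_all_docs") as resp:
        return json.load(resp)


def rows_to_revs(all_docs):
    """Map document ID -> current revision from an _all_docs response."""
    return {row["id"]: row["value"]["rev"] for row in all_docs["rows"]}


def find_unreplicated(source_revs, target_revs):
    """Return (missing, stale) lists of document IDs.

    missing: IDs present in the source but absent from the target.
    stale:   IDs present in both whose revisions differ.
    """
    missing = sorted(doc_id for doc_id in source_revs if doc_id not in target_revs)
    stale = sorted(
        doc_id
        for doc_id, rev in source_revs.items()
        if doc_id in target_revs and target_revs[doc_id] != rev
    )
    return missing, stale
```

With the two listings fetched (for example `rows_to_revs(fetch_all_docs(role_db_url))` for a Role DB and the same for the Master DB, where `role_db_url` is a placeholder), `find_unreplicated` gives the IDs that would need to be re-saved or re-replicated.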

Thank you for your time and any helpful comments!

[Image: Couch replication schema]


Solution

  • I have experienced something similar in the past: when I attempted to replicate documents without sufficient permissions, the replication failed, as it should. But after the permissions issue was fixed, the documents I had tried to replicate still could not be replicated, although editing and saving them fixed the issue. I wonder if this is due to checkpoints? The CouchDB manual says this about the "use_checkpoints" flag:

    Disabling checkpoints is not recommended as CouchDB will scan the Source database’s changes feed from the beginning.

    Scanning from the beginning sounds like it might fix the problem, though, so perhaps disabling checkpoints could help. I never got back to that issue at the time, so I am afraid this is not a proper answer, just a suggestion.
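
To try the suggestion, a one-off replication could be started with checkpoints turned off. This is an untested sketch: it assumes `use_checkpoints` is accepted in the body of a `POST /_replicate` request as well as in the `[replicator]` config section (check the docs for your version), and the helper names and URLs are placeholders, not part of any CouchDB client library.

```python
import json
import urllib.request


def build_replication_doc(source, target, use_checkpoints=False):
    """Build the JSON body for a replication request.

    use_checkpoints=False is the assumption being tested here: it should
    make CouchDB scan the source's changes feed from the beginning.
    """
    return {"source": source, "target": target, "use_checkpoints": use_checkpoints}


def submit_replication(server_url, doc):
    """POST the replication body to <server_url>/_replicate.

    Authentication is omitted; the endpoint may require admin rights.
    """
    req = urllib.request.Request(
        server_url.rstrip("/") + "/_replicate",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `submit_replication("http://localhost:5984", build_replication_doc("http://localhost:5984/role_db", "http://localhost:5984/master_db"))`, with the database names swapped for real ones. Since a full rescan is slow on large databases, this is probably best run once as a repair step rather than left on permanently.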