How to decide what exceptions are worth retrying when reading and writing to MongoDB (C# driver)?


By looking at this official documentation it seems that there are basically three types of errors thrown by the MongoDB C# driver:

  • errors thrown when the driver is not able to select or connect to a server to issue the query against. These errors lead to a TimeoutException
  • errors thrown when the driver has successfully selected a server to run the query against, but the server goes down while the query is being executed. These errors manifest themselves as a MongoConnectionException
  • errors thrown during a write operation. These errors lead to a MongoWriteException or a MongoBulkWriteException, depending on the type of write operation being performed.

I'm trying to make my software using MongoDB a bit more resilient to transient errors, so I want to find out which exceptions are worth retrying.

The problem is not implementing a solid retry policy (I usually employ Polly .NET for that), but instead understanding when the retry makes sense.

I think that retrying on exceptions of type TimeoutException doesn't make sense, because the driver itself already waits before timing out an operation (the default is 30 seconds, but you can change that via the connection string options). Retrying an operation that has already waited 30 seconds before timing out is probably a waste of time. For instance, if you implement 3 retries with 1 second of waiting time between them, it takes up to 123 seconds to fail an operation (4 attempts × 30 seconds + 3 waits × 1 second). That is a very long time.
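As a sanity check on that worst case, the following sketch counts the initial attempt plus each retry at the full timeout, with one wait before each retry. `WorstCaseFailureTime` is a name made up for this illustration; the 30-second timeout and the 1-second delay are the figures from the example above.

```csharp
using System;

// Worst-case time until the caller sees the final failure: every attempt
// runs into the full timeout, and each retry is preceded by one wait.
// (Hypothetical helper, not part of the MongoDB driver or Polly.)
static TimeSpan WorstCaseFailureTime(TimeSpan timeoutPerAttempt, int retries, TimeSpan delayBeforeRetry)
{
    int attempts = retries + 1; // the initial attempt plus the retries
    return timeoutPerAttempt * attempts + delayBeforeRetry * retries;
}

TimeSpan worstCase = WorstCaseFailureTime(TimeSpan.FromSeconds(30), 3, TimeSpan.FromSeconds(1));
Console.WriteLine(worstCase.TotalSeconds); // 4 * 30 + 3 * 1 = 123
```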

As documented here, retrying on MongoConnectionException is only safe when doing idempotent operations. From my point of view, it makes sense to always retry on this kind of error, provided that the performed operation is idempotent.

The hard bit in deciding a good retry strategy for writes is when you get an exception of type MongoWriteException or MongoBulkWriteException.

Regarding exceptions of type MongoWriteException, it is probably worth retrying all those having a ServerErrorCategory other than DuplicateKey. As documented here, you can detect duplicate key errors by using this property of the MongoWriteException.WriteError object.

Retrying duplicate key errors probably doesn't make sense because you will get them again (that's not a transient error).

I have no idea how to handle errors of type MongoBulkWriteException safely. In that case you are inserting multiple documents into MongoDB, and it is entirely possible that only some of them have failed while the others have been successfully written. So retrying the exact same bulk insert operation could write the same document twice (bulk writes are not idempotent by nature). How can I handle this scenario?

Do you have any suggestions?

Do you know of any working example or reference regarding retrying queries on MongoDB with the C# driver?


Solution

  • Retry

    Let's start with the basics of Retry.

    There are situations where the operation you request relies on a resource that might not be reachable at a certain point in time. In other words, there can be a temporary issue which will vanish sooner or later. This sort of issue causes transient failures. With retries you can overcome these problems by attempting the same operation again at a later moment. To be able to use this mechanism, the following criteria should be met:

    • The potentially introduced observable impact is acceptable
    • The operation can be redone without any irreversible side effect
    • The introduced complexity is negligible compared to the promised reliability

    Let’s review them one by one:

    • The word failure indicates that the effect is observable by the requester as well, for example via higher latency or reduced throughput. If the "penalty" (delay or reduced performance) is unacceptable, then retry is not an option for you.
    • This requirement is also known as idempotency. If I call the action with the same input several times, it will produce the exact same result. In other words, the operation acts as if it depends only on its parameters and nothing else influences the result (like other objects' state).
    • Even though this condition is one of the most crucial, it is the one that is almost always forgotten. As always there are trade-offs (if I introduce Z then it will increase X but it might decrease Y) and we should be fully aware of them. Otherwise they will give us unwelcome surprises at the least expected time.
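To illustrate the mechanism itself, here is a minimal hand-rolled sketch with no external library. `RetryAsync`, its parameters, and the simulated failure are all made up for this example; in real code a library such as Polly would normally take this role. The `isTransient` predicate is where the "is this worth retrying?" decision lives.

```csharp
using System;
using System.Threading.Tasks;

// Runs the operation and retries it only while isTransient(exception) is true,
// at most maxRetries times, waiting a fixed delay before each retry.
// Any other exception, or the last failed attempt, propagates to the caller.
static async Task<T> RetryAsync<T>(
    Func<Task<T>> operation,
    Func<Exception, bool> isTransient,
    int maxRetries,
    TimeSpan delay)
{
    for (int attempt = 0; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (Exception ex) when (attempt < maxRetries && isTransient(ex))
        {
            await Task.Delay(delay);
        }
    }
}

// Usage: an operation that fails twice with a simulated transient error
// and succeeds on the third call.
int calls = 0;
int result = await RetryAsync(
    () =>
    {
        calls++;
        if (calls < 3) throw new TimeoutException("simulated transient failure");
        return Task.FromResult(42);
    },
    ex => ex is TimeoutException,
    maxRetries: 3,
    delay: TimeSpan.FromMilliseconds(10));

Console.WriteLine($"{result} after {calls} calls"); // 42 after 3 calls
```

The operation passed in has to satisfy the second criterion above: since it may run several times, running it more than once must not cause irreversible side effects.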

    Mongo Exception

    Let's continue with the exceptions that MongoDB's C# client can throw.

    I haven't used MongoDB in the last couple of years, so this knowledge may be outdated. But I hope the essence has not changed since.

    I would also encourage you to introduce detection logic first (catch and log) before you try to mitigate the problem (for example with retry). This will tell you how often the errors occur and how many of them there are. It will also give you insight into the nature of the problems.

    • MongoConnectionException with a SocketException as InnerException
      • When:
        • There is a server selection problem
        • The connection has timed out
        • The chosen server is unavailable
      • Retry:
        • If the problem is due to a network issue then it might be useful to retry
        • If the root cause is misconfiguration then retry won't help
      • Log:
    • MongoWriteException or MongoWriteConcernException
      • When:
        • There was a persistence problem
      • Retry:
        • It depends. If you perform a create operation and the server can detect duplicates (DuplicateKeyError), then it is better to attempt the write several times than to settle for a single failed write attempt
        • Most of the time updates are not idempotent, but if you use some sort of record versioning then you can retry and let a stale attempt fail on the optimistic locking check
        • Deletion can be implemented in an idempotent way. This is true for soft and hard deletes alike.
      • Log:
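The list above could be wired into a retry policy roughly as follows. This is only a sketch under assumptions: it uses Polly's classic `Policy` API (v7 style) and the MongoDB.Driver package, `IsWorthRetrying` is a made-up helper, and the retry count and backoff values are arbitrary. Treat the classification as a starting point to adjust once your own logs show which errors actually occur.

```csharp
using System;
using MongoDB.Driver;
using Polly;

// Hypothetical classification that mirrors the list above.
static bool IsWorthRetrying(Exception ex) => ex switch
{
    // Authentication failures derive from MongoConnectionException but point to
    // misconfiguration, so a retry won't help.
    MongoAuthenticationException => false,
    // Possibly a network issue: retry (only safe for idempotent operations).
    MongoConnectionException => true,
    // A duplicate key error is not transient; the same write would fail again.
    MongoWriteException w => w.WriteError?.Category != ServerErrorCategory.DuplicateKey,
    _ => false,
};

// Up to 3 retries with exponential backoff (400 ms, 800 ms, 1600 ms).
// The onRetry callback is the "catch and log" step; Console stands in for a real logger.
var retryPolicy = Policy
    .Handle<Exception>(IsWorthRetrying)
    .WaitAndRetryAsync(
        3,
        attempt => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)),
        (exception, delay) => Console.WriteLine(
            $"Retrying in {delay.TotalMilliseconds} ms after {exception.GetType().Name}: {exception.Message}"));

// Usage with a hypothetical collection and filter; a delete is used here
// because it can be repeated safely:
// await retryPolicy.ExecuteAsync(() => collection.DeleteOneAsync(filter));
```

Keeping the decision in a single predicate means the logging and the retry rules can evolve independently of the call sites.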