c# azure .net-core azure-cosmosdb

How to control timeouts, retries and delays between retries in the Cosmos SDK?


I am using the recommended SDK, Microsoft.Azure.Cosmos.

I want to put a limit on the amount of time that Cosmos takes to create a document. For example, I want all of the timeouts, retries and delays between retries to complete within 5s in the worst case, that is, when every retry times out.

I am able to do this with the Service Bus SDK and when I call other APIs I can wrap things in Polly and easily control these parameters.

The reason I want to do this is because the caller of my service will time out after a set time. At that time it will cancel the request. I would like to configure my application so that the majority of the time it completes all retries before the request is canceled. Obviously, I can propagate a cancellation token but that is not as tidy as getting all the work done within a given window.

Most of the retry options in the Cosmos SDK are around the specific case of throttling. I am also conscious that some requests will be sent over HTTP and some will be sent over TCP in direct mode.

Example of code I'm using in Program.cs in a .NET 8 project to create the Cosmos client:

CosmosClient cosmosClient = new(
    connectionString: "<connection-string-goes-here>",
    new CosmosClientOptions
    {
        ApplicationPreferredRegions = new List<string> { Regions.SoutheastAsia }
        // I would expect there to be options here for controlling timeouts and retries
    }
);

Solution

  • Reference: https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-request-timeout?tabs=cpu-new#customize-the-timeout-on-the-azure-cosmos-db-net-sdk

    CancellationToken

    All the async operations in the SDK have an optional CancellationToken parameter. This CancellationToken parameter is used throughout the entire operation, across all network requests and retries. In between network requests, the cancellation token might be checked and an operation canceled if the related token is expired. The cancellation token should be used to define an approximate expected timeout on the operation scope.

    Sounds like what you are looking for is a CancellationToken.
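
As a minimal sketch of the approach described above, the snippet below caps a single create operation at roughly 5 seconds by linking the caller's token with a local deadline. The `Order` record, the `OrderWriter.TryCreateAsync` helper and the `/customerId` partition key are hypothetical names introduced for illustration; only `Container.CreateItemAsync`, `PartitionKey` and the `CancellationTokenSource` members come from the SDK and the base class library.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical document type. The lowercase "id" property matches what the
// SDK's default serializer expects for the document id.
public record Order(string id, string customerId);

public static class OrderWriter
{
    // Hypothetical helper: bounds the whole create operation (every network
    // request and retry it spans) to approximately 5 seconds, or less if the
    // caller cancels first.
    public static async Task<bool> TryCreateAsync(
        Container container, Order order, CancellationToken callerToken)
    {
        // Cancel when either the caller cancels or the 5-second budget runs out.
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        cts.CancelAfter(TimeSpan.FromSeconds(5));

        try
        {
            // Assumes the container is partitioned on /customerId.
            await container.CreateItemAsync(
                order,
                new PartitionKey(order.customerId),
                cancellationToken: cts.Token);
            return true;
        }
        catch (OperationCanceledException)
        {
            // The budget was exhausted or the caller gave up. The write may or
            // may not have been applied, so the caller should treat the
            // outcome as unknown.
            return false;
        }
    }
}
```

Because the token is only checked between network requests, the 5 seconds here is an approximate bound rather than a hard one.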

    Be advised, however, that for ideal behavior the timeout on the CancellationToken should be longer than any configured timeouts (from the same document):

    ... the configured time in your CancellationToken, make sure that it's greater than your RequestTimeout and the CosmosClientOptions.OpenTcpConnectionTimeout (if you're using Direct mode).

    For example, if your CancellationToken is set to cancel after 5 seconds, the RequestTimeout and OpenTcpConnectionTimeout (if using Direct mode) should be set to lower values.
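
A sketch of what that could look like in the client setup from the question, assuming a 5-second end-to-end budget. The specific values are illustrative assumptions, not recommendations; the option names (`RequestTimeout`, `OpenTcpConnectionTimeout`, `MaxRetryAttemptsOnRateLimitedRequests`, `MaxRetryWaitTimeOnRateLimitedRequests`) are existing `CosmosClientOptions` properties.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

CosmosClient cosmosClient = new(
    connectionString: "<connection-string-goes-here>",
    new CosmosClientOptions
    {
        ApplicationPreferredRegions = new List<string> { Regions.SoutheastAsia },

        // Applies to each individual network request, not to the operation
        // end to end. Kept below the assumed 5-second cancellation budget.
        RequestTimeout = TimeSpan.FromSeconds(2),

        // Only relevant in Direct mode: time allowed to open a TCP connection.
        OpenTcpConnectionTimeout = TimeSpan.FromSeconds(1),

        // These two only govern retries on throttled (rate-limited) requests.
        MaxRetryAttemptsOnRateLimitedRequests = 2,
        MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(2)
    }
);
```

None of these settings is an overall deadline on its own; the CancellationToken passed to each operation is what bounds the total time.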

    Also remember that operations in Cosmos DB can take up to 5 seconds to execute. Depending on the workload, a 5-second end-to-end timeout might be too short: it might be fine for point operations, but queries can take longer.