Abort WaitAndRetryAsync policy?

I'd like to use WaitAndRetryAsync to retry HTTP 429 (throttling) errors. The retry delay is returned as a property on the exception itself. But I also need to track the accumulated wait time and abandon the retry loop if the overall duration exceeds a certain amount.

policy = Policy.Handle<DocumentClientException>(ex => ex.StatusCode == (HttpStatusCode)429)
    .WaitAndRetryAsync(
        retryCount: retries,
        sleepDurationProvider: (retryCount, exception, context) => {
            DocumentClientException dce = exception as DocumentClientException;

            // Here I would like to check the total time and NOT return a RetryAfter value if my overall time is exceeded. Instead re-throw the 'exception'.

            return dce.RetryAfter;
        },
        onRetryAsync: async (res, timespan, retryCount, context) => {
        });

When the overall time is exceeded I'd like to re-throw the 'exception' handled in the sleepDurationProvider.

Is there a better way to handle this?

Solution

The first example limits the cumulative wait between retries to a total timespan myWaitLimit, but takes no account of how long each call to Cosmos DB runs before it throws DocumentClientException. Because Polly Context is scoped to a single execution, keeping the running total there is thread-safe. Something like:

policy = Policy.Handle<DocumentClientException>(ex => ex.StatusCode == (HttpStatusCode)429)
    .WaitAndRetryAsync(
        retryCount: retries,
        sleepDurationProvider: (retryCount, exception, context) => {
            DocumentClientException dce = exception as DocumentClientException;

            TimeSpan toWait = dce.RetryAfter;

            // Context stores values as object, so unbox the running total if one has been stored.
            TimeSpan waitedSoFar = context.TryGetValue("WaitedSoFar", out object stored) && stored is TimeSpan previous
                ? previous
                : TimeSpan.Zero;
            waitedSoFar = waitedSoFar + toWait;

            if (waitedSoFar > myWaitLimit)
                throw dce; // or use ExceptionDispatchInfo to preserve stack trace

            context["WaitedSoFar"] = waitedSoFar; // (magic string "WaitedSoFar" only for readability; of course you can factor this out)
            return toWait;
        },
        onRetryAsync: async (res, timespan, retryCount, context) => {
        });
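The code comment above mentions ExceptionDispatchInfo as a way to keep the original stack trace, since a plain `throw dce;` resets it to the rethrow point. A minimal sketch of that idea follows. The helper class and method names are hypothetical, and the TimeSpan return type exists only so the call can be used as the return value of a sleepDurationProvider delegate:

```csharp
using System;
using System.Runtime.ExceptionServices;

static class Rethrow
{
    // Hypothetical helper: rethrows a previously caught exception
    // while keeping the stack trace from where it was first thrown.
    public static TimeSpan PreservingStackTrace(Exception exception)
    {
        ExceptionDispatchInfo.Capture(exception).Throw();
        return TimeSpan.Zero; // never reached; only here to satisfy the compiler
    }
}
```

Inside the delegate, `throw dce;` could then become `return Rethrow.PreservingStackTrace(dce);`.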

An alternative approach could limit the overall execution time (when 429s occur) using a timing-out CancellationToken. The approach below will not retry further once the CancellationToken has been signalled. It is modelled to be close to the functionality requested in the question, but the timeout clearly only takes effect if a 429 response is returned and the sleepDurationProvider delegate is invoked.

CancellationTokenSource cts = new CancellationTokenSource();
cts.CancelAfter(/* my timeout */);

var policy = Policy.Handle<DocumentClientException>(ex => ex.StatusCode == (HttpStatusCode)429)
    .WaitAndRetryAsync(
        retryCount: retries,
        sleepDurationProvider: (retryCount, exception, context) => {
            if (cts.IsCancellationRequested) throw exception; // or use ExceptionDispatchInfo to preserve stack trace

            DocumentClientException dce = exception as DocumentClientException;
            return dce.RetryAfter;
        },
        onRetryAsync: async (res, timespan, retryCount, context) => {
        });

If you don't wish to define policy in the same scope as using it and close over the variable cts (as the above example does), you can pass the CancellationTokenSource around using Polly Context as described in this blog post.
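As a rough sketch of that Context-based variant (not taken from the blog post): the policy looks up the CancellationTokenSource in the execution's Context instead of closing over it. The class name, method names, and the "OverallTimeoutCts" key are hypothetical, and the code assumes Polly v7 and the Microsoft.Azure.DocumentDB SDK are referenced:

```csharp
using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Polly;

static class ThrottlingRetry
{
    // Hypothetical key under which the caller stores its CancellationTokenSource.
    private const string CtsKey = "OverallTimeoutCts";

    public static IAsyncPolicy CreatePolicy(int retries) =>
        Policy.Handle<DocumentClientException>(ex => ex.StatusCode == (HttpStatusCode)429)
            .WaitAndRetryAsync(
                retryCount: retries,
                sleepDurationProvider: (retryCount, exception, context) =>
                {
                    // Give up if the caller's overall time budget has already expired.
                    if (context.TryGetValue(CtsKey, out object stored)
                        && stored is CancellationTokenSource cts
                        && cts.IsCancellationRequested)
                    {
                        throw exception;
                    }

                    return ((DocumentClientException)exception).RetryAfter;
                },
                onRetryAsync: (exception, delay, retryCount, context) => Task.CompletedTask);

    public static async Task ExecuteWithBudgetAsync(IAsyncPolicy policy, Func<Task> operation, TimeSpan budget)
    {
        // The token source cancels itself once the budget has elapsed.
        using (var cts = new CancellationTokenSource(budget))
        {
            var context = new Context();
            context[CtsKey] = cts;
            await policy.ExecuteAsync(ctx => operation(), context);
        }
    }
}
```

Because each execution gets its own Context, the policy instance can be created once and shared, while every call carries its own time budget.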

Alternatively, Polly provides a TimeoutPolicy. Using PolicyWrap you can wrap this outside the retry policy. A timeout can then be imposed on the overall execution whether a 429 occurs or not.

If the policy is intended to manage Cosmos DB async calls that do not inherently take a CancellationToken, you would need to use TimeoutStrategy.Pessimistic if you wanted to enforce the timeout at that time interval. However, note from the wiki how TimeoutStrategy.Pessimistic operates: it allows the calling thread to walk away from the uncancellable call, but doesn't unilaterally cancel the uncancellable call. That call might either later fault, or continue to completion.
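A minimal sketch of the PolicyWrap option, with a pessimistic timeout wrapped outside the retry policy. The class and method names are hypothetical, the timeout value is supplied by the caller, and the code assumes Polly v7 and the Microsoft.Azure.DocumentDB SDK are referenced:

```csharp
using System;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Polly;
using Polly.Timeout;

static class TimeLimitedRetry
{
    public static IAsyncPolicy CreatePolicy(int retries, TimeSpan overallTimeout)
    {
        var retryPolicy = Policy
            .Handle<DocumentClientException>(ex => ex.StatusCode == (HttpStatusCode)429)
            .WaitAndRetryAsync(
                retryCount: retries,
                sleepDurationProvider: (retryCount, exception, context) =>
                    ((DocumentClientException)exception).RetryAfter,
                onRetryAsync: (exception, delay, retryCount, context) => Task.CompletedTask);

        // Pessimistic, because the wrapped calls are assumed not to honour a CancellationToken.
        var timeoutPolicy = Policy.TimeoutAsync(overallTimeout, TimeoutStrategy.Pessimistic);

        // The timeout is outermost, so it spans all tries and all waits between them.
        return timeoutPolicy.WrapAsync(retryPolicy);
    }

    public static async Task RunAsync(IAsyncPolicy policy, Func<Task> operation)
    {
        try
        {
            await policy.ExecuteAsync(operation);
        }
        catch (TimeoutRejectedException)
        {
            // The overall limit elapsed; the abandoned call may still fault or complete later.
            throw;
        }
    }
}
```

When the limit is hit, the caller sees TimeoutRejectedException rather than the original DocumentClientException, which differs from the rethrow behaviour asked for in the question.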

Obviously, consider what is best from among the above options, according to your context.