Tags: c#, dropbox-api, polly, retry-logic

Polly.Contrib.WaitAndRetry to "funnel" all requests when hitting rate limit


We're using the Dropbox API wrapped in Polly to handle retries.
We have it set up with exponential back-off, as explained here.
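
The policy is set up roughly like this (a simplified sketch; Backoff is the helper from Polly.Contrib.WaitAndRetry and RateLimitException is the rate limit error type of the Dropbox .NET SDK):

    var delays = Backoff.DecorrelatedJitterBackoffV2(
        medianFirstRetryDelay: TimeSpan.FromSeconds(1),
        retryCount: 5);

    var dropboxRetryPolicy = Policy
        .Handle<RateLimitException>()
        .WaitAndRetryAsync(delays);

    // each caller wraps its own Dropbox call, for example:
    // await dropboxRetryPolicy.ExecuteAsync(() => dropboxClient.Files.ListFolderAsync(path));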

The issue we have is that we make plenty of concurrent calls.
When the API starts throwing rate limit exceptions, each individual caller backs off, but new callers still call the API and "steal" the retry slot of the callers that are already waiting.
That means that under high load we experience failed API calls and errors.

What we would like to achieve is that on rate limit errors all calls (including new callers) to the API are synchronized and wait for the rate limit to expire.
Then calls can resume (ideally in sequence to make sure the calls don't return rate limit exceptions anymore).

Is there a Polly-supported way of achieving that?


Solution

  • According to my understanding, you want the following:

    1. The downstream system can throttle incoming requests
      1.1 The system is smart enough to provide a RetryAfter time span
    2. You want to avoid flooding the downstream system if you already know that you are throttled
    3. But you don't want to lose any incoming request; rather, you prefer to process all of them eventually

    Let's put together a working example

    #1 - Downstream system

    Here we will implement a super simple mock which can mimic throttling.

    Let's start with the exception

    public class DownstreamServiceException: Exception
    {
        public TimeSpan RetryAfter { get; set; }
    }
    

    Now, let's see the service code

    public class DownstreamService
    {
        private readonly CancellationTokenSource initCompletionSignal;
        private readonly TimeSpan initDuration;
        private bool isAvailable = false;
        private DateTime initEstimatedEnd;
    
        public DownstreamService()
        {
            initDuration = TimeSpan.FromSeconds(10);
            initCompletionSignal = new CancellationTokenSource(initDuration);
            initCompletionSignal.Token.Register(() => isAvailable = true);
            initEstimatedEnd = DateTime.UtcNow.Add(initDuration);
        }
    
        public Task<string> GetAsync()
        {
            if (!isAvailable) throw new DownstreamServiceException { RetryAfter = initEstimatedEnd - DateTime.UtcNow };
            return Task.FromResult("Available");
        }
    }
    
    • For the sake of simplicity I've made the service unavailable for the first 10 seconds
    • I've used a CancellationTokenSource as a timer to make the service available
    • If GetAsync is called while the service is not available (we are throttled) it throws an exception, otherwise it returns the "Available" string (see the short usage example below)
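
    A quick way to see this behaviour (a sketch; the extra second just makes sure the init window has elapsed):

    var service = new DownstreamService();
    try
    {
        await service.GetAsync(); // within the first ~10 seconds this throws
    }
    catch (DownstreamServiceException ex)
    {
        Console.WriteLine($"Throttled, retry after {ex.RetryAfter}");
    }

    await Task.Delay(TimeSpan.FromSeconds(11)); // wait past the init window
    Console.WriteLine(await service.GetAsync()); // prints "Available"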

    #2 - Avoid flooding if the downstream is not available

    Here we will define a Circuit Breaker to short-circuit the requests if the downstream is not available (we are throttled)

    var throttledPolicy = Policy<string>
        .Handle<DownstreamServiceException>()
        .CircuitBreakerAsync(1, TimeSpan.FromSeconds(0),
            onBreak: (result, state, _, __) => {
                if (state == CircuitState.Open) return;
                Console.WriteLine("onBreak");
                throw result.Exception;
            },
            onReset: (_) => Console.WriteLine("onReset"),
            onHalfOpen: () => { });
    
    • The Circuit Breaker will transition from Closed to Open when we receive the first DownstreamServiceException
    • The duration of break (TimeSpan.FromSeconds(0)) does not matter here
      • We will control the Circuit Breaker's state from the Retry logic
    • if (state == CircuitState.Open): This will be explained under the retry section
    • And finally we re-throw the original exception (I know, I know ... it should be avoided, but it keeps our example application simple); the short sketch below walks through the breaker's flow
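
    In isolation, the flow looks roughly like this (a sketch that follows the steps above; service is the mock from #1):

    // 1st call while throttled: DownstreamServiceException trips the breaker (Closed >> Open),
    // onBreak fires with a non-Open state and re-throws the exception
    try { await throttledPolicy.ExecuteAsync(() => service.GetAsync()); }
    catch (DownstreamServiceException) { /* will be handled by the retry policy in #3 */ }

    // The retry policy then forces the Isolated state; onBreak fires again,
    // but this time state is Open, so the guard simply returns
    throttledPolicy.Isolate();

    // While Isolated, executions are short-circuited with an IsolatedCircuitException
    // Once the RetryAfter window has elapsed, Reset() moves the breaker back to Closed
    throttledPolicy.Reset();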

    #3 - Retry until eventually processed

    This is the most complicated part of the solution, because this retry policy handles multiple exceptions (DownstreamServiceException, IsolatedCircuitException) in different ways

    CancellationTokenSource throttlingEndSignal;
    var retryPolicy = Policy<string>
        .Handle<DownstreamServiceException>()
        .Or<IsolatedCircuitException>()
        .WaitAndRetryForeverAsync(_ => TimeSpan.FromSeconds(3),
            onRetry: (dr, __) =>
            {
                Console.WriteLine($"onRetry caused by {dr.Exception.GetType().Name}");
                if (dr.Exception is DownstreamServiceException dse)
                {
                    throttledPolicy.Isolate();
                    throttlingEndSignal = new(dse.RetryAfter);
                    throttlingEndSignal.Token.Register(() => throttledPolicy.Reset());
                }
            });
    
    • Let's start with the DownstreamServiceException
      • We will receive this exception because we chain the two policies together and the Circuit Breaker's onBreak delegate re-throws the received exception
      • Inside the onRetry we have a guard expression for DownstreamServiceException
      • Here we call Isolate on the Circuit Breaker, which transitions it from the Open state to the Isolated state >> this calls the onBreak delegate
      • That is why we put the if (state == CircuitState.Open) return; guard there: to avoid an infinite loop
      • We use the same timer trick here with a CancellationTokenSource: whenever the throttling ends we push the Circuit Breaker back to the Closed state (Reset)
    • The IsolatedCircuitException case is much simpler
      • We receive this exception whenever we try to perform a retry attempt while the Circuit Breaker is in the Isolated state
      • So, the Circuit Breaker short-circuits the execution and, because of the WaitAndRetryForever call, we will eventually succeed

    Put things together

    var combinedPolicy = Policy.WrapAsync(retryPolicy, throttledPolicy);
    
    var result = await combinedPolicy.ExecuteAsync(async () => await service.GetAsync());
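
    To compile the snippets the usual Polly usings are needed (listed here for completeness):

    using Polly;                 // Policy, retry and wrap syntax
    using Polly.CircuitBreaker;  // CircuitState, IsolatedCircuitException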
    

    Please note the following:

    • This solution works well with multiple concurrent requests as well, because the Circuit Breaker is shared (see the sketch below)
    • This solution is a workaround, because we cannot set the duration of break dynamically
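
    For example, several concurrent callers can share the very same policy instances (a sketch; Enumerable comes from System.Linq):

    var tasks = Enumerable.Range(0, 5)
        .Select(_ => combinedPolicy.ExecuteAsync(() => service.GetAsync()))
        .ToArray();

    // The first failure isolates the breaker, later callers are short-circuited and keep
    // retrying, and once Reset() runs every caller eventually gets "Available"
    var results = await Task.WhenAll(tasks);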

    I hope you found this little sample application useful :)