c#, semaphore, dotnet-httpclient, polly, circuit-breaker

Polly Retry - Pause all executions until a retry is successful


Currently, the Polly retry policy retries every failed request independently. So if 10 requests fail and I have configured a retry-forever policy, it sends 10 more requests on every retry round and the server never heals.

How can I asynchronously pause all failed requests, retry only one of them, and resume the normal flow once that retry succeeds?

I can't (don't want to) use Circuit Breaker because my service is a Background worker service and Circuit Breaker breaks the whole background service logic.

// Current code with only retry policy
var retry = HttpPolicyExtensions.HandleTransientHttpError().WaitAndRetryForeverAsync(retryNo => new TimeSpan(0, retryNo > 3 ? 10 : (retryNo * 2), 0));
builder.Services.AddHttpClient<TestClient>().AddPolicyHandler(retry);

Use case: I have written a background service that continuously scrapes a website containing 30000+ pages. In order to avoid overloading the site, I am using SemaphoreSlim (or Bulkhead) to limit the number of requests sent to the server at any point in time.

Still, there is a chance that the server rejects my requests. When that happens, I need to retry only one failed request until the server starts accepting my requests again. Since I am sending multiple requests at the same time, Polly retries all the failed requests, which makes the server unhappy.

Expectation:

10 requests fail -> retry 1 request (until success) -> if successful, resend the remaining 9 requests.


Solution

  • The Problem

    As I understand it, you have a single HttpClient that is used to issue N rate-limited, concurrent requests against the same downstream system.

    You want to handle the following failure scenarios:

    • If there is a transient network issue, you want to retry the individual request
    • If the downstream system gets overloaded (so most of the concurrent requests fail), you want to back off and use only a single request to probe its health

    Option A - Combine Circuit Breaker (CB) and Retry

    The Circuit Breaker policy works as a proxy. It tracks the outgoing communication, and if there are too many successive failures it prevents further requests. It does that by short-circuiting the requests: it throws a BrokenCircuitException instead of sending them.

    After a certain period of time the CB allows a single request to go out to the downstream system. If that request succeeds, the CB allows all outgoing communication again; if it fails, the CB keeps short-circuiting. Here I have detailed how the CB works.

    You can adjust your retry policy to be aware of this exception. This means that your retry attempts will still be issued but will not leave your application domain. Fortunately, in Polly you can define multiple triggers for a policy:

    HttpPolicyExtensions
       .HandleTransientHttpError()
       .Or<BrokenCircuitException>()
       .WaitAndRetryForeverAsync(retryNo => new TimeSpan(0, retryNo > 3 ? 10 : (retryNo * 2), 0));
    

    So the policy triggers on either an HttpRequestException or a BrokenCircuitException. It also triggers if the HttpStatusCode is 408 or 5xx.
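
    The delay expression passed to WaitAndRetryForeverAsync is easy to misread, so here is a small worked example of what it produces. It assumes Polly's convention that the retry number starts at 1 for the first retry; the class and method names are only illustrative.

```csharp
using System;
using System.Linq;

public static class RetryDelays
{
    // Same expression as the sleepDurationProvider in the policy above:
    // 2, 4 and 6 minutes for the first three retries, then a flat 10 minutes.
    public static TimeSpan DelayFor(int retryNo) =>
        new TimeSpan(0, retryNo > 3 ? 10 : retryNo * 2, 0);

    public static void Main()
    {
        foreach (var retryNo in Enumerable.Range(1, 5))
        {
            Console.WriteLine($"retry {retryNo}: wait {DelayFor(retryNo).TotalMinutes} min");
        }
    }
}
```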

    Now what's left is to combine the retry and circuit breaker policies into a single resilience strategy. Both policies are async, so use WrapAsync in one of the following forms:

    .AddPolicyHandler(retryPolicy.WrapAsync(cbPolicy))
    //OR
    .AddPolicyHandler(Policy.WrapAsync(retryPolicy, cbPolicy))
    

    Please be aware of the ordering. It is important to register the CB as the inner policy and the retry as the outer one, so that the CB's exception escalates to the retry. Here I have detailed this exact scenario.
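
    Putting the pieces together, the registration could look roughly like the sketch below. The circuit breaker thresholds (5 consecutive failures, a 1 minute break) are assumed values rather than anything taken from the question, AddScraperClient is a hypothetical helper name, and the empty TestClient is only a placeholder so the snippet compiles. It relies on the Polly, Polly.Extensions.Http and Microsoft.Extensions.Http.Polly packages, and uses WrapAsync because AddPolicyHandler expects an async policy.

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.CircuitBreaker;
using Polly.Extensions.Http;

public static class ResilienceRegistration
{
    // Hypothetical helper: registers the typed client with retry (outer) + CB (inner).
    public static IHttpClientBuilder AddScraperClient(this IServiceCollection services)
    {
        // Inner policy. Thresholds are assumptions; tune them for the target site.
        var cbPolicy = HttpPolicyExtensions
            .HandleTransientHttpError()
            .CircuitBreakerAsync(
                handledEventsAllowedBeforeBreaking: 5,
                durationOfBreak: TimeSpan.FromMinutes(1));

        // Outer policy. It also handles the exception thrown while the circuit is open,
        // so those attempts wait and retry without reaching the server.
        var retryPolicy = HttpPolicyExtensions
            .HandleTransientHttpError()
            .Or<BrokenCircuitException>()
            .WaitAndRetryForeverAsync(retryNo =>
                new TimeSpan(0, retryNo > 3 ? 10 : retryNo * 2, 0));

        return services
            .AddHttpClient<TestClient>()
            .AddPolicyHandler(retryPolicy.WrapAsync(cbPolicy));
    }
}

// Placeholder for the typed client from the question.
public class TestClient
{
    public TestClient(HttpClient client) { }
}
```

    Because the same cbPolicy instance is shared by all requests of the client, its state is shared too: once it opens, every concurrent request is short-circuited until the break duration elapses.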

    NOTE: If you want, you can use a different delay while the Circuit Breaker is Open. I have detailed here how you can do that by using the Context object.
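
    As a rough illustration of a circuit-aware delay, the sketch below picks the delay by inspecting the failure that triggered the retry. Note that this is a swapped-in variant and not the Context-based approach referred to above; the 30 second value and all names are assumptions. It uses the WaitAndRetryForeverAsync overload of Polly v7 whose sleep duration provider receives the retry number, the outcome, and the context.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Extensions.Http;
using Polly.Retry;

public static class CircuitAwareRetry
{
    // Assumed value; a natural choice is something close to the breaker's durationOfBreak.
    private static readonly TimeSpan DelayWhileOpen = TimeSpan.FromSeconds(30);

    // Short fixed delay when the request was short-circuited, original schedule otherwise.
    public static TimeSpan NextDelay(int retryNo, Exception lastException) =>
        lastException is BrokenCircuitException
            ? DelayWhileOpen
            : new TimeSpan(0, retryNo > 3 ? 10 : retryNo * 2, 0);

    public static AsyncRetryPolicy<HttpResponseMessage> Create() =>
        HttpPolicyExtensions
            .HandleTransientHttpError()
            .Or<BrokenCircuitException>()
            .WaitAndRetryForeverAsync(
                // outcome.Exception is null when the retry was triggered by a status code.
                (retryNo, outcome, context) => NextDelay(retryNo, outcome.Exception),
                // This overload requires an onRetryAsync callback; nothing to do here.
                (outcome, delay, context) => Task.CompletedTask);
}
```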


    Option B - Use a queue

    The above solution works fine as long as the application does not crash. If it does, you have to start the whole processing from the beginning.

    If you need to avoid this situation, you need to store your work items (the to-be-processed URLs) somewhere.

    I would suggest the following architecture:

    • Your main worker does not issue HTTP requests against the downstream system; instead it creates jobs / work items
      • It can store the work items in a database or in a persistent queue
    • Another worker fetches the jobs from the database or the queue and tries to execute the requests
      • If a request succeeds, it deletes the work item from the persistent storage
      • If a request fails, it does not delete the item and fetches a new one instead
        • Depending on your requirements, you might need to delete the item and push it to the end of the queue (a sort of re-queueing)
    • The fetch logic can be aware of the Circuit Breaker state
      • If the CB is Closed then it fetches N jobs
      • If it is Open then it fetches only one

    With this architecture you don't need an explicit retry policy, since your queue/database preserves the items that did not succeed. Your fetch logic simply retrieves the same job again until it eventually completes.
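
    The fetch logic described above can be sketched as follows. IWorkItemStore, QueueWorker and the delegate parameters are hypothetical names introduced only for illustration, and treating every non-Closed state (Open, HalfOpen, Isolated) as "fetch one" is an assumption. CircuitState is the enum from the Polly.CircuitBreaker namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Polly.CircuitBreaker;

// Hypothetical abstraction over the database or persistent queue.
public interface IWorkItemStore
{
    Task<IReadOnlyList<Uri>> PeekAsync(int count, CancellationToken ct);
    Task DeleteAsync(Uri item, CancellationToken ct);
}

public static class QueueWorker
{
    // N jobs while the circuit is Closed, a single probing job otherwise.
    public static int BatchSize(CircuitState state, int n) =>
        state == CircuitState.Closed ? n : 1;

    // One polling round: fetch a batch, process it concurrently,
    // and delete only the items that succeeded. Failed items stay stored.
    public static async Task RunOnceAsync(
        IWorkItemStore store,
        Func<CircuitState> getCircuitState,
        Func<Uri, CancellationToken, Task<bool>> tryProcessAsync,
        int n,
        CancellationToken ct)
    {
        var batch = await store.PeekAsync(BatchSize(getCircuitState(), n), ct);

        var results = await Task.WhenAll(batch.Select(async item =>
            (item, ok: await tryProcessAsync(item, ct))));

        foreach (var (item, ok) in results)
        {
            if (ok) await store.DeleteAsync(item, ct);
        }
    }
}
```

    With a Polly circuit breaker, getCircuitState could be as simple as () => cbPolicy.CircuitState, and tryProcessAsync would execute the request through that policy and return false on failure or on a BrokenCircuitException.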

    You can further extend this concept by creating a dead letter queue where you store the work items that have failed N times. That way your queue won't be polluted with "permanently" failing work items.
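
    A minimal in-memory sketch of the dead letter idea is shown below. A real implementation would keep both queues in persistent storage; RetryingQueue and its members are hypothetical names, and counting failures on the item itself is just one possible design.

```csharp
using System;
using System.Collections.Generic;

public sealed class RetryingQueue
{
    private readonly Queue<(Uri Url, int Failures)> _main = new();
    private readonly int _maxFailures;

    // Items that failed maxFailures times end up here and are no longer fetched.
    public List<Uri> DeadLetters { get; } = new();

    public int Count => _main.Count;

    public RetryingQueue(int maxFailures) => _maxFailures = maxFailures;

    public void Enqueue(Uri url) => _main.Enqueue((url, 0));

    public bool TryDequeue(out (Uri Url, int Failures) item) => _main.TryDequeue(out item);

    // Call when processing a dequeued item failed: re-queue it at the end,
    // or move it to the dead letter list once the failure limit is reached.
    public void ReportFailure((Uri Url, int Failures) item)
    {
        var failures = item.Failures + 1;
        if (failures >= _maxFailures)
        {
            DeadLetters.Add(item.Url);
        }
        else
        {
            _main.Enqueue((item.Url, failures));
        }
    }
}
```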