Search code examples
c#asp.net-core.net-corepollycircuit-breaker

Unexpected behaviour using nested Retry, and Circuit Breaker policies of Polly.Net


I coded a resilience strategy based on retry, and a circuit-breaker policies. Now working, but with a issue in its behavior.

I noticed when the circuit-breaker is on half-open, and the onBreak() event is being executed again to close the circuit, one additional retry is triggered for the retry policy (this one aside of the health verification for the half-open status).

Let me explain step by step:

I've defined two strongly-typed policies for retry, and circuit-breaker:

static Policy<HttpResponseMessage> customRetryPolicy;
static Policy<HttpResponseMessage> customCircuitBreakerPolicy;

static HttpStatusCode[] httpStatusesToProcess = new HttpStatusCode[]
{
   HttpStatusCode.ServiceUnavailable,  //503
   HttpStatusCode.InternalServerError, //500
};

Retry policy is working this way: two (2) retry per request, waiting five (5) second between each retry. If the internal circuit-breaker is open, must not retry. Retry only for 500, and 503 Http statuses.

customRetryPolicy = Policy<HttpResponseMessage>   

//Not execute a retry if the circuit is open
.Handle<BrokenCircuitException>( x => 
{
    return !(x is BrokenCircuitException);
})

//Stop if some inner exception match with BrokenCircuitException
.OrInner<AggregateException>(x =>
{
    return !(x.InnerException is BrokenCircuitException);
})

//Retry if status are:
.OrResult(x => { return httpStatusesToProcess.Contains(x.StatusCode); })

// Retry request two times, wait 5 seconds between each retry
.WaitAndRetry( 2, retryAttempt => TimeSpan.FromSeconds(5),
    (exception, timeSpan, retryCount, context) =>
    {
        System.Console.WriteLine("Retrying... " + retryCount);
    }
);

Circuit-breaker policy is working in this way: Allow max three (3) failures in a row, next open the circuit for thirty (30) seconds. Open circuit ONLY for HTTP-500.

customCircuitBreakerPolicy = Policy<HttpResponseMessage>

// handling result or exception to execute onBreak delegate
.Handle<AggregateException>(x => 
    { return x.InnerException is HttpRequestException; })

// just break when server error will be InternalServerError
.OrResult(x => { return (int) x.StatusCode == 500; })

// Broken when fail 3 times in a row,
// and hold circuit open for 30 seconds
.CircuitBreaker(3, TimeSpan.FromSeconds(30),
    onBreak: (lastResponse, breakDelay) =>{
        System.Console.WriteLine("\n Circuit broken!");
    },
    onReset: () => {
        System.Console.WriteLine("\n Circuit Reset!");
    },
    onHalfOpen: () => {
        System.Console.WriteLine("\n Circuit is Half-Open");
    }); 

Finally, those two policies are being nested this way:

try
{
    customRetryPolicy.Execute(() =>
    customCircuitBreakerPolicy.Execute(() => {
       
       //for testing purposes "api/values", is returning 500 all time
        HttpResponseMessage msResponse
            = GetHttpResponseAsync("api/values").Result;
        
        // This just print messages on console, no pay attention
        PrintHttpResponseAsync(msResponse); 
        
        return msResponse;

   }));
}
catch (BrokenCircuitException e)
{
    System.Console.WriteLine("CB Error: " + e.Message);
}

What is the result that I expected?

  1. The first server responses is HTTP-500 (as expected)
  2. Retry#1, Failed (as expected)
  3. Retry#2, Failed (as expected)
  4. As we have three faults, circuit-breaker is now open (as expected)
  5. GREAT! This is working, perfectly!
  6. Circuit-breaker is open for the next thirty (30) seconds (as expected)
  7. Thirty seconds later, circuit-breaker is half-open (as expected)
  8. One attempt to check the endpoint health (as expected)
  9. Server response is a HTTP-500 (as expected)
  10. Circuit-breaker is open for the next thirty (30) seconds (as expected)
  11. HERE THE PROBLEM: an additional retry is launched when the circuit-breaker is already open!

Look at the images:

enter image description here

enter image description here

enter image description here

I'm trying to understand this behavior. Why one additional retry is being executed when the circuit-breaker is open for second, third,..., N time?

I've reviewed the machine state model for the retry, and circuit-breaker policies, but I don't understand why this additional retry is being performed.

Flow for the circuit-breaker: https://github.com/App-vNext/Polly/wiki/Circuit-Breaker#putting-it-all-together-

Flow for the retry policy: https://github.com/App-vNext/Polly/wiki/Retry#how-polly-retry-works

This really matter, because the time for the retry is being awaited (5 seconds for this example), and at the end, this is a waste of time for high-concurrency.

Any help / direction, will be appreciated. Many thanks.


Solution

  • With Polly.Context you can exchange information between the two policies (in your case: Retry and Circuit Breaker). The Context is basically a Dictionary<string, object>.

    So, the trick is to set a key on the onBreak then use that value inside the sleepDurationProdiver.

    Let's start with inner Circuit Breaker policy:

    static IAsyncPolicy<HttpResponseMessage> GetCircuitBreakerPolicy()
    {
        return Policy<HttpResponseMessage>
            .HandleResult(res => res.StatusCode == HttpStatusCode.InternalServerError)
            .CircuitBreakerAsync(3, TimeSpan.FromSeconds(2),
               onBreak: (dr, ts, ctx) => { ctx[SleepDurationKey] = ts; },
               onReset: (ctx) => { ctx[SleepDurationKey] = null; });
    }
    
    • It breaks after 3 subsequent failed requests
    • It stays in Open state for 2 seconds before it transits to HalfOpen
    • It sets a key on the context with the durationOfBreak value
    • When the CB goes back to "normal" Closed state (onReset) it deletes this value

    Now, let's continue with the Retry policy:

    static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy()
    {
        return Policy<HttpResponseMessage>
        .HandleResult(res => res.StatusCode == HttpStatusCode.InternalServerError)
        .Or<BrokenCircuitException>()
        .WaitAndRetryAsync(4,
            sleepDurationProvider: (c, ctx) =>
            {
                if (ctx.ContainsKey(SleepDurationKey))
                    return (TimeSpan)ctx[SleepDurationKey];
                return TimeSpan.FromMilliseconds(200);
            },
            onRetry: (dr, ts, ctx) =>
            {
                Console.WriteLine($"Context: {(ctx.ContainsKey(SleepDurationKey) ? "Open" : "Closed")}");
                Console.WriteLine($"Waits: {ts.TotalMilliseconds}");
            });
    }
    
    • It triggers either when the StatusCode is 500
      • or when there was an BrokenCircuitException
    • It triggers at most 4 times (so, all in all 5 attempts)
    • It sets the sleep duration based on the context
      • If the key is not present in the context (CB is in Open state) then it returns with 200 milliseconds
      • If the key is present in the context (CB is not in Open state) then it returns with value from the context
        • NOTE: You can add a few hundred milliseconds to this value to avoid race condition
    • It prints some values to the console inside onRetry only for debugging purposes

    Finally let's wire up the policies and test it

    const string SleepDurationKey = "Broken"; 
    static HttpClient client = new HttpClient();
    static async Task Main()
    {
        var strategy = Policy.WrapAsync(GetRetryPolicy(), GetCircuitBreakerPolicy());
        await strategy.ExecuteAsync(async () => await Get());
    }
    
    static Task<HttpResponseMessage> Get()
    {
        return client.GetAsync("https://httpstat.us/500");
    }
    
    • It uses the http://httpstat.us website to emulate an overloaded downstream
    • It combines/chains the two policies (CB inner, Retry outer)
    • It calls the Get method in an asynchronous way

    when handledEventsAllowedBeforeBreaking is 2

    The output

    Context: Closed
    Waits: 200
    Context: Open
    Waits: 2000
    Context: Open
    Waits: 2000
    Context: Open
    Waits: 2000
    

    when handledEventsAllowedBeforeBreaking is 3

    The output

    Context: Closed
    Waits: 200
    Context: Closed
    Waits: 200
    Context: Open
    Waits: 2000
    Context: Open
    Waits: 2000
    

    when handledEventsAllowedBeforeBreaking is 4

    The output

    Context: Closed
    Waits: 200
    Context: Closed
    Waits: 200
    Context: Closed
    Waits: 200
    Context: Open
    Waits: 2000