c# .net polly circuit-breaker retry-logic

Should I implement both Retry Policy and Circuit Breaker on nested methods calling external resources?

I have a multi-layered application where Method1 calls Method2. Inside Method2, I have interactions with external resources like Redis and Event Hub. I have already implemented a Retry policy on Method1.

Now, I'm considering implementing a Circuit Breaker pattern for better fault tolerance. My primary concern is where to implement this Circuit Breaker pattern. Should it be at the level of Method2 where the external resources are being accessed?

Additionally, should I implement another Retry policy specifically in Method2, or would that be overkill since there's already a Retry policy in Main calling Method1?

Here's a simplified outline of the call hierarchy for context:

using Polly;
using Polly.CircuitBreaker;
using System;

class Program
{
    // Define Circuit Breaker Policy as a static member
    private static readonly CircuitBreakerPolicy circuitBreakerPolicy = Policy
        .Handle<Exception>()
        .CircuitBreaker(2, TimeSpan.FromMinutes(1));

    static void Main(string[] args)
    {
        // Define Retry Policy
        var retryPolicy = Policy
            .Handle<Exception>()
            .Retry(3);

        // Wrap the Retry policy around Method1
        try
        {
            retryPolicy.Execute(Method1);
        }
        catch (Exception e)
        {
            Console.WriteLine($"Failed to execute Method1. Reason: {e.Message}");
        }
    }

    static void Method1()
    {
        Console.WriteLine("Executing Method1");
        //Other calls to external resources
        Method2();
    }

    static void Method2()
    {
        Console.WriteLine("Executing Method2");

        circuitBreakerPolicy.Execute(() =>
        {
            // Simulate external resource call
            Console.WriteLine("Calling external resources like Redis, Event Hub etc.");
            
            // Uncomment to simulate failure
            // throw new Exception("Simulated external resource failure");
        });
    }
}

I would love to hear some best practices or experiences on how to approach this issue.

Solution

Circuit Breaker

This resiliency pattern is used to prevent overloading an already struggling service.

It detects that a service is struggling either by counting consecutive failures (the standard Circuit Breaker) or by measuring the failure ratio during a sampling period, which suits fluctuating load (the Advanced Circuit Breaker).

If the downstream is considered unhealthy, the circuit breaker does not allow new requests through and short-circuits them by throwing a BrokenCircuitException.
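To make the short-circuiting behaviour concrete, here is a minimal sketch that assumes the Polly v7 package (the same synchronous syntax as in the question) and C# top-level statements. The threshold, break duration and simulated failure are illustrative values only.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

// Illustrative values: break after 2 consecutive failures, stay open for 30 seconds.
CircuitBreakerPolicy breaker = Policy
    .Handle<Exception>()
    .CircuitBreaker(2, TimeSpan.FromSeconds(30));

for (int attempt = 1; attempt <= 3; attempt++)
{
    try
    {
        // Simulated downstream call that always fails.
        breaker.Execute(() => { throw new InvalidOperationException("Simulated downstream failure"); });
    }
    catch (BrokenCircuitException)
    {
        // The breaker rejected the call without invoking the delegate.
        Console.WriteLine($"Attempt {attempt}: short-circuited, state = {breaker.CircuitState}");
    }
    catch (InvalidOperationException)
    {
        // The delegate ran and its own exception was rethrown.
        Console.WriteLine($"Attempt {attempt}: downstream call failed, state = {breaker.CircuitState}");
    }
}
```

With these settings the first two attempts should reach the simulated downstream, and the third should be rejected with a BrokenCircuitException because the second consecutive failure opened the circuit.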

The key point here is that you should define a CB per downstream service. If your methods talk to multiple services, then you should have multiple circuit breakers as well. This makes sense, since:

they might indicate unhealthiness differently
their failure thresholds might differ
their incoming request rates and volumes might differ
etc.

So, having a single shared CB is not enough in your use case.
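One way this could look for the Redis and Event Hub calls in Method2 is sketched below, assuming Polly v7. The DownstreamBreakers class, all thresholds and the Console.WriteLine placeholders are hypothetical. Real code would also narrow Handle<Exception>() to the transient exception types of each client library.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

// Hypothetical holder class: one shared breaker per downstream service.
static class DownstreamBreakers
{
    // Consecutive-failure breaker for Redis (illustrative thresholds).
    public static readonly CircuitBreakerPolicy Redis = Policy
        .Handle<Exception>()
        .CircuitBreaker(5, TimeSpan.FromSeconds(15));

    // Sampling-based breaker for Event Hub (illustrative thresholds).
    public static readonly CircuitBreakerPolicy EventHub = Policy
        .Handle<Exception>()
        .AdvancedCircuitBreaker(
            0.5,                      // break when 50% of the sampled calls fail
            TimeSpan.FromSeconds(30), // sampling duration
            10,                       // minimum throughput before the ratio is evaluated
            TimeSpan.FromMinutes(1)); // duration of break
}

class Program
{
    static void Main()
    {
        Method2();
    }

    static void Method2()
    {
        // Each external call goes through its own breaker, so a struggling
        // Redis does not block calls to Event Hub, and vice versa.
        DownstreamBreakers.Redis.Execute(() => Console.WriteLine("Calling Redis"));
        DownstreamBreakers.EventHub.Execute(() => Console.WriteLine("Calling Event Hub"));
    }
}
```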

I'm glad to see that you have defined your circuit breaker as a shared resource. This is important since Circuit Breaker is a stateful policy (here I have detailed how it is achieved). In other words, if you re-created it for each and every external call, you would lose the information that the previous CB had already detected that the downstream is unhealthy.

So, please keep all your circuit breakers as shared and do not create them on-demand.
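The difference between a shared breaker and one created on demand can be illustrated with a small sketch, again assuming Polly v7. CreateBreaker and CountShortCircuits are hypothetical helper names introduced only for this comparison.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

class Program
{
    // Hypothetical factory: every call returns a fresh breaker with no failure history.
    static CircuitBreakerPolicy CreateBreaker() => Policy
        .Handle<Exception>()
        .CircuitBreaker(2, TimeSpan.FromMinutes(1));

    // A single instance that keeps its state across calls.
    static readonly CircuitBreakerPolicy SharedBreaker = CreateBreaker();

    // Runs four always-failing calls and counts how many the breaker rejected.
    static int CountShortCircuits(Func<CircuitBreakerPolicy> getBreaker)
    {
        int shortCircuited = 0;
        for (int i = 0; i < 4; i++)
        {
            try
            {
                getBreaker().Execute(() => { throw new InvalidOperationException("Simulated failure"); });
            }
            catch (BrokenCircuitException) { shortCircuited++; }
            catch (InvalidOperationException) { /* the call reached the failing downstream */ }
        }
        return shortCircuited;
    }

    static void Main()
    {
        Console.WriteLine($"On-demand breakers short-circuited {CountShortCircuits(CreateBreaker)} calls");
        Console.WriteLine($"Shared breaker short-circuited {CountShortCircuits(() => SharedBreaker)} calls");
    }
}
```

The on-demand variant should never short-circuit, because each new breaker starts closed with a failure count of zero, while the shared breaker should open after the second failure and reject the remaining two calls.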

Retry

As I've indicated in the comments section, not all actions are retryable. To use retry, you have to be sure that all prerequisites are met. In short:

you should retry only transient failures (retrying won't help with a permanent failure, for example when input validation fails)
you should retry only idempotent actions (retrying could cause harm if your action produces side effects and you don't have de-duplication logic in place)
you should retry only if the added delay is acceptable (with too many retries or overly long sleep durations, the whole process might time out)
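The first and third prerequisites can be sketched with Polly v7 as follows. Treating TimeoutException as the transient failure and ArgumentException as the permanent one is an assumption made for illustration, as are the retry count and delays. Idempotency cannot be enforced by the policy itself, so it remains the caller's responsibility.

```csharp
using System;
using Polly;

// Retry only the exception type assumed to be transient, with a bounded backoff:
// 200 ms, 400 ms, 800 ms, so the worst-case added delay is about 1.4 seconds.
var retry = Policy
    .Handle<TimeoutException>()
    .WaitAndRetry(3, attempt => TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt - 1)));

// Transient case: the simulated operation fails twice, then succeeds.
int calls = 0;
retry.Execute(() =>
{
    calls++;
    if (calls < 3) throw new TimeoutException("Simulated transient failure");
});
Console.WriteLine($"Transient case succeeded after {calls} calls");

// Permanent case: the exception type is not handled, so no retry takes place.
int validationCalls = 0;
try
{
    retry.Execute(() =>
    {
        validationCalls++;
        throw new ArgumentException("Simulated validation failure");
    });
}
catch (ArgumentException)
{
    Console.WriteLine($"Permanent failure surfaced after {validationCalls} call(s)");
}
```

Handling a narrow exception type instead of Exception keeps permanent failures from being retried, and a fixed retry count with known delays makes it possible to check the worst-case added latency against the caller's timeout.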

Combining the two

I have detailed this topic many times, so please allow me to just add some links here: