Tags: c#, asynchronous, parallel-processing, polly, bulkhead

WaitAndRetryPolicy combined with BulkheadPolicy, prioritizing retries. Is it possible?


I am evaluating the Polly library in terms of features and flexibility, and as part of the evaluation process I am trying to combine the WaitAndRetryPolicy and BulkheadPolicy policies, to achieve a combination of resiliency and throttling. The problem is that the resulting behavior of this combination does not match my expectations and preferences. What I would like is to prioritize the retrying of failed operations over executing fresh/unprocessed operations.

The rationale is that (from my experience) a failed operation has greater chances of failing again. So if all failed operations get pushed to the end of the whole process, that last part of the whole process will be painfully slow and unproductive. Not only because these operations may fail again, but also because of the required delay between each retry, that may need to be progressively longer after each failed attempt. So what I want is that each time the BulkheadPolicy has room for starting a new operation, to choose a retry operation if there is one in its queue.

Here is an example that demonstrates the undesirable behavior I would like to fix. 10 items need to be processed. All fail on their first attempt and succeed on their second attempt, resulting in a total of 20 executions. The waiting period before retrying an item is one second, and only 2 operations should be active at any moment:

var policy = Policy.WrapAsync
(
    Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(retryCount: 1, _ => TimeSpan.FromSeconds(1)),

    Policy.BulkheadAsync(
        maxParallelization: 2, maxQueuingActions: Int32.MaxValue)
);

var tasks = new List<Task>();
foreach (var item in Enumerable.Range(1, 10))
{
    int attempt = 0;
    tasks.Add(policy.ExecuteAsync(async () =>
    {
        attempt++;
        Console.WriteLine($"{DateTime.Now:HH:mm:ss} Starting #{item}/{attempt}");
        await Task.Delay(1000);
        if (attempt == 1) throw new HttpRequestException();
    }));
}
await Task.WhenAll(tasks);

Output (actual):

09:07:12 Starting #1/1
09:07:12 Starting #2/1
09:07:13 Starting #3/1
09:07:13 Starting #4/1
09:07:14 Starting #5/1
09:07:14 Starting #6/1
09:07:15 Starting #8/1
09:07:15 Starting #7/1
09:07:16 Starting #10/1
09:07:16 Starting #9/1
09:07:17 Starting #2/2
09:07:17 Starting #1/2
09:07:18 Starting #4/2
09:07:18 Starting #3/2
09:07:19 Starting #5/2
09:07:19 Starting #6/2
09:07:20 Starting #7/2
09:07:20 Starting #8/2
09:07:21 Starting #10/2
09:07:21 Starting #9/2

The expected output should be something like this (I wrote it by hand):

09:07:12 Starting #1/1
09:07:12 Starting #2/1
09:07:13 Starting #3/1
09:07:13 Starting #4/1
09:07:14 Starting #1/2
09:07:14 Starting #2/2
09:07:15 Starting #3/2
09:07:15 Starting #4/2
09:07:16 Starting #5/1
09:07:16 Starting #6/1
09:07:17 Starting #7/1
09:07:17 Starting #8/1
09:07:18 Starting #5/2
09:07:18 Starting #6/2
09:07:19 Starting #7/2
09:07:19 Starting #8/2
09:07:20 Starting #9/1
09:07:20 Starting #10/1
09:07:22 Starting #9/2
09:07:22 Starting #10/2

For example, at the 09:07:14 mark the 1-second wait period of the failed item #1 has expired, so its second attempt should be prioritized over the first attempt of item #5.

An unsuccessful attempt to solve this problem is to reverse the order of the two policies. Unfortunately, putting the BulkheadPolicy before the WaitAndRetryPolicy results in reduced parallelization. What happens is that the BulkheadPolicy considers all retries of an item to be a single operation, so the "wait" phase between two retries counts towards the parallelization limit. Obviously I don't want that. The documentation also makes it clear that the order of the two policies in my example is correct:

BulkheadPolicy: Usually innermost unless wraps a final TimeoutPolicy. Certainly inside any WaitAndRetry. The Bulkhead intentionally limits the parallelization. You want that parallelization devoted to running the delegate, not occupied by waits for a retry.
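
For concreteness, the reversed ordering that was tried and rejected would look something like the sketch below (same policies as in the example above, only wrapped in the opposite order). Since the Bulkhead is now the outer policy, each `ExecuteAsync` call holds one of the 2 parallelization slots for the entire lifetime of the operation, retries and 1-second wait periods included, so slots sit idle during the waits:

```csharp
// Reversed (rejected) ordering: Bulkhead OUTSIDE WaitAndRetry.
// A slot is occupied for the whole retry sequence of an item,
// so the delay between retries counts against maxParallelization.
var reversedPolicy = Policy.WrapAsync
(
    Policy.BulkheadAsync(
        maxParallelization: 2, maxQueuingActions: Int32.MaxValue),

    Policy
        .Handle<HttpRequestException>()
        .WaitAndRetryAsync(retryCount: 1, _ => TimeSpan.FromSeconds(1))
);
```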

Is there any way to achieve the behavior I want, while staying in the realm of the Polly library?


Solution

  • I found a simple but not perfect solution to this problem. The solution is to include a second BulkheadPolicy positioned before the WaitAndRetryPolicy (in an "outer" position). This extra Bulkhead serves only to reprioritize the workload (by acting as an outer queue), and should have a substantially larger capacity (x10 or more) than the inner Bulkhead that controls the parallelization. The reason is that the outer Bulkhead could also affect (reduce) the parallelization in an unpredictable way, and we don't want that. This is why I consider this solution imperfect: neither is the prioritization optimal, nor is it guaranteed that the parallelization will not be affected.

    Here is the combined policy of the original example, enhanced with an outer BulkheadPolicy. Its capacity is only 2.5 times larger, which is suitable for this contrived example, but too small for the general case:

    var policy = Policy.WrapAsync
    (
        Policy.BulkheadAsync( // For improving prioritization
            maxParallelization: 5, maxQueuingActions: Int32.MaxValue),
    
        Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(retryCount: 1, _ => TimeSpan.FromSeconds(1)),
    
        Policy.BulkheadAsync( // For controlling parallelization
            maxParallelization: 2, maxQueuingActions: Int32.MaxValue)
    );
    

    And here is the output of the execution:

    12:36:02 Starting #1/1
    12:36:02 Starting #2/1
    12:36:03 Starting #3/1
    12:36:03 Starting #4/1
    12:36:04 Starting #2/2
    12:36:04 Starting #5/1
    12:36:05 Starting #1/2
    12:36:05 Starting #3/2
    12:36:06 Starting #6/1
    12:36:06 Starting #4/2
    12:36:07 Starting #8/1
    12:36:07 Starting #5/2
    12:36:08 Starting #9/1
    12:36:08 Starting #7/1
    12:36:09 Starting #10/1
    12:36:09 Starting #6/2
    12:36:10 Starting #7/2
    12:36:10 Starting #8/2
    12:36:11 Starting #9/2
    12:36:11 Starting #10/2
    

    Although this solution is not perfect, I believe that it should do more good than harm in the general case, and should result in better performance overall.
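
    For the general case, the guideline above (outer capacity x10 or more times the inner parallelization) might be expressed like this. The factor of 10 is the rule of thumb stated above, not a measured optimum, and `maxParallelization` here is just an illustrative parameter:

    ```csharp
    int maxParallelization = 2; // the real concurrency limit

    var policy = Policy.WrapAsync
    (
        Policy.BulkheadAsync( // Outer: prioritization queue only
            maxParallelization: maxParallelization * 10,
            maxQueuingActions: Int32.MaxValue),

        Policy
            .Handle<HttpRequestException>()
            .WaitAndRetryAsync(retryCount: 1, _ => TimeSpan.FromSeconds(1)),

        Policy.BulkheadAsync( // Inner: controls actual parallelization
            maxParallelization: maxParallelization,
            maxQueuingActions: Int32.MaxValue)
    );
    ```

    The large outer capacity keeps the outer Bulkhead from becoming the bottleneck, so its only practical effect is that retries re-enter the inner queue ahead of items still waiting in the outer one.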