spring-boot, microservices, circuit-breaker, retry-logic, resilience4j-retry

What is the difference between Circuit Breaker and Retry in a Spring Boot microservice?


One of my colleagues asked me what the difference between Circuit Breaker and Retry is, but I was not able to answer him correctly. All I know is that a circuit breaker is useful if there is a heavy request payload, but this can also be achieved using retry. So when should one use Circuit Breaker and when Retry?

Also, is it possible to use both on the same API?


Solution

  • Several years ago I wrote a resilience catalog to describe different mechanisms. I originally created the document for co-workers and then shared it publicly. Please allow me to quote the relevant parts here.

    Retry

    Categories: reactive, after the fact

    The relation between retries and attempts: n retries means at most n + 1 attempts. The +1 is the initial request; if it fails (for whatever reason) then the retry logic kicks in. In other words, the 0th step is executed with a 0 delay penalty.
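
    The retries-vs-attempts arithmetic above can be sketched as a minimal loop (illustrative names, not the resilience4j API):

    ```java
    // Sketch of the retries-vs-attempts relation: n retries allow at most n + 1 attempts.
    public class RetryArithmetic {
        // Executes the action until it succeeds or the retry budget is exhausted.
        // Returns the number of attempts actually made.
        static int run(int maxRetries, java.util.function.BooleanSupplier action) {
            int attempts = 0;
            for (int i = 0; i <= maxRetries; i++) { // the 0th iteration is the initial request
                attempts++;
                if (action.getAsBoolean()) break;   // success: stop retrying
            }
            return attempts;
        }

        public static void main(String[] args) {
            // An action that always fails uses the whole budget: 3 retries = 4 attempts.
            System.out.println(run(3, () -> false)); // prints 4
            // An action that succeeds immediately makes exactly 1 attempt (0 delay penalty).
            System.out.println(run(3, () -> true));  // prints 1
        }
    }
    ```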

    There are situations where your requested operation relies on a resource that might not be reachable at a certain point in time. In other words, there can be a temporal issue that will be gone sooner or later. This sort of issue causes transient failures. With retries you can overcome these problems by attempting to redo the same operation at a specific moment in the future. To be able to use this mechanism, the following criteria should be met:

    • The potentially introduced observable impact is acceptable
    • The operation can be redone without any irreversible side effect
    • The introduced complexity is negligible compared to the promised reliability

    Let’s review them one by one:

    • The word failure indicates that the effect is observable by the requester as well, for example via higher latency or reduced throughput. If the “penalty” (delay or reduced performance) is unacceptable, then retry is not an option for you.
    • This requirement is also known as an idempotent operation: if I call the action with the same input several times, it will produce the exact same result. In other words, the operation acts as if it depended only on its parameters and nothing else influenced the result (such as other objects' state).
    • Even though this condition is one of the most crucial, it is the one that is almost always forgotten. As always, there are trade-offs (if I introduce Z then it will increase X but it might decrease Y).
      • We should be fully aware of them; otherwise they will give us unwanted surprises at the least expected time.
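
    Under the assumptions above (idempotent operation, transient failures signalled as RuntimeException), the mechanism can be sketched as a minimal retry helper with exponential backoff. This is an illustration, not the resilience4j API:

    ```java
    import java.util.function.Supplier;

    // Minimal retry-with-exponential-backoff sketch (illustrative, not resilience4j).
    public class Retrier {
        // Retries the supplier on RuntimeException, doubling the delay each time.
        static <T> T callWithRetry(Supplier<T> op, int maxRetries, long initialDelayMs)
                throws InterruptedException {
            long delay = initialDelayMs;
            for (int attempt = 0; ; attempt++) {
                try {
                    return op.get();                   // the 0th attempt has no delay penalty
                } catch (RuntimeException transientFailure) {
                    if (attempt == maxRetries) throw transientFailure; // budget exhausted
                    Thread.sleep(delay);               // observable impact: added latency
                    delay *= 2;                        // exponential backoff between attempts
                }
            }
        }
    }
    ```

    Note how the trade-off from the third criterion shows up directly: every retry buys reliability at the price of extra latency for the requester.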

    Circuit Breaker

    Categories: proactive, before the fact

    It is hard to categorize the circuit breaker because it is proactive and reactive at the same time: it detects that a given downstream system is malfunctioning (reactive) and it protects the downstream system from being flooded with new requests (proactive).

    This is one of the most complex patterns, mainly because it uses different states to define different behaviours. Before we jump into the details, let's see why this tool exists at all:

    “A circuit breaker detects failures and prevents the application from trying to perform the action that is doomed to fail (until it is safe to retry).” (Wikipedia)

    So, this tool works as a mini data and control plane. The requests go through this proxy, which examines the responses (if any) and counts subsequent failures. If a predefined threshold is reached, then the transfer is suspended temporarily and requests fail immediately.

    • Why is it useful?

    It prevents cascading failures. In other words, the transient failure of a downstream system should not be propagated to the upstream systems. By containing the failure we are actually preventing a chain reaction (domino effect) as well.

    • How does it know when a transient failure is gone?

    It must somehow determine when it would be safe to operate as a proxy again. For example, it can use the same detection mechanism that was used during the original failure detection. It works like this: after a given period of time it allows a single request through and examines the response. If it succeeds, the downstream is treated as healthy again. Otherwise nothing changes (no request is transferred through this proxy); only the timer is reset.

    • What states does it use?

    The circuit breaker can be in any of the following states: Closed, Open, HalfOpen.

    • Closed: It allows any request. It counts successive failed requests.
      • If the successive failed count is below the threshold and the next request succeeds then the counter is set back to 0.
      • If the predefined threshold is reached then it transitions into Open.
    • Open: It rejects any request immediately. It waits a predefined amount of time.
      • If that time has elapsed then it transitions into HalfOpen.
    • HalfOpen: It allows only one request. It examines the response of that request:
      • If the response indicates success then it transitions into Closed
      • If the response indicates failure then it transitions back to Open
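
    The state machine above can be sketched as a minimal, illustrative implementation. The class name, the threshold, and the timer parameters are assumptions for the sketch, not the resilience4j internals:

    ```java
    // Minimal circuit breaker state machine sketch (threshold- and timer-based).
    public class CircuitBreaker {
        enum State { CLOSED, OPEN, HALF_OPEN }

        private final int failureThreshold;  // successive failures before opening
        private final long openDurationMs;   // how long OPEN rejects before probing
        private State state = State.CLOSED;
        private int successiveFailures = 0;
        private long openedAt;

        CircuitBreaker(int failureThreshold, long openDurationMs) {
            this.failureThreshold = failureThreshold;
            this.openDurationMs = openDurationMs;
        }

        // Asks whether a request may pass through the proxy right now.
        synchronized boolean allowRequest(long nowMs) {
            if (state == State.OPEN && nowMs - openedAt >= openDurationMs) {
                state = State.HALF_OPEN;     // timer elapsed: allow a probe request
            }
            return state != State.OPEN;      // OPEN rejects immediately
        }

        // Reports the outcome of a request that was allowed through.
        synchronized void record(boolean success, long nowMs) {
            if (success) {
                state = State.CLOSED;        // healthy again (also closes from HALF_OPEN)
                successiveFailures = 0;
            } else if (state == State.HALF_OPEN || ++successiveFailures >= failureThreshold) {
                state = State.OPEN;          // probe failed or threshold reached
                openedAt = nowMs;            // (re)start the timer
            }
        }

        synchronized State state() { return state; }
    }
    ```

    A real implementation would additionally ensure that HalfOpen lets through only a single in-flight probe; the sketch leaves that to the caller for brevity.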

    Resiliency strategy

    The above two mechanisms/policies are not mutually exclusive; on the contrary, they can be combined via an escalation mechanism: if the inner policy can't handle the problem, it can propagate it one level up to an outer policy.

    When you try to perform a request while the Circuit Breaker is Open, it will throw an exception. Your retry policy can trigger on that and adjust its sleep duration (to avoid unnecessary attempts).

    The downstream system can also inform the upstream that it is receiving too many requests via a 429 status code. The Circuit Breaker could also trigger on this and use the Retry-After header's value for its sleep duration.

    So, the whole point of this section is that you can define a protocol between client and server for how to overcome transient failures together.
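
    As an illustration of this escalation, here is a sketch of an outer retry policy that honours a sleep-duration hint coming from an open circuit (or a 429 Retry-After header). CircuitOpenException and its retryAfterMs field are hypothetical names, not a real library API:

    ```java
    // Escalation sketch: an outer retry policy wraps an inner, breaker-guarded call.
    public class EscalationDemo {
        // Hypothetical exception the inner layer throws while the circuit is Open.
        static class CircuitOpenException extends RuntimeException {
            final long retryAfterMs;  // hint from the breaker or a 429 Retry-After header
            CircuitOpenException(long retryAfterMs) { this.retryAfterMs = retryAfterMs; }
        }

        // On an open circuit the outer policy sleeps for the hinted duration instead of
        // its normal backoff, avoiding attempts that are doomed to fail.
        static String callWithEscalation(java.util.function.Supplier<String> inner,
                                         int maxRetries, long baseDelayMs)
                throws InterruptedException {
            for (int attempt = 0; ; attempt++) {
                try {
                    return inner.get();
                } catch (CircuitOpenException open) {
                    if (attempt == maxRetries) throw open;       // escalate further up
                    Thread.sleep(open.retryAfterMs);             // honour the hint
                } catch (RuntimeException transientFailure) {
                    if (attempt == maxRetries) throw transientFailure;
                    Thread.sleep(baseDelayMs << attempt);        // normal exponential backoff
                }
            }
        }
    }
    ```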