Tags: amazon-web-services, azure, rest, microservices, envoyproxy

How to implement resiliency (retry) in a nested service call chain


We have a webpage that queries an item from an API Gateway, which in turn calls a service that calls another service, and so on.

Webpage --> API Gateway --> service#1 --> service#2 --> data store (RDBMS, S3, Azure Blob)

We want to make the operation resilient so we added a retry mechanism at every layer.

Webpage --retry--> API Gateway --retry--> service#1 --retry--> service#2 --retry--> data store.

This, however, could cause a cascading failure: if the data store doesn't respond in time, every layer will time out and retry. In other words, if each layer has the same connection timeout and is configured to retry 3 times, the retries multiply across the four layers, producing a total of 3^4 = 81 retries against the data store (which is called a retry storm).
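The amplification is worth seeing as arithmetic. A quick sketch (assuming, as above, 3 attempts per retrying layer across 4 layers):

```python
def calls_to_data_store(layers: int, attempts_per_layer: int) -> int:
    """Worst-case requests reaching the data store when every layer
    retries independently: each layer multiplies the attempts of the
    layer above it."""
    return attempts_per_layer ** layers

# Webpage, API Gateway, service#1, service#2, each making 3 attempts:
print(calls_to_data_store(4, 3))  # 81
```

Adding one more retrying layer, or one more retry per layer, grows this exponentially, which is why the storm gets out of hand so quickly.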

One way to fix this is to increase the timeout at each layer in order to give the layer below time to retry.

Webpage --5m timeout--> API Gateway --2m timeout--> service#1

This, however, is unacceptable because the resulting timeout at the webpage would be far too long.
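To see why the timeouts blow up, note that each layer's timeout has to cover all of its child's attempts, so the budgets multiply up the chain. A sketch of the math (assuming 3 attempts per layer and a hypothetical 10-second data-store timeout):

```python
def required_timeout(layers: int, attempts_per_layer: int,
                     base_timeout_s: float) -> float:
    """Timeout the top layer needs so each layer below can finish all
    of its attempts: timeout(n) = attempts_per_layer * timeout(n - 1)."""
    t = base_timeout_s
    for _ in range(layers):
        t *= attempts_per_layer
    return t

# 4 layers, 3 attempts each, 10 s at the bottom:
print(required_timeout(4, 3, 10.0))  # 810.0 seconds at the webpage
```

Thirteen-plus minutes for a page load is clearly a non-starter, which is the problem the question describes.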

How should I address this problem?

Should only one layer retry? If so, which layer? And how can that layer know whether an error is transient?


Solution

  • A couple of possible solutions (and you can/should use both) are to retry only on specific conditions and to implement rate limiters/circuit breakers.

    Retry On is a technique where you don't retry on every failure, but only on specific conditions. This could be a specific error code or a specific header value. E.g. in your current situation, DO NOT retry on timeouts; only retry on server failures (e.g. 5xx responses). In addition, you could have each layer retry on different conditions.

    Rate limiting would be to place either a local or global rate-limiter service inline on the connections. This helps short-circuit the thundering herd if one starts. E.g. rate limit the data layer to X req/s (insert real values here) and the gateway to Y req/s; then even if a service attempts lots of retries, they won't propagate far down the chain. Circuit breaking is similar: each layer only permits X active connections to any downstream, so it is just another way to dampen those retry storms.
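Both ideas can be sketched in a few lines. The classes and limits below are illustrative placeholders rather than a production implementation (Envoy, for instance, provides these natively via its rate limit filter and per-cluster circuit-breaker settings):

```python
import threading
import time

class TokenBucket:
    """Local rate-limiter sketch: allow at most `rate` requests per
    second, with bursts up to `capacity`."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # shed the request instead of forwarding the retry

class ConcurrencyLimit:
    """Circuit-breaker-style cap: at most `max_active` in-flight calls
    to a downstream; requests beyond that fail fast."""
    def __init__(self, max_active: int):
        self._sem = threading.BoundedSemaphore(max_active)

    def try_acquire(self) -> bool:
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()
```

Either guard, placed in front of a downstream call, turns a would-be retry storm into fast local failures: excess requests are rejected at the layer where they arise rather than amplified all the way down to the data store.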