architecture webhooks system-design exponential-backoff

Where to implement the exponential backoff algorithm in a controller-worker system?

I am trying to create a system where I need to implement the exponential backoff algorithm. I have a controller and a worker. The worker is the one that sends the request to a particular URL and waits for the response. The controller just assigns tasks to the workers that are free. In case the request from the worker fails, the failure status of the request is written to a database.

To implement the exponential backoff algorithm, should the controller run a separate thread that picks up failed requests from the database? Or is there something that can be done at the worker level without holding up the worker for the duration of the retries?

Solution

In many cases, retries with backoff are implemented inside the worker. When a controller hands a task to a worker, it just wants the job done, and retries help to ride out temporary problems such as brief network glitches.

The typical logic is (when a worker is called to run a task):

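1. Before making the request, the worker creates a counter C with an initial value of zero and reads a maximum number of attempts from configuration, e.g. 3.
2. The worker waits before the request: there is no wait on the first attempt (C = 0), and on later attempts the wait is some_delay * 2^(C-1), where some_delay is a manually configured base interval (more on this later). Note that waiting C*some_delay instead would grow the delay linearly, not exponentially.
3. The worker makes the request.
4. If the request fails, the worker checks whether all attempts have been used. If so, the failure is sent back to the controller; otherwise, C is incremented and the worker goes back to step 2.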
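
The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: `send_request`, `base_delay`, `max_attempts` and the injectable `sleep` parameter are hypothetical names, the delay doubles after each failure (the exponential variant), and random jitter, which is often added in practice, is left out.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base_delay: float) -> float:
    """Wait before the given 0-based attempt: 0, base, 2*base, 4*base, ..."""
    if attempt == 0:
        return 0.0  # no wait before the very first attempt
    return base_delay * 2 ** (attempt - 1)


def call_with_backoff(
    send_request: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send_request, retrying with a doubling delay after each failure.

    If every attempt fails, the last exception is re-raised so the caller
    (the controller, in the setup described above) can record the failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        sleep(backoff_delay(attempt, base_delay))
        try:
            return send_request()
        except Exception as exc:
            # Assumption: every exception is treated as retryable here.
            # Real code would usually retry only on transient errors.
            last_error = exc
    raise last_error
```

With `max_attempts=3` and `base_delay=1.0`, the waits before the three attempts are 0, 1 and 2 seconds.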

In the end, this results in several calls to the failing resource, with the delay increasing after each failure.

The delay constant (some_delay in the above text) is picked based on the overall system architecture. How long can the controller wait? If the controller itself times out at some point (or the controller's clients time out), then the sum of all intervals must be less than that timeout; otherwise there is no point in retrying, as the clients won't be able to get the results anyway.
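
A quick way to sanity-check a configuration against that constraint is to add up the worst case. The sketch below assumes a doubling delay that starts at `base_delay` with no wait before the first attempt; it also counts the time each request may take (`request_timeout`), which is my addition on top of the sum of intervals. All names here are hypothetical.

```python
def total_backoff(max_attempts: int, base_delay: float) -> float:
    """Sum of all waits between attempts: base + 2*base + 4*base + ..."""
    # There are max_attempts - 1 waits, because the first attempt is immediate.
    return sum(base_delay * 2 ** (a - 1) for a in range(1, max_attempts))


def fits_timeout(
    max_attempts: int,
    base_delay: float,
    request_timeout: float,
    controller_timeout: float,
) -> bool:
    """True if the worst case (every attempt times out) still finishes in time."""
    worst_case = total_backoff(max_attempts, base_delay) + max_attempts * request_timeout
    return worst_case < controller_timeout
```

For example, with 3 attempts, a 1 second base delay and a 2 second request timeout, the worst case is 3 + 6 = 9 seconds, which fits under a 10 second controller timeout; a fourth attempt pushes it to 7 + 8 = 15 seconds, which does not.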

Another topic to consider is how your application manages threads. While a worker waits for the next retry, its thread is busy sleeping, which may or may not be a problem.
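
If a sleeping thread does turn out to be a problem, one possible option (an assumption on my side, it depends on your stack) is to wait cooperatively instead of blocking. Here is a minimal sketch with Python's `asyncio`, where `send_request` is a hypothetical coroutine function; while one task is in `asyncio.sleep`, the event loop is free to run other tasks on the same thread.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_backoff_async(
    send_request: Callable[[], Awaitable[T]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Retry an async request with a doubling delay, without blocking the thread."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        if attempt > 0:
            # asyncio.sleep suspends only this task; other tasks keep running.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
        try:
            return await send_request()
        except Exception as exc:
            # Assumption: every exception is treated as retryable here.
            last_error = exc
    raise last_error
```

Several such calls can then be run side by side, e.g. with `asyncio.gather`, so a retry that is waiting does not hold up the others.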

One last point: if you already have retries with backoff, it may make sense to add the circuit breaker pattern, so that if a remote resource is down, the system won't waste time retrying over and over (and keeping threads busy with nothing).
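
To give an idea of what a circuit breaker does, here is a deliberately simplified sketch. The class, its parameters (`failure_threshold`, `reset_timeout`, `clock`) and `CircuitOpenError` are all hypothetical names, and the code is not thread-safe; in a real system you would more likely pick an existing library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, send_request: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast: do not even try the remote resource.
                raise CircuitOpenError("remote resource is marked as down")
            # reset_timeout has elapsed: let one trial call through (half-open).
        try:
            result = send_request()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The worker would wrap each request in `breaker.call(...)`; once the threshold of consecutive failures is reached, further calls are rejected immediately until `reset_timeout` has passed.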