Search code examples
queuedistributed

Distributed queue consumers in an unstable net


I'm working on the design of a distributed system. The system consists of multiple producers, distributed queue and multiple consumers aka workers.

Workers instances resides within datacentres in different locations. Sometimes one location is manually disconnected.

In such a case, the issue is the worker from the disconnected location got some task from the queue and is then shutting down before task completion. I want:

  1. workers from an alive location be able to got such a task and complete it

  2. when a disconnected worker finally turns on, it should determine if the task was already completed by another worker and decide what to do with it

What is a convenient way to solve such an issue?


Solution

  • This design might help you. Every time a worker consumes a task, move the task from queue to some other distributed list of consumed tasks. In this list of tasks, maintain a timestamp with every task.

    Then the worker that consumed the task should send some kind of still alive message every second or so (similar to Hadoop's hearbeat message) that updates the timestamp of a task in consumed tasks list. This is to indicate that the worker who consumed this task is still alive and received a message from him recently.

    Now, implement a daemon to monitor this consumed tasks list and move the tasks back to queue whose timestamp is older than a threshold number of seconds (considering message losses).