Search code examples
distributed-computingdistributedpollingdistributed-system

How to distribute tasks between servers where each task must be done by only one server?


Goal: There are X number backend servers. There are Y number of tasks. Each task must be done only by one server. The same task ran by two different servers should not happen.

There are tasks which include continuous work for an indefinite amount of time, such as polling for data. The same server can keep doing such a task as long as the server stays alive.

Problem: How to reassign a task if the server executing it dies? If the server dies, it can't mark the task as open. What are efficient ways to accomplish this?


Solution

  • Well, the way you define your problem makes it sloppy to reason about. What you actually is looking for called a "distributed lock".

    Let's start with a simpler problem: assume you have only two concurrent servers S1, S2 and a single task T. The safety property you stated remains as is: at no point in time both S1 and S2 may process task T. How could that be achieved? The following strategies come to mind:

    1. Implement an algorithm that deterministically maps task to a responsible server. For example, it could be as stupid as if task.name.contains('foo') then server1.process(task) else server2.process(task). That works and indeed might fit some real world requirements out there, yet such an approach is a dead end: a) you have to know how many server would you have upfront, statically and - the most dangerous - 2) you can not tolerate either server being down: if, say, S1 is taken off then there is nothing you can do with T right now except then just wait for S1 to come back online. These drawbacks could be softened, optimized - yet there is no way to get rid of them; escaping these deficiencies requires a more dynamic approach.

    2. Implement an algorithm that would allow S1 and S2 to agree upon who is responsible for the T. Basically, you want both S1 and S2 to come to a consensus about (assumed, not necessarily needed) T.is_processed_by = "S1" or T.is_processed_by = "S2" property's value. Then your requirement translates to the "at any point in time is_process_by is seen by both servers in the same way". Hence "consensus": "an agreement (between the servers) about a is_processed_by value". Having that eliminates all the "too static" issues of the previous strategy: actually, you are no longer bound to 2 servers, you could have had n, n > 1 servers (provided that your distributed consensus works for a chosen n), however it is not prepared for accidents like unexpected power outage. It could be that S1 won the competition, is_processed_by became equal to the "S1", S2 agreed with that and... S1 went down and did nothing useful....

    ...so you're missing the last bit: the "liveness" property. In simple words, you'd like your system to continuously progress whenever possible. To achieve that property - among many other things I am not mentioning - you have to make sure that spontaneous server's death is monitored and - once it happened - not a single task T gets stuck for indefinitely long. How do you achieve that? That's another story, a typical piratical solution would be to copy-paste the good old TCP's way of doing essentially the same thing: meet the keepalive approach.

    OK, let's conclude what we have by now:

    1. Take any implementation of a "distributed locking" which is equivalent to "distributed consensus". It could be a ZooKeeper done correctly, a PostgreSQL running a serializable transaction or whatever alike.
    2. Per each unprocessed or stuck task T in your system, make all the free servers S to race for that lock. Only one of them guaranteed to win and all the rest would surely loose.
    3. Frequently enough push sort of TCP's keepalive notifications per each processing task or - at least - per each alive server. Missing, let say, three notifications in a sequence should be taken as server's death and all of it's tasks should be re-marked as "stuck" and (eventually) reprocessed in the previous step.

    And that's it.

    P.S. Safety & liveness properties is something you'd definitely want to be aware of once it comes to distributed computing.