Search code examples
distributed-computingrecoverydistributed-systemerror-recovery

Distributed Recovery - can this be done without timeout?


We have a mail sender application, that receives a bunch of mails in one blob, and then puts all those mails into database. This can take up to ten minutes. During this process the state of the mailing is BUILDING.

When it is finished the state gets changed to READY.

When the server crashes (shouldn't happen of course) and restarts, it looks for all mailings with status BUILDING and marks them as ERROR. This happens, because we never want to send incomplete mailings.


Now we'd like to scale up using a second server. The recovery strategy above doesn't work here.

e.g. server 1 is BUILDING a mailing, and server 2 crashes and restarts. Now server 2 will see the BUILDING mailing and doesn't know if it's been aborted or if it's running on another server.


So what's the best recovery strategy for distributed services?

(We thought about some timeout mechanism, where the BUILDING server updates a timestamp every few seconds, and when some server reboots it checks if there's a BUILDING mailing that hasn't been updated for x minutes. Then it's highly possible that this mailing has been aborted.)


EDIT:

What I'd like to achieve: If some server restarts (after a crash or just because we added a new mailing server to the cluster), it should not mark mailings as ERROR if this particular mailing is actually being built (by another server).

Nice to have: If this would work without having to store server ids, because then it's possible to easily add and/or remove servers. Else it would not be possible to completely remove some server, because then there might be a BUILDING mailing with that particular server id. But this server got removed and will never get started again. Though the only server that could set the mailing to ERROR will be gone.


Solution

  • Add two things to your state tracking: a timestamp and the server working on it.

    If a server starts up and sees anything in a building state for itself it knows it failed. Conversely, if it starts up and sees something in a building state for another server, it now has information that it's going to need to look at later to see if there's a problem that needs to be addressed. You need to worry about multiple servers restarting at the same time, so you can't just have a server grab all old bundles for all servers at startup.

    Or you can just use a clustering service for your OS.