asynchronous design-patterns architecture microservices jobs

Handle DB call failure in background jobs

I have the below workflow:

UI makes an API call to delete a resource. This deletion is time taking operation where a lot of external calls are made. So actual deletion is run as an async/background workflow.
The API handler changes the state of the resource in db to Deleting, starts a background process and returns ok to the caller. There is a separate job table that keeps track of these background process & its state.
UI will continue to poll the svc to get the current state of the resource. After step2, it will be Deleting. If everything goes well, background process will either complete the workflow & change state to Deleted or if there is an err, state becomes DeleteFailed. This status will be reflected in UI.

Note: Once the state is Deleting, delete button is disabled on UI. If DeleteFailed, Delete button is enabled again & the caller can call Delete API again

Problem:

If there is an infra/db failure during background thread execution, the resource will continue to be in Deleting state forever.

Solution:

Infra failure - have added Job recovery code on node startup (from main method) which will check if there are jobs in running state and if so rerun the job. Skipping implementation details here. This is solved
DB call failure - In this case, the job completes (could be success/failure), but the DB record is left in hanging state (Deleting). How to solve this ?
- Making another Delete call from UI is not an option as 'Delete button' is disabled
- Only solution I can think of - have a Sanitiser/Cleanup thread that periodically scans the table for records that are in hanging state & clean it up.

Thoughts? Any better solution? There should be some standard way to solve this as it seems like a common problem.

Solution

I would consider the following basic things when designing such a flow:

Identify your "master" i.e the guy controlling the whole operation. If the master itself doesnt have state, you would need to have a state master that has has state, ideally first subpart of your whole thing.
What are your actual "atomic" subparts of the whole operation? These subparts can be different services handling their atomicity recurcively as this. i.e Using their own staged transactions, or single atomic db transactions
Once you have subparts, see what are the associated tracker states needed.
Once you have those, try and see if you really need atomicity in given subparts. Can you reduce the number of subparts to make things simpler. Lesser the atomic or dont need atomicity king subparts, the better.
Once these is there, you need to do a "staged transaction", where master handles its state and keeps track of it and attempts the whole thing.
Define your retry policies. Ideally you should do few inline retries in a subpart operation, then if that doesnt work out, you need to abort operation there and retry later as per some criterion, i.e infra issues, db issues, manual intervention needed, etc.
Do you "out of band" retries, either using a cleanup thread or a "dead letter queue". The second one will need to have a additional operation at the start of your process.
Define your failure conditions and/or report/alert/manual intervention required conditions and do that wherever needed, i.e inline or out of band.

In your particular case, If your db failures are in the master, you can do a atomic transaction and detect it later as a db commit failure. If it is in a subpart and you want the state to be in itself, then that subpart would need to handle failure.

Cleanup threads doing a scan may not be a great idea if you dont have a way to get the "stuck" operations efficiently. In that case a dead letter queue would be simpler.