
Programmatically Re-Running SWF Workflows


We have a few thousand SWF workflows that have failed over the past year due to various activity bugs. Because the bugs were long-lived, all activity retries failed and the workflows were closed. I want to re-run all of those failed workflows, picking up at the activity that was last executed (and failed). A basic workflow retrigger.

The SWF console has a Re-Run command, but it only lets you select twenty-five workflows at a time, far fewer than the thousands I need.

I could use the CLI start-workflow-execution command (or analogous API call), but I can't figure out where to get the most recent workflow input the way the Console's 'Re-Run' operation does. I can get the most recent workflow input from get-workflow-execution-history, but that requires that I know the most recent runId and I can't find any way to get that.

To summarize:

  1. The only way I can think to programmatically re-run SWF workflows is: for each failed workflow, magically grab its most recent runId, then grab its most recent workflow input via get-workflow-execution-history, then restart it using that input via start-workflow-execution. Is there a better way?
  2. If the answer to #1 is "There is no better way," then how can I find the most recent runId for a particular workflowId?

(The fact that I can't find any documentation or discussion on such retriggers makes me worry that I am approaching this the wrong way, so I welcome feedback setting me straight.)

UPDATE: Higher level question: What is the right way to handle workflows that terminated due to error conditions that outlasted all retries? The fact that it is so difficult to retrigger SWF workflows makes me think I am misunderstanding the SWF paradigm.


Solution

  • I don't think you can do it this way for all of them. The maximum workflow history retention period is 90 days, so even if you go down the path of reading the workflow execution history, you will only be able to restart workflows that failed within the last 90 days. AWS also enforces account-level limits on the number and rate of SWF API calls, so a loop that fetches history and starts a workflow for each of thousands of executions will hit those limits quickly and start throwing exceptions. A better approach is to go back to the place where the workflow executions were originally started and re-run the failed ones from there, passing in the same input.
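For executions that are still inside the retention window, the history-based route described in the question could look roughly like the Python sketch below. It is only a sketch under stated assumptions: `swf` is expected to be a boto3 SWF client (or anything exposing the same three methods), the helper names `latest_failed_run`, `first_event_attributes`, and `rerun` are made up for illustration, and throttling is not handled, so calls would need to be paced when looping over thousands of workflows. Note that the new execution starts from the beginning with the original input; it does not resume at the failed activity.

```python
import datetime

# Optional start attributes copied from the original execution when present.
COPIED_KEYS = ("input", "executionStartToCloseTimeout", "taskStartToCloseTimeout",
               "childPolicy", "tagList", "taskPriority", "lambdaRole")


def latest_failed_run(swf, domain, workflow_id, days=90):
    """Return the info dict of the most recent FAILED run of workflow_id, or None."""
    now = datetime.datetime.now(datetime.timezone.utc)
    kwargs = {
        "domain": domain,
        "closeTimeFilter": {"oldestDate": now - datetime.timedelta(days=days),
                            "latestDate": now},
        # executionFilter cannot be combined with closeStatusFilter,
        # so the FAILED check is done client-side below.
        "executionFilter": {"workflowId": workflow_id},
    }
    failed = []
    while True:
        page = swf.list_closed_workflow_executions(**kwargs)
        failed += [e for e in page["executionInfos"]
                   if e.get("closeStatus") == "FAILED"]
        token = page.get("nextPageToken")
        if not token:
            break
        kwargs["nextPageToken"] = token
    return max(failed, key=lambda e: e["closeTimestamp"], default=None)


def first_event_attributes(swf, domain, execution):
    """Read the WorkflowExecutionStarted attributes (input, task list, ...)."""
    page = swf.get_workflow_execution_history(
        domain=domain, execution=execution, maximumPageSize=1)
    first = page["events"][0]  # history is returned oldest-first by default
    if first["eventType"] != "WorkflowExecutionStarted":
        raise ValueError("unexpected first event: " + first["eventType"])
    return first["workflowExecutionStartedEventAttributes"]


def rerun(swf, domain, workflow_id):
    """Start a fresh execution with the last failed run's original input."""
    info = latest_failed_run(swf, domain, workflow_id)
    if info is None:
        return None  # nothing failed within the retention window
    attrs = first_event_attributes(swf, domain, info["execution"])
    params = {
        "domain": domain,
        "workflowId": workflow_id,
        "workflowType": attrs["workflowType"],
        "taskList": attrs["taskList"],
    }
    params.update({k: attrs[k] for k in COPIED_KEYS if k in attrs})
    return swf.start_workflow_execution(**params)["runId"]
```

With boto3 installed and credentials configured, something like `rerun(boto3.client("swf"), "my-domain", "my-workflow-id")` would exercise it; the domain and workflow ID here are placeholders.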