
RunDeck: retry a failed job, but only on the nodes where it failed


I need to run Ansible playbooks on a set of hosts (RunDeck nodes). These nodes are often unreachable (IoT/home devices), so I would like a job that implements the following logic:

  • tries to execute a specific playbook on a set of 100 nodes
  • retries only on the nodes where it failed (does not rerun on all 100 nodes)
  • keeps running indefinitely, to make sure the job is eventually executed on all required nodes

Now: the retry option (https://docs.rundeck.com/docs/manual/creating-jobs.html#retry) seems to restart the whole job, so it is not what I want. Is that correct? Is there any way to achieve the above? (I run similar jobs on Apache Airflow, where I can easily retry only the failed tasks.)

Thanks,


Solution

  • Set your job to run on the nodes in parallel (edit the job, go to the Nodes tab, scroll down to the "Thread Count" section, and set how many nodes to run on at once). Then, when the job fails on some nodes, you can run it again on only the failed nodes.
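
If the rerun needs to happen automatically rather than by hand, the same idea (run in parallel, collect the failures, rerun only those) can be sketched in a small wrapper script. This is a minimal sketch and not RunDeck's own mechanism: `run_on_node` is a hypothetical callable standing in for whatever actually triggers the playbook on one node, and `thread_count`, `delay_seconds`, and `max_rounds` are illustrative parameter names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional


def run_once(nodes: Iterable[str],
             run_on_node: Callable[[str], bool],
             thread_count: int) -> List[str]:
    """Run on every node in parallel and return the nodes that failed."""
    node_list = list(nodes)
    if not node_list:
        return []

    def attempt(node: str) -> bool:
        # run_on_node is a hypothetical hook: it should return True on
        # success. Any exception (e.g. host unreachable) counts as a failure.
        try:
            return bool(run_on_node(node))
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=max(1, thread_count)) as pool:
        results = list(pool.map(attempt, node_list))
    return [node for node, ok in zip(node_list, results) if not ok]


def run_until_all_succeed(nodes: Iterable[str],
                          run_on_node: Callable[[str], bool],
                          thread_count: int = 10,
                          delay_seconds: float = 60.0,
                          max_rounds: Optional[int] = None) -> List[str]:
    """Rerun only the failed nodes until none are left.

    With max_rounds=None this loops until every node has succeeded.
    Returns the nodes that are still failing when the loop stops.
    """
    pending = list(nodes)
    rounds = 0
    while pending and (max_rounds is None or rounds < max_rounds):
        if rounds > 0:
            time.sleep(delay_seconds)  # wait before retrying unreachable nodes
        pending = run_once(pending, run_on_node, thread_count)
        rounds += 1
    return pending
```

Each round only touches the nodes that failed in the previous round, so a node that succeeded is never run again. Leaving `max_rounds` unset gives the "keep running forever" behaviour from the question; setting it bounds the number of attempts.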