Limit retries by time instead of attempts


I have a job that processes items from a queue. When the queue is empty, it finishes in seconds. However, if there are items in the queue, it will sometimes take minutes (or hours) to finish.

I want it to retry automatically after a failure (with a 5-minute wait) UNLESS the last successful run was more than 1 hour ago. If it fails and the last success was more than 1 hour ago, I want it to notify us (via a notification plugin).

How can I do that?


Solution

  • You can create a "monitor job" that checks the status of your "main job" by calling the Rundeck API from an inline-script step.

    The inline script fetches the latest execution status of the main job and computes the minutes elapsed between the latest successful execution and the current time, all via the API. The script needs jq and dateutils. If the latest execution failed and the last success was more than 60 minutes ago, the script disables the schedule on the main job and exits with a non-zero code, which fails the monitor job and sends an email through its "On Failure" notification.

    The HelloWorld job (the "main job" to be monitored; I assume it is a scheduled job) has a single retry with a 5-minute delay:

    <joblist>
      <job>
        <defaultTab>nodes</defaultTab>
        <description></description>
        <executionEnabled>true</executionEnabled>
        <id>3fc55b67-b2a1-4e92-b949-7ed464041993</id>
        <loglevel>INFO</loglevel>
        <name>HelloWorld</name>
        <nodeFilterEditable>false</nodeFilterEditable>
        <plugins />
        <retry delay='5m'>1</retry>
        <schedule>
          <month month='*' />
          <time hour='17' minute='15' seconds='0' />
          <weekday day='*' />
          <year year='*' />
        </schedule>
        <scheduleEnabled>true</scheduleEnabled>
        <sequence keepgoing='false' strategy='node-first'>
          <command>
            <exec>eho "hello world"</exec>
          </command>
        </sequence>
        <uuid>3fc55b67-b2a1-4e92-b949-7ed464041993</uuid>
      </job>
    </joblist>
    

    And the "Monitor" job, which is scheduled every 15 minutes and checks all the conditions in an inline script using the Rundeck API:

    <joblist>
      <job>
        <defaultTab>nodes</defaultTab>
        <description></description>
        <executionEnabled>true</executionEnabled>
        <id>7a4b117c-75cc-4b49-a504-703e27941170</id>
        <loglevel>INFO</loglevel>
        <name>Monitor</name>
        <nodeFilterEditable>false</nodeFilterEditable>
        <notification>
          <onfailure>
            <email attachLog='true' attachLogInFile='true' recipients='[email protected]' subject='job failure' />
          </onfailure>
        </notification>
        <notifyAvgDurationThreshold />
        <plugins />
        <schedule>
          <month month='*' />
          <time hour='*' minute='0/15' seconds='0' />
          <weekday day='*' />
          <year year='*' />
        </schedule>
        <scheduleEnabled>true</scheduleEnabled>
        <sequence keepgoing='false' strategy='node-first'>
          <command>
            <description>Monitor the main job</description>
            <fileExtension>.sh</fileExtension>
            <script><![CDATA[# get the latest execution
    latest_execution_status=$(curl -s --location --request GET 'http://localhost:4440/api/38/job/3fc55b67-b2a1-4e92-b949-7ed464041993/executions?max=1' \
    --header 'Accept: application/json' \
    --header 'X-Rundeck-Auth-Token: 6pSjAdcnNInfJ3Y8PHoC6EN7KGzjecEe' \
    --header 'Content-Type: application/json' | jq -r '.executions | .[].status')
    
    # now get the latest successful execution date
    latest_succeeded_execution_date=$(curl -s --location --request GET 'http://localhost:4440/api/38/job/3fc55b67-b2a1-4e92-b949-7ed464041993/executions?status=succeeded&max=1' \
    --header 'Accept: application/json' \
    --header 'X-Rundeck-Auth-Token: 6pSjAdcnNInfJ3Y8PHoC6EN7KGzjecEe' \
    --header 'Content-Type: application/json' | jq -r '.executions | .[]."date-ended".date' | sed 's/.$//')
    
    # the current date (to compare with the latest execution date)
    current_date=$(date --iso-8601=seconds)
    
    # comparing the two dates using dateutils
    time_between_latest_succeeded_and_current_time=$(dateutils.ddiff "$latest_succeeded_execution_date" "$current_date" -f "%M")
    
    # just for debug
    echo "LATEST EXECUTION STATE: $latest_execution_status"
    echo "LATEST SUCCEEDED EXECUTION DATE: $latest_succeeded_execution_date"
    echo "CURRENT DATE: $current_date"
    echo "MINUTES SINCE LATEST SUCCEEDED EXECUTION: $time_between_latest_succeeded_and_current_time minutes"
    
    # If it fails and the last success was more than 1 hour ago, we want it to notify us (via notification plugin)
    if [ "$latest_execution_status" = 'failed' ] && [ "$time_between_latest_succeeded_and_current_time" -gt 60 ]; then
      # disable main job schedule
      curl -s --location --request POST 'http://localhost:4440/api/38/job/3fc55b67-b2a1-4e92-b949-7ed464041993/schedule/disable' --header 'Accept: application/json' --header 'X-Rundeck-Auth-Token: 6pSjAdcnNInfJ3Y8PHoC6EN7KGzjecEe' --header 'Content-Type: application/json' 
    
      # with exit 1 this job fails and the notification "on failure" is triggered
      exit 1 
    else
        echo "all ok!"
    fi
    
    # all done.]]></script>
            <scriptargs />
            <scriptinterpreter>/bin/bash</scriptinterpreter>
          </command>
        </sequence>
        <uuid>7a4b117c-75cc-4b49-a504-703e27941170</uuid>
      </job>
    </joblist>
    

    Here is the script on its own, in case you need it separately:

    # get the latest execution
    latest_execution_status=$(curl -s --location --request GET 'http://localhost:4440/api/38/job/3fc55b67-b2a1-4e92-b949-7ed464041993/executions?max=1' \
    --header 'Accept: application/json' \
    --header 'X-Rundeck-Auth-Token: 6pSjAdcnNInfJ3Y8PHoC6EN7KGzjecEe' \
    --header 'Content-Type: application/json' | jq -r '.executions | .[].status')
    
    # now get the latest successful execution date
    latest_succeeded_execution_date=$(curl -s --location --request GET 'http://localhost:4440/api/38/job/3fc55b67-b2a1-4e92-b949-7ed464041993/executions?status=succeeded&max=1' \
    --header 'Accept: application/json' \
    --header 'X-Rundeck-Auth-Token: 6pSjAdcnNInfJ3Y8PHoC6EN7KGzjecEe' \
    --header 'Content-Type: application/json' | jq -r '.executions | .[]."date-ended".date' | sed 's/.$//')
    
    # the current date (to compare with the latest execution date)
    current_date=$(date --iso-8601=seconds)
    
    # comparing the two dates using dateutils
    time_between_latest_succeeded_and_current_time=$(dateutils.ddiff "$latest_succeeded_execution_date" "$current_date" -f "%M")
    
    # just for debug
    echo "LATEST EXECUTION STATE: $latest_execution_status"
    echo "LATEST SUCCEEDED EXECUTION DATE: $latest_succeeded_execution_date"
    echo "CURRENT DATE: $current_date"
    echo "MINUTES SINCE LATEST SUCCEEDED EXECUTION: $time_between_latest_succeeded_and_current_time minutes"
    
    # If it fails and the last success was more than 1 hour ago, we want it to notify us (via notification plugin)
    if [ "$latest_execution_status" = 'failed' ] && [ "$time_between_latest_succeeded_and_current_time" -gt 60 ]; then
      # disable main job schedule
      curl -s --location --request POST 'http://localhost:4440/api/38/job/3fc55b67-b2a1-4e92-b949-7ed464041993/schedule/disable' --header 'Accept: application/json' --header 'X-Rundeck-Auth-Token: 6pSjAdcnNInfJ3Y8PHoC6EN7KGzjecEe' --header 'Content-Type: application/json' 
    
      # with exit 1 this job fails and the notification "on failure" is triggered
      exit 1 
    else
        echo "all ok!"
    fi
    
    # all done.
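
    If installing dateutils is not an option, the minutes calculation can probably be done with GNU date alone. This is only a sketch of an alternative, not what the script above does: minutes_between is a hypothetical helper name, the sample timestamp is made up, and it assumes GNU coreutils date (for -d and --iso-8601). With this approach the trailing Z should be kept on the API timestamp so that it is parsed as UTC.

    ```shell
    #!/usr/bin/env bash
    # Sketch: minutes between two ISO-8601 timestamps using GNU date only.
    # minutes_between is a hypothetical helper, not part of Rundeck or dateutils.
    minutes_between() {
      local start_epoch end_epoch
      # Convert both timestamps to seconds since the epoch; fail if either is unparsable.
      start_epoch=$(date -d "$1" +%s) || return 1
      end_epoch=$(date -d "$2" +%s) || return 1
      # Integer division truncates to whole minutes.
      echo $(( (end_epoch - start_epoch) / 60 ))
    }

    # Example with a made-up timestamp in the format the executions API returns.
    last_success="2024-01-01T10:00:00Z"
    minutes_between "$last_success" "$(date --iso-8601=seconds)"
    ```

    The function returns a non-zero status when a timestamp cannot be parsed (for example, when the job has never succeeded and the value is empty), so the caller can decide how to treat that case.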
    

    Of course, values such as the job ID, host, and so on can be parameterized using job options.
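
    As a rough sketch of that parameterization: Rundeck exposes job option values to script steps as RD_OPTION_<NAME> environment variables, so the hardcoded URL parts could be read from options instead. The option names used here (rundeck_url, api_version, job_id) and the job_endpoint helper are my own assumptions and would have to be defined as options on the Monitor job; the defaults are simply the values from the script above.

    ```shell
    #!/usr/bin/env bash
    # Sketch: read connection details from Rundeck job options instead of hardcoding them.
    # The option names (rundeck_url, api_version, job_id) are hypothetical.
    rundeck_url="${RD_OPTION_RUNDECK_URL:-http://localhost:4440}"
    api_version="${RD_OPTION_API_VERSION:-38}"
    job_id="${RD_OPTION_JOB_ID:-3fc55b67-b2a1-4e92-b949-7ed464041993}"

    # Build an API endpoint for the monitored job; $1 is the path after the job id.
    job_endpoint() {
      echo "${rundeck_url}/api/${api_version}/job/${job_id}/$1"
    }

    # The two kinds of endpoint the monitor script calls.
    job_endpoint 'executions?max=1'
    job_endpoint 'schedule/disable'
    ```

    The curl calls would then use "$(job_endpoint 'executions?max=1')" in place of the literal URL, and the same job could monitor a different main job just by changing the option values.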