gitlab: Can you trigger a rerun of a failed CI job based on log contents

We use GitLab for our terraform lock files. And we commonly get errors acquiring the lock and such. It's clearly a capacity issue on their end. If we rerun the job it almost always passes.

So I want to setup something that looks at the log, and if it sees that specific error text, reruns the job. Looking at their docs and such I don't see that as a possibility. But I can't imagine others haven't implemented something to do this.

Solution

While GitLab does not offer log message detection as a retry feature, it is possible to build this solution using the features that are available, including retry:exit_codes: -- or even just do this retry handling entirely in your script.

The short version is this: you use your script logic to catch the specific error message and, when found, you either exit with a specific status code (which you will specify in retry:exit_codes:) or simply retry the script or command directly in the job script logic.

There's obviously a lot of ways to handle this, one specific implementation in bash (or similar shell) may look like this:

Choose a specific exit code and add it to your retry:exit_codes: list in your job YAML
Ensure your script does not have set -e set (or use set +e)
Tee the output of your command to a file
Check for nonzero (unsuccessful) exit code (if successful, just continue normally, potentially re-setting set -e if desired)
If the exit code is nonzero, search (parse or grep, etc) the output file for your specific error message
If the error message is found, exit the job using the exit code declared in step 1. Otherwise, exit using the normal exit code

myjob:
  variables:
    # sometimes required for correct exit code detection with some executors
    # uncomment if your script always exits with exit code 1
    # FF_USE_NEW_BASH_EVAL_STRATEGY: "1"
  retry:
    max: 4
    exit_codes:
      - 42
  script:
    - ./myscript.sh

This is untested code, but should explain the basic idea:

#!/usr/bin/env bash

# myscript.sh

set +e
set -o pipefail

err_msg="Error locking state"
special_code=42

terraform apply 2>&1 | tee -a output.txt

exit_code=$?

if [[ $exit_code -ne 0 ]]; then
    if grep -q "$err_msg" < output.txt; then
        exit $special_code
    else
        exit $exit_code
    fi
fi

set -e
# ...

Alternatively, just handle the retry logic entirely in bash, for example, as described in the Unix stackexchange question: Re-running a command in case of a specific error message.