Search code examples
gitlabgitlab-cicicd

gitlab: Can you trigger a rerun of a failed CI job based on log contents


We use GitLab for our terraform lock files. And we commonly get errors acquiring the lock and such. It's clearly a capacity issue on their end. If we rerun the job it almost always passes.

So I want to setup something that looks at the log, and if it sees that specific error text, reruns the job. Looking at their docs and such I don't see that as a possibility. But I can't imagine others haven't implemented something to do this.


Solution

  • While GitLab does not offer log message detection as a retry feature, it is possible to build this solution using the features that are available, including retry:exit_codes: -- or even just do this retry handling entirely in your script.

    The short version is this: you use your script logic to catch the specific error message and, when found, you either exit with a specific status code (which you will specify in retry:exit_codes:) or simply retry the script or command directly in the job script logic.

    There's obviously a lot of ways to handle this, one specific implementation in bash (or similar shell) may look like this:

    1. Choose a specific exit code and add it to your retry:exit_codes: list in your job YAML
    2. Ensure your script does not have set -e set (or use set +e)
    3. Tee the output of your command to a file
    4. Check for nonzero (unsuccessful) exit code (if successful, just continue normally, potentially re-setting set -e if desired)
    5. If the exit code is nonzero, search (parse or grep, etc) the output file for your specific error message
    6. If the error message is found, exit the job using the exit code declared in step 1. Otherwise, exit using the normal exit code
    myjob:
      variables:
        # sometimes required for correct exit code detection with some executors
        # uncomment if your script always exits with exit code 1
        # FF_USE_NEW_BASH_EVAL_STRATEGY: "1"
      retry:
        max: 4
        exit_codes:
          - 42
      script:
        - ./myscript.sh
    

    This is untested code, but should explain the basic idea:

    #!/usr/bin/env bash
    
    # myscript.sh
    
    set +e
    set -o pipefail
    
    err_msg="Error locking state"
    special_code=42
    
    terraform apply 2>&1 | tee -a output.txt
    
    exit_code=$?
    
    if [[ $exit_code -ne 0 ]]; then
        if grep -q "$err_msg" < output.txt; then
            exit $special_code
        else
            exit $exit_code
        fi
    fi
    
    set -e
    # ...
    

    Alternatively, just handle the retry logic entirely in bash, for example, as described in the Unix stackexchange question: Re-running a command in case of a specific error message.