I have a code base that uses Fortran modules. Under normal circumstances it builds without problems: CMake takes care of ordering the module files.
However, on a GitLab runner it SOMETIMES happens that CMake does NOT order the Fortran modules by dependency but alphabetically instead, which then leads to a build failure.
The problem seems to occur at random. I have a branch that built in the CI. After adding a commit that modified a utility script not involved in the build in any way, I ran into this problem. There is no difference in the output of the CMake configure step.
I use a matrix configuration in the CI to test different configurations. I found that I could trigger the problem by adding another MPI version (e.g. openmpi/4.1.6). Without that version, it built. With it added to the matrix, ALL configurations showed the problem.
stages:
  - configure
  - build
  - test

.basic_config:
  tags:
    - hpc_runner
  variables:
    # load submodules
    GIT_SUBMODULE_STRATEGY: recursive

.config_matrix:
  extends: .basic_config
  # define job matrix
  parallel:
    matrix:
      - COMPILER: [gcc/9.4.0]
        PARALLELIZATION: [serial, openmpi/3.1.6]
        TYPE: [option1, option2]
        BUILD_TYPE: [debug, release]
      - COMPILER: [gcc/10.3.0, intel/19.0.5]
        PARALLELIZATION: [serial]
        TYPE: [option2]
        BUILD_TYPE: [debug]

###############################################################################
# setup script
# These commands will run before each job.
before_script:
  - set -e
  - uname -a
  - |
    if [[ "$(uname)" = "Linux" ]]; then
      export THREADS=$(nproc --all)
    elif [[ "$(uname)" = "Darwin" ]]; then
      export THREADS=$(sysctl -n hw.ncpu)
    else
      echo "Unknown platform. Setting THREADS to 1."
      export THREADS=1
    fi
  # load environment
  - source scripts/build/load_environment $COMPILER $BUILD_TYPE $TYPE $PARALLELIZATION
  # set path for build folder
  - build_path=build/$COMPILER/$PARALLELIZATION/$TYPE/$BUILD_TYPE

configure:
  stage: configure
  extends: .config_matrix
  script:
    - mkdir -p $build_path
    - cd $build_path
    - $CMAKE_COMMAND
  artifacts:
    paths:
      - build
    expire_in: 1 days

###############################################################################
# build script
build:
  stage: build
  extends: .config_matrix
  script:
    - cd $build_path
    - make
  artifacts:
    paths:
      - build
    expire_in: 1 days
  needs:
    - configure

###############################################################################
# test
test:
  stage: test
  extends: .config_matrix
  script:
    - cd $build_path
    - ctest --output-on-failure
  needs:
    - build
The runner runs on an HPC machine with a complex setup, and I am not too familiar with the exact configuration. I have contacted the admin about this problem, but wanted to see whether anybody else has run into this before and has solutions or hints on what is going on.
With help from our admin, I figured it out.
The problem comes from CMake using absolute paths. The runner actually consists of several runners for parallel jobs, each using a different prefix path, e.g. /runner/001/ or /runner/012/. So when I run configure on a specific runner, CMake saves that prefix path in the configuration.
Now, in the build stage, there is no guarantee that the same configuration runs on the same runner. However, since the makefiles contain absolute paths, make tries to access the folders under the configure runner's prefix. What it finds there can be anything: a non-existent directory, old files from previous pipelines, or the correct files downloaded by another matrix case.
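To check whether a given build job is affected, the path recorded at configure time can be printed next to the current checkout path. The following is only a sketch of the build job with two extra diagnostic lines (artifacts omitted for brevity). It relies on the CMAKE_HOME_DIRECTORY entry that CMake writes to CMakeCache.txt and on GitLab's predefined CI_PROJECT_DIR variable, and it assumes that $build_path is located inside the checkout directory, as in the config above.

```yaml
build:
  stage: build
  extends: .config_matrix
  script:
    - cd $build_path
    # source directory that CMake recorded during the configure job
    - grep CMAKE_HOME_DIRECTORY CMakeCache.txt
    # checkout directory of the current job; a different prefix than the
    # line above means make will look into another runner's folder
    - echo "$CI_PROJECT_DIR"
    - make
  needs:
    - configure
```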
The only fix I can currently see is to run everything on the same runner in one stage, to avoid the roulette of prefix paths. If anybody has a different idea, or if there is a way to pin a specific matrix case to a specific runner prefix, please comment.
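As a minimal sketch of that single-job approach, the three jobs could be merged into one, so that configure, build, and test always share the same checkout directory and therefore the same absolute paths. The job name build_and_test is made up for illustration; everything else is taken from the config in the question. I have not verified this on the runner described above.

```yaml
# replaces the separate configure, build and test jobs
build_and_test:
  stage: build
  extends: .config_matrix
  script:
    - mkdir -p $build_path
    - cd $build_path
    # configure, build and test in the same job, hence under the same prefix
    - $CMAKE_COMMAND
    - make
    - ctest --output-on-failure
```

The trade-off is that a failure is no longer attributed to a separate stage in the pipeline view, and the build tree no longer needs to be passed between jobs as an artifact.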