Our system currently has 3 runners on 3 machines, all registered with the same tag, acr:
runner: #1 / host: #1 / tag: acr
runner: #2 / host: #2 / tag: acr
runner: #3 / host: #3 / tag: acr
The GitLab CI pipeline has 3 stages, with 1 job per stage:
stages:
  - stage-1
  - stage-2
  - stage-3

job-1:
  stage: stage-1
  tags:
    - acr
  script:
    - python run-script-1.py
  ...

job-2:
  stage: stage-2
  tags:
    - acr
  script:
    - python run-script-2.py
  ...

job-3:
  stage: stage-3
  tags:
    - acr
  script:
    - python run-script-3.py
  ...
Each job usually takes 7 mins to execute, making the whole pipeline take ~20 mins to complete.
Since the 3 jobs are independent and can run in parallel, we reassigned the runner tags:
runner: #1 / host: #1 / tag: acr-1
runner: #2 / host: #2 / tag: acr-2
runner: #3 / host: #3 / tag: acr-3
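Tags can be reassigned in the GitLab UI, or set when a runner is first registered. A registration sketch (the URL and token are placeholders, and the executor is an assumption):

```shell
# Hypothetical registration of runner #1 with its unique tag.
gitlab-runner register \
  --non-interactive \
  --url https://gitlab.example.com/ \
  --registration-token TOKEN \
  --executor shell \
  --tag-list acr-1
```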
The pipeline is refactored accordingly: it now has a single stage containing all 3 jobs, and each job is pinned to a unique runner tag:
stages:
  - stage-all

job-1:
  stage: stage-all
  tags:
    - acr-1
  script:
    - python run-script-1.py
  ...

job-2:
  stage: stage-all
  tags:
    - acr-2
  script:
    - python run-script-2.py
  ...

job-3:
  stage: stage-all
  tags:
    - acr-3
  script:
    - python run-script-3.py
  ...
Now, if all 3 runners are available, the pipeline takes ~7 mins to complete. The problem is that, on a bad day, 1 or 2 runners can be down for a while; any job pinned to an offline runner's tag gets stuck, which stalls the whole pipeline.
Is there a way to assign tags or arrange jobs so that, if enough runners are available, the jobs run concurrently, and if runners are in short supply, the jobs run sequentially?
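To make the goal concrete, the desired behavior is an ordinary greedy schedule: each queued job goes to whichever runner frees up first. A toy model of that behavior (the `makespan` helper is hypothetical, not GitLab code; each job is assumed to take 7 minutes as above):

```python
import heapq

def makespan(job_minutes, runners):
    """Total wall-clock time when idle runners greedily pick up queued jobs.

    Models what happens when all jobs share one tag: with enough runners
    they run concurrently; with fewer, the leftover jobs simply wait.
    """
    free_at = [0] * runners       # time at which each runner becomes free
    heapq.heapify(free_at)
    for duration in job_minutes:
        start = heapq.heappop(free_at)       # earliest-free runner takes the job
        heapq.heappush(free_at, start + duration)
    return max(free_at)

print(makespan([7, 7, 7], runners=3))  # 7  -> all three run concurrently
print(makespan([7, 7, 7], runners=2))  # 14 -> two waves of work
print(makespan([7, 7, 7], runners=1))  # 21 -> fully sequential
```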
For those who are interested: we solved the issue with a combination of the following configs:
- set the concurrent limit in each runner's config.toml to 1
- revert all 3 runners back to the shared tag acr
- tag all 3 jobs with acr (keeping the single-stage layout)

With these settings, if there are enough available runners, jobs are distributed evenly, thus accelerating the pipeline. Otherwise, unpicked jobs are queued and get executed sequentially once a runner completes its current work.
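Concretely, the final setup looks roughly like this. On each of the 3 hosts, the runner's config.toml (path varies by install; /etc/gitlab-runner/config.toml is typical for system installs) caps concurrency:

```toml
# Sketch of each host's runner config: one job at a time per runner.
concurrent = 1
```

And the pipeline keeps the single stage but returns to the shared tag, so any free runner can pick up any job:

```yaml
stages:
  - stage-all

job-1:
  stage: stage-all
  tags:
    - acr        # shared tag again: whichever runner is free takes it
  script:
    - python run-script-1.py

# job-2 and job-3 follow the same pattern with their own scripts
```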