Our system currently has 3 runners on 3 machines, all registered with the same tag, acr:
runner: #1 / host: #1 / tag: acr
runner: #2 / host: #2 / tag: acr
runner: #3 / host: #3 / tag: acr
The GitLab CI pipeline has 3 stages, with 1 job per stage:
stages:
  - stage-1
  - stage-2
  - stage-3

job-1:
  stage: stage-1
  tags:
    - acr
  script:
    - python run-script-1.py
  ...

job-2:
  stage: stage-2
  tags:
    - acr
  script:
    - python run-script-2.py
  ...

job-3:
  stage: stage-3
  tags:
    - acr
  script:
    - python run-script-3.py
  ...
Each job usually takes 7 mins to execute, making the whole pipeline take ~20 mins to complete.
Since the 3 jobs are independent and can run in parallel, we reassigned the runner tags:
runner: #1 / host: #1 / tag: acr-1
runner: #2 / host: #2 / tag: acr-2
runner: #3 / host: #3 / tag: acr-3
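Tags can be reassigned in the GitLab UI, or set when a runner is first registered. A registration sketch (the URL and token are placeholders, and the executor is an assumption):

```shell
# Hypothetical registration of runner #1 with its unique tag.
gitlab-runner register \
  --non-interactive \
  --url https://gitlab.example.com/ \
  --registration-token TOKEN \
  --executor shell \
  --tag-list acr-1
```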
The pipeline is refactored accordingly: it now has a single stage containing all 3 jobs, and each job is pinned to a unique runner tag:
stages:
  - stage-all

job-1:
  stage: stage-all
  tags:
    - acr-1
  script:
    - python run-script-1.py
  ...

job-2:
  stage: stage-all
  tags:
    - acr-2
  script:
    - python run-script-2.py
  ...

job-3:
  stage: stage-all
  tags:
    - acr-3
  script:
    - python run-script-3.py
  ...
Now, if all 3 runners are available, the pipeline takes ~7 mins to complete. The problem is that, on a bad day, 1 or 2 runners can be down for a while; any job pinned to an offline runner's tag gets stuck, which stalls the whole pipeline.
Is there a way to assign tags or arrange jobs so that, if enough runners are available, the jobs run concurrently, and if runners are in short supply, the jobs run sequentially?
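To make the goal concrete, the desired behavior is an ordinary greedy schedule: each queued job goes to whichever runner frees up first. A toy model of that behavior (the `makespan` helper is hypothetical, not GitLab code; each job is assumed to take 7 minutes as above):

```python
import heapq

def makespan(job_minutes, runners):
    """Total wall-clock time when idle runners greedily pick up queued jobs.

    Models what happens when all jobs share one tag: with enough runners
    they run concurrently; with fewer, the leftover jobs simply wait.
    """
    free_at = [0] * runners       # time at which each runner becomes free
    heapq.heapify(free_at)
    for duration in job_minutes:
        start = heapq.heappop(free_at)       # earliest-free runner takes the job
        heapq.heappush(free_at, start + duration)
    return max(free_at)

print(makespan([7, 7, 7], runners=3))  # 7  -> all three run concurrently
print(makespan([7, 7, 7], runners=2))  # 14 -> two waves of work
print(makespan([7, 7, 7], runners=1))  # 21 -> fully sequential
```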
For those who are interested: we solved the issue with a combination of the following configs:
- set the concurrent limit in each runner's config.toml to 1
- revert all 3 runners back to the shared tag acr
- tag all 3 jobs with acr (keeping the single-stage layout)

With these settings, if there are enough available runners, jobs are distributed evenly, thus accelerating the pipeline. Otherwise, unpicked jobs are queued and get executed sequentially once a runner completes its current work.
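Concretely, the final setup looks roughly like this. On each of the 3 hosts, the runner's config.toml (path varies by install; /etc/gitlab-runner/config.toml is typical for system installs) caps concurrency:

```toml
# Sketch of each host's runner config: one job at a time per runner.
concurrent = 1
```

And the pipeline keeps the single stage but returns to the shared tag, so any free runner can pick up any job:

```yaml
stages:
  - stage-all

job-1:
  stage: stage-all
  tags:
    - acr        # shared tag again: whichever runner is free takes it
  script:
    - python run-script-1.py

# job-2 and job-3 follow the same pattern with their own scripts
```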