Tags: amazon-ec2, gitlab-ci-runner

GitLab Autoscaling Instance Executor keeps creating new instances all the time


I'm trying to understand how the autoscaling for GitLab Instance Executor works.

I have created a fleet manager, running on an AWS Ubuntu instance, which spins up new AWS Ubuntu runner instances to run jobs on. The runner instances are created, the manager connects to them successfully, and jobs run fine.

I want the following behaviour:

  • Runners only run 1 job and then are destroyed
  • No runners are kept up at any time. Runners are only created when there are jobs to run, and deleted afterwards.
  • The whole fleet should be able to handle 10 jobs concurrently (by spinning up 10 runners, if required)
  • If there are no jobs to run, no runners should be up

Here is my configuration:

concurrent = 10
[[runners]]
  output_limit = 50000
  [runners.autoscaler]
    plugin = "aws:latest"
    capacity_per_instance = 1
    max_use_count = 1
    max_instances = 10
    delete_instances_on_shutdown = true
    [runners.autoscaler.plugin_config] # plugin specific configuration (see plugin documentation)
      name = "gitlab-runner-linux"
      config_file = "/home/gitlab-runner/.aws/config"
      region = "eu-west-2"
    [runners.autoscaler.connector_config]
      username = "ubuntu"
      key_path = "/home/gitlab-runner/.ssh/runners.pem"
      use_external_addr = true
    [[runners.autoscaler.policy]]
      idle_count = 0
      idle_time = "120m0s"
      periods = ["* 8-19 * * mon-fri"]

What I am seeing is:

  • If I start with the ASG set to 0 desired count, and there are no jobs, nothing happens
  • When I launch a job, the desired count is set to 1 by the manager, and a new instance is created
  • The instance runs the job, and then it is destroyed
  • A new instance is created, and it stays up for 2 hours
  • The new instance is destroyed and another one is created, which is then destroyed after 2 hours
  • It keeps destroying and creating new instances every two hours

How can I configure the manager to NOT keep any instances up if there are no jobs? I thought the idle_count of 0 would do that, but it doesn't!

I have also tried to set the idle_time to 0s, but that results in instances being created and destroyed even faster. Here are the logs from the manager:

2024-05-23T09:05:09.684776+00:00 ip-10-0-51-245 gitlab-runner[33492]: increasing instances                                amount=1 group=aws/eu-west-2/gitlab-runner-linux runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:09.810977+00:00 ip-10-0-51-245 gitlab-runner[33492]: increasing instances response                       group=aws/eu-west-2/gitlab-runner-linux num_requested=1 num_successful=1 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:09.811044+00:00 ip-10-0-51-245 gitlab-runner[33492]: increase update                                     group=aws/eu-west-2/gitlab-runner-linux pending=1 requesting=0 runner=MQcjjD9Aq subsystem=taskscaler total_pending=1
2024-05-23T09:05:09.996388+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change                             capacity-info={"InstanceCount":1,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=0 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:15.289053+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance discovery                                  cause=requested group=aws/eu-west-2/gitlab-runner-linux id=i-0497953d425528938 runner=MQcjjD9Aq state=creating subsystem=taskscaler
2024-05-23T09:05:19.463395+00:00 ip-10-0-51-245 process-agent[1969]: 2024-05-23 09:05:19 UTC | PROCESS | INFO | (pkg/process/runner/runner.go:434 in UpdateRTStatus) | Detected 1 active clients, enabling real-time mode
2024-05-23T09:05:20.783908+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance update                                     group=aws/eu-west-2/gitlab-runner-linux id=i-0497953d425528938 runner=MQcjjD9Aq state=running subsystem=taskscaler
2024-05-23T09:05:21.288964+00:00 ip-10-0-51-245 gitlab-runner[33492]: ready                                               instance=i-0497953d425528938 runner=MQcjjD9Aq subsystem=taskscaler took=505.550737ms
2024-05-23T09:05:57.568785+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance pruned                                     group=aws/eu-west-2/gitlab-runner-linux id=i-0bf7f9f9198f6bf45 lifetime=1m3.581813161s runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:58.023938+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change                             capacity-info={"InstanceCount":1,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":0,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=-1 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:58.024025+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance marked for removal                         instance=i-0497953d425528938 reason=instance exceeded max idle time runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:58.568862+00:00 ip-10-0-51-245 gitlab-runner[33492]: decreasing instances                                amount=1 group=aws/eu-west-2/gitlab-runner-linux runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:58.787138+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance update                                     group=aws/eu-west-2/gitlab-runner-linux id=i-0497953d425528938 runner=MQcjjD9Aq state=deleting subsystem=taskscaler
2024-05-23T09:05:59.024277+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change                             capacity-info={"InstanceCount":0,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=1 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:05:59.901087+00:00 ip-10-0-51-245 gitlab-runner[33492]: increasing instances                                amount=1 group=aws/eu-west-2/gitlab-runner-linux runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:00.024885+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change                             capacity-info={"InstanceCount":1,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=0 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:00.092743+00:00 ip-10-0-51-245 gitlab-runner[33492]: increasing instances response                       group=aws/eu-west-2/gitlab-runner-linux num_requested=1 num_successful=1 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:00.092879+00:00 ip-10-0-51-245 gitlab-runner[33492]: increase update                                     group=aws/eu-west-2/gitlab-runner-linux pending=1 requesting=0 runner=MQcjjD9Aq subsystem=taskscaler total_pending=1
2024-05-23T09:06:07.763403+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance discovery                                  cause=requested group=aws/eu-west-2/gitlab-runner-linux id=i-0b0d5da76063b53cd runner=MQcjjD9Aq state=creating subsystem=taskscaler
2024-05-23T09:06:08.830624+00:00 ip-10-0-51-245 agent[1963]: 2024-05-23 09:06:08 UTC | CORE | INFO | (pkg/serializer/serializer.go:455 in SendProcessesMetadata) | Sent processes metadata payload, size: 1415 bytes.
2024-05-23T09:06:13.210473+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance update                                     group=aws/eu-west-2/gitlab-runner-linux id=i-0b0d5da76063b53cd runner=MQcjjD9Aq state=running subsystem=taskscaler
2024-05-23T09:06:13.621823+00:00 ip-10-0-51-245 gitlab-runner[33492]: ready                                               instance=i-0b0d5da76063b53cd runner=MQcjjD9Aq subsystem=taskscaler took=411.63527ms
2024-05-23T09:06:41.579925+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance pruned                                     group=aws/eu-west-2/gitlab-runner-linux id=i-0497953d425528938 lifetime=1m26.291104915s runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:42.050144+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change                             capacity-info={"InstanceCount":1,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":0,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=-1 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:42.050255+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance marked for removal                         instance=i-0b0d5da76063b53cd reason=instance exceeded max idle time runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:42.580989+00:00 ip-10-0-51-245 gitlab-runner[33492]: decreasing instances                                amount=1 group=aws/eu-west-2/gitlab-runner-linux runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:42.906422+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance update                                     group=aws/eu-west-2/gitlab-runner-linux id=i-0b0d5da76063b53cd runner=MQcjjD9Aq state=deleting subsystem=taskscaler
2024-05-23T09:06:43.050667+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change                             capacity-info={"InstanceCount":0,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=1 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:43.984642+00:00 ip-10-0-51-245 gitlab-runner[33492]: increasing instances                                amount=1 group=aws/eu-west-2/gitlab-runner-linux runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:44.051850+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change                             capacity-info={"InstanceCount":1,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=0 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:44.170735+00:00 ip-10-0-51-245 gitlab-runner[33492]: increasing instances response                       group=aws/eu-west-2/gitlab-runner-linux num_requested=1 num_successful=1 runner=MQcjjD9Aq subsystem=taskscaler
2024-05-23T09:06:44.170805+00:00 ip-10-0-51-245 gitlab-runner[33492]: increase update                                     group=aws/eu-west-2/gitlab-runner-linux pending=1 requesting=0 runner=MQcjjD9Aq subsystem=taskscaler total_pending=1
2024-05-23T09:06:49.621586+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance discovery                                  cause=requested group=aws/eu-west-2/gitlab-runner-linux id=i-07747636aea823e91 runner=MQcjjD9Aq state=creating subsystem=taskscaler
2024-05-23T09:06:56.171958+00:00 ip-10-0-51-245 gitlab-runner[33492]: instance update                                     group=aws/eu-west-2/gitlab-runner-linux id=i-07747636aea823e91 runner=MQcjjD9Aq state=running subsystem=taskscaler
2024-05-23T09:06:56.612399+00:00 ip-10-0-51-245 gitlab-runner[33492]: ready                                               instance=i-07747636aea823e91 runner=MQcjjD9Aq subsystem=taskscaler took=440.792863ms
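
To make the loop easier to see, the `required scaling change` lines can be pulled out of the log and reduced to a few fields. The sketch below is illustrative only: the helper names `scaling_decisions` and `unexplained_scale_ups` are hypothetical, the three sample lines are copied from the log above with the padding collapsed, and treating `Acquired` as "capacity claimed by jobs" is an assumption based on the field name, not on the taskscaler source.

```python
import json
import re

# Matches the taskscaler "required scaling change" lines shown in the log above.
SCALING_RE = re.compile(
    r"required scaling change\s+capacity-info=(\{.*?\})\s+required=(-?\d+)"
)


def scaling_decisions(lines):
    """Extract timestamp, capacity counters and the requested change from each line."""
    decisions = []
    for line in lines:
        match = SCALING_RE.search(line)
        if match is None:
            continue  # not a scaling decision (e.g. "instance update")
        info = json.loads(match.group(1))
        decisions.append({
            "time": line.split(" ", 1)[0],
            "instances": info["InstanceCount"],
            "unavailable": info["UnavailableCapacity"],
            "acquired": info["Acquired"],
            "required": int(match.group(2)),
        })
    return decisions


def unexplained_scale_ups(decisions):
    """Scale-ups requested while no capacity is acquired (assumed: no job is waiting)."""
    return [d for d in decisions if d["required"] > 0 and d["acquired"] == 0]


SAMPLE = [
    '2024-05-23T09:05:58.023938+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change capacity-info={"InstanceCount":1,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":0,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=-1 runner=MQcjjD9Aq subsystem=taskscaler',
    '2024-05-23T09:05:59.024277+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change capacity-info={"InstanceCount":0,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=1 runner=MQcjjD9Aq subsystem=taskscaler',
    '2024-05-23T09:06:00.024885+00:00 ip-10-0-51-245 gitlab-runner[33492]: required scaling change capacity-info={"InstanceCount":1,"MaxInstanceCount":10,"Acquired":0,"UnavailableCapacity":1,"Pending":0,"Reserved":0,"IdleCount":0,"ScaleFactor":0,"ScaleFactorLimit":0,"CapacityPerInstance":1} required=0 runner=MQcjjD9Aq subsystem=taskscaler',
]

decisions = scaling_decisions(SAMPLE)
for d in decisions:
    print(d["time"], "instances:", d["instances"],
          "unavailable:", d["unavailable"], "required:", d["required"])
print("scale-ups without acquired capacity:", len(unexplained_scale_ups(decisions)))
```

In this excerpt the scale-up (`required=1`) is requested about one second after the previous instance is marked for removal, at a moment when `InstanceCount` is 0, `Acquired` is 0 and `UnavailableCapacity` is 1. That looks as if the capacity of the instance being deleted is what triggers the replacement, although I can't confirm this from the logs alone.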

Solution

  • gitlab-org/fleeting / plugins/aws issue 64 (that you mention) refers to fleeting/taskscaler issue 29 and gitlab-org/gitlab-runner MR 4818 ("Upgrade fleeting and taskscaler to fix instance churn/runaway"), which has just been merged (June 2024).

    Arran Walker (ajwalker) mentioned:

    I'm waiting for gitlab-org/gitlab-runner!4818 (merged) to be merged, and then we'll release a patched version of Runner.


    Since this is taking place with GitLab 17.1 (June 2024), it coincides with this release announcement:

    GitLab Runner Autoscaler is generally available

    In earlier versions of GitLab, some customers needed an autoscaling solution for GitLab Runner on virtual machine instances on public cloud platforms. These customers had to rely on the legacy Docker Machine executor or custom solutions stitched together by using cloud provider technologies.

    Today, we’re pleased to announce the general availability of the GitLab Runner Autoscaler. The GitLab Runner Autoscaler is composed of GitLab-developed taskscaler and fleeting technologies and the cloud provider plugin for Google Compute Engine.

    https://about.gitlab.com/images/17_1/runner_fleeting_ga.png -- GitLab Runner Autoscaler is generally available

    See Documentation and Issue.

    The AWS fleeting plugin for GitLab, which uses AWS Auto Scaling groups, is still in Beta.