Tags: amazon-web-services, terraform, amazon-ecs

ECS scale-in with capacity provider to minimum capacity of ASG


I have an ECS cluster configured with a target tracking policy on the service and a capacity provider that manages ASG autoscaling.

In my cluster, the minimum and maximum task counts of the service match the minimum and maximum capacity of the ASG.

When a scale-in action is performed, the tasks decrease to the minimum count, but the ASG still keeps one or more unused EC2 instances (instances with no task placed on them).

How can I configure my cluster with a capacity provider so that scale-in reduces the ASG to its minimum capacity?


# CLUSTER
resource "aws_ecs_cluster" "default" {
  name               = local.name
  capacity_providers = [aws_ecs_capacity_provider.asg.name]
  tags               = local.tags

  default_capacity_provider_strategy {
    base = 0
    capacity_provider = aws_ecs_capacity_provider.asg.name
    weight = 1
  }
}

# SERVICE
resource "aws_ecs_service" "ecs_service" {
  name            = "${local.name}-service"
  cluster         = aws_ecs_cluster.default.id
  task_definition = aws_ecs_task_definition.ecs_task.arn
  health_check_grace_period_seconds = 60

  deployment_maximum_percent         = 50
  deployment_minimum_healthy_percent = 100


  load_balancer {
    target_group_arn = element(module.aws-alb-common-module.target_group_arns, 1)
    container_name   = local.name
    container_port   = 8080
  }

  lifecycle {
    ignore_changes = [desired_count, task_definition]
  }


}

# CAPACITY PROVIDER
resource "aws_ecs_capacity_provider" "asg" {
  name = aws_autoscaling_group.ecs_nodes.name

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.ecs_nodes.arn
    managed_termination_protection = "DISABLED"

    managed_scaling {
      maximum_scaling_step_size = 10
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      target_capacity           = 100
    }
  }
}

# SERVICE AUTOSCALING POLICY

resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 20
  min_capacity       = 2
  resource_id        = "service/${local.name}/${aws_ecs_service.ecs_service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_policy" {
  name = "${local.name}-scale-policy"
  policy_type = "TargetTrackingScaling"
  resource_id = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace = aws_appautoscaling_target.ecs_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    target_value = 2

  }
}

# ASG
resource "aws_autoscaling_group" "ecs_nodes" {
  name_prefix           = "${local.name}-node"
  max_size              = 20
  min_size              = 2
  vpc_zone_identifier   = local.subnets_ids
  protect_from_scale_in = false

  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage_above_base_capacity = local.spot
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.node.id
        version            = "$Latest"
      }

      dynamic "override" {
        for_each = local.instance_types
        content {
          instance_type     = override.key
          weighted_capacity = override.value
        }
      }
    }
  }

  lifecycle {
    create_before_destroy = true
  }

  tag {
    key                 = "AmazonECSManaged"
    propagate_at_launch = true
    value               = ""
  }
}

Solution

  • The cause is likely that target_value = 2 in the target_tracking_scaling_policy_configuration block is the CPU usage trigger level (percent), not the minimum capacity. The instance is probably being kept alive by background processes using small amounts of CPU.

    By the way, the managed_termination_protection setting is probably worth re-enabling.
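
    As a minimal sketch (reusing the resource names from your config and showing only the arguments that change), enabling it also requires scale-in protection on the ASG itself, since ECS manages the per-instance protection flag:

    resource "aws_autoscaling_group" "ecs_nodes" {
      # ... other arguments as in your config ...
      protect_from_scale_in = true
    }

    resource "aws_ecs_capacity_provider" "asg" {
      name = aws_autoscaling_group.ecs_nodes.name

      auto_scaling_group_provider {
        auto_scaling_group_arn         = aws_autoscaling_group.ecs_nodes.arn
        managed_termination_protection = "ENABLED"

        managed_scaling {
          maximum_scaling_step_size = 10
          minimum_scaling_step_size = 1
          status                    = "ENABLED"
          target_capacity           = 100
        }
      }
    }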

    Update in response to comments on 25/09:

    Ok, it's entirely possible I'm wrong here (especially as I haven't used this feature myself yet), and if so I'm very happy to learn from it.

    But this is how I read the mentioned documentation in relation to your config: the key phrase is "The target capacity value is used as the target value for the CloudWatch metric used in the Amazon ECS-managed target tracking scaling policy." The CloudWatch metric you have selected is ECSServiceAverageCPUUtilization, which is discussed at "How is the ECSServiceAverageCPUUtilization metric calculated?". So the target_value = 2 you have configured means 2% average CPU utilisation.

    I admit I mistakenly assumed the CPU metric was an EC2-instance-level average. But in either case, having your trigger value set to 2% CPU is likely to cause or maintain scale-out when none is needed (see the sketch at the end of this answer for a more conventional value).

    It's also possible you've found the simple explanation for the behaviour you're seeing, i.e. the "but this behavior is not guaranteed at all times" statement. However, I suspect that statement applies more to the extreme example of a 100% target, where one can expect to see anomalies, just as they can be expected at the similarly extreme 2%.
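
    If that reading is correct, a more conventional CPU target for the service policy would look something like the sketch below. This reuses the aws_appautoscaling_policy resource from your config with only target_value changed; 75% is purely an illustrative value, so pick whatever headroom suits your workload:

    resource "aws_appautoscaling_policy" "ecs_policy" {
      name               = "${local.name}-scale-policy"
      policy_type        = "TargetTrackingScaling"
      resource_id        = aws_appautoscaling_target.ecs_target.resource_id
      scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
      service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

      target_tracking_scaling_policy_configuration {
        predefined_metric_specification {
          predefined_metric_type = "ECSServiceAverageCPUUtilization"
        }

        # Scale out when average service CPU rises above ~75% and scale in when
        # it falls below; the capacity provider can then release idle instances.
        target_value = 75
      }
    }

    With the task count allowed to drop back to its minimum, the capacity provider's target_capacity = 100 should then let managed scaling shrink the ASG towards its minimum as instances become empty.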