Tags: amazon-web-services, amazon-ecs, autoscaling, aws-fargate, aws-auto-scaling

AWS ECS Fargate autoscaling with CloudFormation not working; CloudWatch insufficient data


Summary: I can't get ECS Fargate autoscaling to work with CloudFormation, even when duplicating a setup created manually in the console.

In AWS ECS I have manually set up a single cluster with a single service and a single task using a load balancer with autoscaling. To make testing easy I use ALBRequestCountPerTarget with a target value of 3 and scale-in/scale-out cooldowns of 30 seconds. This works! When I reload a page over and over, the service scales the number of tasks. Furthermore, even before autoscaling kicks in, in CloudWatch I can see the request count go up for the TargetTracking high and low alarms.

I'm trying to replicate this in CloudFormation. My CloudFormation stack produces what looks like an identical configuration. But there's one problem: the service never autoscales, and the two TargetTracking CloudWatch alarms never show any information; instead they continue to show "Insufficient data" (with no data points whatsoever) no matter how many times I reload the web page that is hitting the load balancer.

I have verified via the console that the service definition and autoscaling configuration are identical in every aspect I can think of. The CloudFormation scaling policy even creates the appropriate CloudWatch alarms for me; those, too, are configured identically to the ones I created manually via the console. They point to the correct load balancer and target group (the ones created by the CloudFormation template). But these CloudWatch alarms receive no data.

If I open the stack's load balancer, under Monitoring I can see it is receiving requests. Somehow this data is not getting to CloudWatch. In CloudWatch the appropriate log group was created, but its log stream only has Spring startup output; then again, that's all the other log group (from the manually-created cluster) has as well, so I don't see how that could be relevant. Besides, the CloudWatch data on which ALBRequestCountPerTarget is based should be coming from the load balancer, no?

Here is my log group definition in CloudFormation:

  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties: 
      LogGroupName: '/ecs/foo-bar'
      RetentionInDays: 400

And here is the task container definition log configuration:

          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: !Ref LogGroup
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: 'ecs'

I'm not sure if the log group is even relevant, though, since I understand these CloudWatch metrics are based on the load balancer, not the task container output.
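
(For anyone trying to reproduce this: the LoadBalancer and TargetGroup that the autoscaling policy below references are an application load balancer and an ip-target-type target group. A minimal sketch of their shape follows; the subnet, port, and VPC references are placeholders, not my actual values.)

  LoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: application
      Subnets:
        - !Ref PublicSubnetA  # placeholder
        - !Ref PublicSubnetB  # placeholder

  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8080              # placeholder container port
      Protocol: HTTP
      TargetType: ip          # Fargate (awsvpc) tasks require ip targets
      VpcId: !Ref VpcId       # placeholder; TG is attached to the LB via a Listener (not shown)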

Here's the autoscaling definition:

  AutoScalingTarget:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      ServiceNamespace: ecs
      ResourceId: !Sub "service/${ECSCluster}/${ECSService.Name}"
      ScalableDimension: ecs:service:DesiredCount
      MinCapacity: 1
      MaxCapacity: 3
      RoleARN: !Sub "arn:aws:iam::${AWS::AccountId}:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService"

  AutoScalingPolicy:
    Type: AWS::ApplicationAutoScaling::ScalingPolicy
    Properties:
      PolicyName: foo-bar-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId: !Ref AutoScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        PredefinedMetricSpecification:
          PredefinedMetricType: ALBRequestCountPerTarget
          ResourceLabel: !Sub "${LoadBalancer.LoadBalancerFullName}}/${TargetGroup.TargetGroupFullName}"
        ScaleInCooldown: 30
        ScaleOutCooldown: 30
        TargetValue: 3

Here's something strange, but probably unrelated: when I added autoscaling using the above snippet but with a PolicyName of "Foo Bar Policy" (with spaces), and went to Clusters > Services > Tasks, the per-task log under the "Logs" tab never showed up. The log group was still created and showed up under CloudWatch, and its stream still showed the container startup output. Changing the PolicyName to foo-bar-policy, without spaces, brought back the per-task startup log in the console. In any case I doubt this is even relevant to the nonfunctioning autoscaling, as I believe my autoscaling configuration is based on the load balancer requests.

Update: If I browse CloudWatch metrics, I can see a list of RequestCountPerTarget metrics. Two of them are for my manually-created and CloudFormation-created target groups, respectively:

  • targetgroup/manually-created/12345… RequestCountPerTarget
  • targetgroup/CloudFormation-created/abcde… RequestCountPerTarget

Both of them show graphed data, and for the CloudFormation-created target group I can make the value increase by reloading the web page linked to that load balancer + target group! So the data is getting into CloudWatch.

But the CloudWatch alarm for the CloudFormation stack still shows "insufficient data" (i.e. no data). That alarm indicates the correct load balancer and target group, and looks identical to the one resulting from the manual setup. Remember that AWS itself created both these alarms. Why isn't the CloudFormation one showing the data I can see in the metric?
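
(For the record, the auto-created TargetTracking alarms are keyed on the load balancer and target group full names. A hypothetical equivalent alarm, not part of my stack, with those dimensions spelled out explicitly would look like this:)

  DebugRequestCountAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: foo-bar-debug-request-count
      Namespace: AWS/ApplicationELB
      MetricName: RequestCountPerTarget
      Dimensions:
        - Name: LoadBalancer
          Value: !GetAtt LoadBalancer.LoadBalancerFullName
        - Name: TargetGroup
          Value: !GetAtt TargetGroup.TargetGroupFullName
      Statistic: Sum          # Sum is the meaningful statistic for this metric
      Period: 60
      EvaluationPeriods: 1
      Threshold: 3
      ComparisonOperator: GreaterThanThreshold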


Solution

  • The reason the CloudWatch alarm created with the CloudFormation stack couldn't see the actual CloudWatch metric values for the load balancer target group was a typo:

    ResourceLabel: !Sub "${LoadBalancer.LoadBalancerFullName}}/${TargetGroup.TargetGroupFullName}"
    

    I had inadvertently added an extra }. It should have been:

    ResourceLabel: !Sub "${LoadBalancer.LoadBalancerFullName}/${TargetGroup.TargetGroupFullName}"
    

    When I was comparing the manually-created and CloudFormation-created CloudWatch alarms, I finally noticed that the latter's reference to the target group ended with a }, while the actual target group's full name did not. In fact, no target group's full name could ever end in }. Yet AWS happily accepted this invalid value and created a CloudWatch alarm tied to a target group that didn't exist, using a reference that could never exist.

    Once I fixed the typo, everything worked. One would think that AWS might validate things here or there and help out a poor developer … (For a way to avoid this class of typo altogether, see the sketch below.)
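
    Postscript: one way to make this kind of typo structurally impossible is to assemble the ResourceLabel from !GetAtt values with !Join rather than interpolating a string with !Sub; then there are no literal braces to mistype. A sketch, using the same logical IDs as in the question:

    AutoScalingPolicy:
      Type: AWS::ApplicationAutoScaling::ScalingPolicy
      Properties:
        PolicyName: foo-bar-policy
        PolicyType: TargetTrackingScaling
        ScalingTargetId: !Ref AutoScalingTarget
        TargetTrackingScalingPolicyConfiguration:
          PredefinedMetricSpecification:
            PredefinedMetricType: ALBRequestCountPerTarget
            # No braces to mistype: join the two full names with a slash.
            ResourceLabel: !Join
              - "/"
              - - !GetAtt LoadBalancer.LoadBalancerFullName
                - !GetAtt TargetGroup.TargetGroupFullName
          ScaleInCooldown: 30
          ScaleOutCooldown: 30
          TargetValue: 3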