amazon-web-services · amazon-ec2 · amazon-cloudwatch · autoscaling · cloudwatch-alarms

AWS CloudWatch Alarm to add capacity to EC2 autoscaling group has been in alarm forever


I set a CloudWatch alarm to add 1 capacity unit to my EC2 Auto Scaling group when the cluster's memory reservation is > 70%. The alarm was triggered at the right moment, but it has now been in the ALARM state for 16+ hours with no change at all to the Auto Scaling group. What could possibly be going wrong?

Here's my ECS CloudFormation template:

ECSCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Ref EnvironmentName

ECSAutoScalingGroup:
  DependsOn: ECSCluster
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier: !Ref Subnets
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: !Ref ClusterMinSize
    MaxSize: !Ref ClusterMaxSize
    DesiredCapacity: !Ref ClusterDesiredCapacity
  CreationPolicy:
    ResourceSignal:
      Timeout: PT15M
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 1
      MaxBatchSize: 1
      PauseTime: PT15M
      SuspendProcesses:
        - HealthCheck
        - ReplaceUnhealthy
        - AZRebalance
        - AlarmNotification
        - ScheduledActions
      WaitOnResourceSignals: true

ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '1'

MemoryReservationAlarmHigh:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '70'
    AlarmDescription: Alarm if Cluster Memory Reservation is too high
    Period: '60'
    AlarmActions:
    - Ref: ScaleUpPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: GreaterThanThreshold
    MetricName: MemoryReservation

ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '-1'

MemoryReservationAlarmLow:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '30'
    AlarmDescription: Alarm if Cluster Memory Reservation is too Low
    Period: '60'
    AlarmActions:
    - Ref: ScaleDownPolicy
    Namespace: AWS/ECS
    Dimensions:
    - Name: ClusterName
      Value: !Ref ECSCluster
    ComparisonOperator: LessThanThreshold
    MetricName: MemoryReservation

ECSLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    KeyName: !If [IsProd, !Ref 'AWS::NoValue', !Ref KeyName]
    ImageId: !Ref ECSAMI
    InstanceType: !Ref InstanceType
    SecurityGroups:
      - !Ref SecurityGroup
    IamInstanceProfile: !Ref ECSInstanceProfile
    UserData:
      "Fn::Base64": !Sub |
        #!/bin/bash
        source /etc/profile.d/proxy.sh
        yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        yum install -y https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
        yum install -y aws-cfn-bootstrap hibagent
        cat >> /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml <<EOF
        [proxy]
            http_proxy="${!http_proxy}"
            https_proxy="${!https_proxy}"
            no_proxy="${!no_proxy}"
        EOF
        /opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        /opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
        /usr/bin/enable-ec2-spot-hibernation

  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            collectd: []

        commands:
          01_add_instance_to_cluster:
            command: !Sub echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
          02_enable_cloudwatch_agent:
            command: !Sub /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:${ECSCloudWatchParameter} -s
        files:
          /etc/cfn/cfn-hup.conf:
            mode: 000400
            owner: root
            group: root
            content: !Sub |
              [main]
              stack=${AWS::StackId}
              region=${AWS::Region}

          /etc/cfn/hooks.d/cfn-auto-reloader.conf:
            content: !Sub |
              [cfn-auto-reloader-hook]
              triggers=post.update
              path=Resources.ECSLaunchConfiguration.Metadata.AWS::CloudFormation::Init
              action=/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration

        services:
          sysvinit:
            cfn-hup:
              enabled: true
              ensureRunning: true
              files:
                - /etc/cfn/cfn-hup.conf
                - /etc/cfn/hooks.d/cfn-auto-reloader.conf

# This IAM Role is attached to all of the ECS hosts. It is based on the default role
# published here:
# http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
#
# You can add other IAM policy statements here to allow access from your ECS hosts
# to other AWS services. Please note that this role will be used by ALL containers
# running on the ECS host.

ECSRole:
  Type: AWS::IAM::Role
  Properties:
    Path: /
    RoleName: !Sub ${EnvironmentName}-ECSRole-${AWS::Region}
    AssumeRolePolicyDocument: |
      {
          "Statement": [{
              "Action": "sts:AssumeRole",
              "Effect": "Allow",
              "Principal": {
                  "Service": "ec2.amazonaws.com"
              }
          }]
      }
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: ecs-service
        PolicyDocument: |
          {
              "Statement": [{
                  "Effect": "Allow",
                  "Action": [
                      "ecs:CreateCluster",
                      "ecs:DeregisterContainerInstance",
                      "ecs:DiscoverPollEndpoint",
                      "ecs:Poll",
                      "ecs:RegisterContainerInstance",
                      "ecs:StartTelemetrySession",
                      "ecs:Submit*",
                      "ecr:BatchCheckLayerAvailability",
                      "ecr:BatchGetImage",
                      "ecr:GetDownloadUrlForLayer",
                      "ecr:GetAuthorizationToken"
                  ],
                  "Resource": "*"
              }]
          }

ECSInstanceProfile:
  Type: AWS::IAM::InstanceProfile
  Properties:
    Path: /
    Roles:
      - !Ref ECSRole

ECSServiceAutoScalingRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        Action:
          - "sts:AssumeRole"
        Effect: Allow
        Principal:
          Service:
            - application-autoscaling.amazonaws.com
    Path: /
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
    Policies:
      - PolicyName: ecs-service-autoscaling
        PolicyDocument:
          Statement:
            Effect: Allow
            Action:
              - application-autoscaling:*
              - cloudwatch:DescribeAlarms
              - cloudwatch:PutMetricAlarm
              - ecs:DescribeServices
              - ecs:UpdateService
            Resource: "*"

ECSCloudWatchParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: CloudWatch Log configs for ECS cluster
    Name: !Sub AmazonCloudWatch-${ECSCluster}-ECS
    Type: String
    Value: !Sub |
      {
        "logs": {
          "force_flush_interval": 5,
          "logs_collected": {
            "files": {
              "collect_list": [
                {
                  "file_path": "/var/log/messages",
                  "log_group_name": "${ECSCluster}/var/log/messages",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%b %d %H:%M:%S"
                },
                {
                  "file_path": "/var/log/dmesg",
                  "log_group_name": "${ECSCluster}/var/log/dmesg",
                  "log_stream_name": "{instance_id}"
                },
                {
                  "file_path": "/var/log/docker",
                  "log_group_name": "${ECSCluster}/var/log/docker",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%S.%f"
                },
                {
                  "file_path": "/var/log/ecs/ecs-init.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-init.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/ecs-agent.log.*",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-agent.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/audit.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/audit.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                }
              ]
            }
          }
        },
        "metrics": {
          "append_dimensions": {
            "AutoScalingGroupName": "${!aws:AutoScalingGroupName}",
            "InstanceId": "${!aws:InstanceId}",
            "InstanceType": "${!aws:InstanceType}"
          },
          "metrics_collected": {
            "collectd": {
              "metrics_aggregation_interval": 60
            },
            "disk": {
              "measurement": [
                "used_percent"
              ],
              "metrics_collection_interval": 60,
              "resources": [
                "/"
              ]
            },
            "mem": {
              "measurement": [
                "mem_used_percent"
              ],
              "metrics_collection_interval": 60
            },
            "statsd": {
              "metrics_aggregation_interval": 60,
              "metrics_collection_interval": 10,
              "service_address": ":8125"
            }
          }
        }
      }

ECSClusterParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Cluster
    Name: !Sub /${EnvironmentName}/ecs-cluster
    Type: String
    Value: !Ref ECSCluster

ECSServiceAutoScalingRoleParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Service ASG Role
    Name: !Sub /${EnvironmentName}/ecs-service-asg-role
    Type: String
    Value: !GetAtt ECSServiceAutoScalingRole.Arn

The Alarm activity history:

2019-12-26 11:40:54 Action  Successfully executed action arn:aws:autoscaling:ap-southeast-2:031539715286:scalingPolicy:95e836b6-2f56-498d-b931-7ec4184bedc4:autoScalingGroupName/ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9:policyName/ECS-UEBZA8GAP8S7-ScaleUpPolicy-17LUWE42DC7EO
2019-12-26 11:40:54 State update  Alarm updated from OK to In alarm
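
(The same history can also be pulled with the AWS CLI; a minimal sketch, with a placeholder alarm name since CloudFormation generates the physical name:)

    # Placeholder alarm name: substitute the physical name CloudFormation
    # generated for MemoryReservationAlarmHigh in your stack.
    aws cloudwatch describe-alarm-history \
      --alarm-name "<your-MemoryReservationAlarmHigh-name>" \
      --history-item-type Action \
      --region ap-southeast-2 \
      --query 'AlarmHistoryItems[].[Timestamp,HistorySummary]' \
      --output text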

Solution

  • Make sure no scaling processes are suspended on the Auto Scaling group. A suspended AlarmNotification process means incoming alarms won't trigger scaling policies, and a suspended Launch process means nothing will be launched even if the desired capacity goes up. (Note that your UpdatePolicy suspends AlarmNotification during rolling updates; if an update was interrupted, those processes can stay suspended.) See the CLI sketch after this list for commands to check this and the items below.

    Other common issues that can cause this:

    • If you're using instance weights and the policy increases desired capacity by 1, but the lowest weight isn't 1, the group may never be able to scale.

    • Make sure no other scaling policies are being triggered that might override this one.

    • Check the activity history to make sure health-check replacements aren't constantly happening; each one starts a cooldown (the 5-minute default, since no default cooldown is set on the ASG, only on the scaling policy) that blocks simple scaling policies.

    • Make sure the desired capacity isn't already at the group's MaxSize.

    • In addition to the alarm being triggered, make sure the alarm history shows that the Auto Scaling 'action' was executed. (The action actually fires every minute the alarm stays in the ALARM state, no matter what your evaluation settings are, but only the first one is posted to the alarm history.)

    • Check the ASG activity history for launch failures; these are especially common with Spot Instances, and after enough failures the ASG eventually enters a backoff state. Any manual update to the group resets this backoff.
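
    A minimal CLI sketch for working through these checks, assuming the group name from the alarm history above (substitute your own group name and region):

      # Group name taken from the alarm history in the question; replace with your own.
      ASG="ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9"
      REGION="ap-southeast-2"

      # 1. Any suspended processes? AlarmNotification or Launch here would explain the symptom.
      aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "$ASG" --region "$REGION" \
        --query 'AutoScalingGroups[0].SuspendedProcesses'

      # If a stuck rolling update left processes suspended, resume them:
      aws autoscaling resume-processes \
        --auto-scaling-group-name "$ASG" --region "$REGION" \
        --scaling-processes AlarmNotification Launch

      # 2. Is the desired capacity already at MaxSize?
      aws autoscaling describe-auto-scaling-groups \
        --auto-scaling-group-names "$ASG" --region "$REGION" \
        --query 'AutoScalingGroups[0].[MinSize,DesiredCapacity,MaxSize]'

      # 3. Are other scaling policies attached that could override this one?
      aws autoscaling describe-policies \
        --auto-scaling-group-name "$ASG" --region "$REGION" \
        --query 'ScalingPolicies[].[PolicyName,AdjustmentType,ScalingAdjustment,Cooldown]'

      # 4. Recent activity: look for constant health-check replacements or launch failures.
      aws autoscaling describe-scaling-activities \
        --auto-scaling-group-name "$ASG" --region "$REGION" \
        --max-items 20 \
        --query 'Activities[].[StartTime,StatusCode,Description]' \
        --output text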