amazon-web-services aws-cloudformation amazon-elb aws-auto-scaling

Prevent Auto Scaling Group From Reporting Success to CloudFormation Until Both EC2 and ELB Health Checks Pass At Least Once


I am creating an Auto Scaling Group (ASG) with a CloudFormation (CF) template, and have enabled both EC2 and ELB health checks. I would like CF to refrain from marking the ASG as successfully deployed until the ASG meets its minimum instance count with both the EC2 and ELB health checks taken into account. However, this is not the behavior I'm seeing.

If, for instance, the first instances in the ASG fail their EC2 health checks, the ASG creates new instances as expected until it meets its minimum requirement. In the meantime, the ASG is listed as a CREATE_IN_PROGRESS resource from CF's perspective. If the ASG never meets its minimum healthy instance count, the ASG remains in CREATE_IN_PROGRESS indefinitely. Although this is not an ideal outcome, at least it makes the problem obvious and will eventually trigger human intervention.

If, however, the first instances pass their EC2 health checks but fail their ELB health checks, the ASG creates new instances as before until it meets its minimum requirement. This time, however, the ASG is listed as CREATE_COMPLETE from CF's perspective, even before the HealthCheckGracePeriod has elapsed. Long after the CF stack has deployed and considered itself CREATE_COMPLETE, the ASG is still cycling through instances that never pass the ELB health checks. This is the behavior I would like to prevent.

According to the Auto Scaling Health Check docs:

All instances in your Auto Scaling group start in the healthy state. Instances are assumed to be healthy unless Amazon EC2 Auto Scaling receives notification that they are unhealthy. This notification can come from one or more of the following sources: Amazon EC2, Elastic Load Balancing (ELB), or a custom health check.

This description matches the behavior I'm seeing, so it appears CF w/ASG is operating as designed/documented, but in my opinion, the eager reporting of health on one of the two configured criteria seems optimistic. I would expect both criteria to pass at least once before claiming success.

Here is an abbreviated CF template snippet for the ASG:

Resources:
    MyAppScalingGroup:
        Type: AWS::AutoScaling::AutoScalingGroup
        Properties:
            AutoScalingGroupName: !Ref InstanceName
            CapacityRebalance: true
            HealthCheckGracePeriod: 300 # seconds
            HealthCheckType: ELB
            LaunchTemplate:
                LaunchTemplateId: !Ref MyAppLaunchTemplate
                Version: !GetAtt MyAppLaunchTemplate.DefaultVersionNumber
            MaxSize: !Ref MaxInstances
            MinSize: !Ref MinInstances
            TargetGroupARNs: [ !Ref MyAppTargetGroup ]
            VPCZoneIdentifier: !Split [ ",", !Ref Subnets ]

Solution

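  • Adding a CreationPolicy with a ResourceSignal timeout to the ASG resource in CF produces the desired behavior. CF then keeps the ASG in CREATE_IN_PROGRESS until it receives the required number of success signals, and fails the resource if they do not arrive before the timeout. Not only does this prevent the indefinite CREATE_IN_PROGRESS from the first scenario where EC2 health checks fail, it also prevents the misleading CREATE_COMPLETE in the second scenario where EC2 health checks pass but ELB health checks fail. Note that the instances have to send those signals themselves, for example by running cfn-signal from their user data once the application is ready.

    Here is the updated template snippet with a 10-minute resource timeout applied: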

    Resources:
        MyAppScalingGroup:
            Type: AWS::AutoScaling::AutoScalingGroup
            Properties:
                AutoScalingGroupName: !Ref InstanceName
                CapacityRebalance: true
                HealthCheckGracePeriod: 300 # seconds
                HealthCheckType: ELB
                LaunchTemplate:
                    LaunchTemplateId: !Ref MyAppLaunchTemplate
                    Version: !GetAtt MyAppLaunchTemplate.DefaultVersionNumber
                MaxSize: !Ref MaxInstances
                MinSize: !Ref MinInstances
                TargetGroupARNs: [ !Ref MyAppTargetGroup ]
                VPCZoneIdentifier: !Split [ ",", !Ref Subnets ]
            CreationPolicy:
                ResourceSignal:
                    Count: !Ref MinInstances
                    Timeout: PT10M
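
    The CreationPolicy above only waits for signals; something on each instance still has to send them. As a rough sketch of that side, the launch template's user data could call cfn-signal once the application answers locally. This fragment assumes an AMI with the CloudFormation helper scripts installed at /opt/aws/bin (as on Amazon Linux), and the port and /health path are hypothetical placeholders rather than anything taken from the original template:

    ```yaml
    Resources:
        MyAppLaunchTemplate:
            Type: AWS::EC2::LaunchTemplate
            Properties:
                LaunchTemplateData:
                    # Other LaunchTemplateData properties (ImageId, InstanceType, ...) omitted.
                    UserData:
                        Fn::Base64: !Sub |
                            #!/bin/bash
                            # Start the application here (omitted).

                            # Assumption: the app exposes a local health endpoint.
                            # The port and path below are hypothetical.
                            until curl --silent --fail --output /dev/null http://localhost:8080/health; do
                                sleep 5
                            done

                            # Report success for this instance to the ASG's CreationPolicy.
                            /opt/aws/bin/cfn-signal --exit-code 0 \
                                --stack ${AWS::StackName} \
                                --resource MyAppScalingGroup \
                                --region ${AWS::Region}
    ```

    If the endpoint never comes up, no signal is sent and the ASG fails creation when the timeout expires. Keep in mind that a local probe like this only approximates the ELB health check; it does not read the load balancer's own verdict on the target.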