I am creating an Auto Scaling Group (ASG) with a CloudFormation (CF) template, and have enabled both EC2 and ELB health checks. I would like CF to refrain from marking the ASG as successfully deployed until the ASG meets its minimum instance count with both the EC2 and ELB health checks passing. However, this is not the behavior I'm seeing.
If, for instance, the first instances in the ASG fail their EC2 health checks, the ASG creates new instances as expected until it meets its minimum requirement. In the meantime, the ASG is listed as a CREATE_IN_PROGRESS resource from CF's perspective. If the ASG never meets its minimum healthy instance count, the ASG remains in CREATE_IN_PROGRESS indefinitely. Although not an ideal outcome, at least it's obvious there's a problem, which will eventually trigger human intervention.
If, however, the first instances pass their EC2 health checks but fail their ELB health checks, the ASG creates new instances as before until it meets its minimum requirement. In the meantime, however, the ASG is listed as CREATE_COMPLETE from CF's perspective, even before the HealthCheckGracePeriod has elapsed. Long after the CF stack has deployed and considered itself CREATE_COMPLETE, the ASG is still cycling through instances that never pass the ELB health checks. This is the behavior I would like to prevent.
According to the Auto Scaling Health Check docs:
All instances in your Auto Scaling group start in the healthy state. Instances are assumed to be healthy unless Amazon EC2 Auto Scaling receives notification that they are unhealthy. This notification can come from one or more of the following sources: Amazon EC2, Elastic Load Balancing (ELB), or a custom health check.
This description matches the behavior I'm seeing, so it appears that CF with an ASG is operating as designed and documented. In my opinion, though, eagerly reporting success based on only one of the two configured criteria seems optimistic. I would expect both criteria to pass at least once before claiming success.
Here is an abbreviated CF template snippet for the ASG:
Resources:
  MyAppScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: !Ref InstanceName
      CapacityRebalance: true
      HealthCheckGracePeriod: 300 # seconds
      HealthCheckType: ELB
      LaunchTemplate:
        LaunchTemplateId: !Ref MyAppLaunchTemplate
        Version: !GetAtt MyAppLaunchTemplate.DefaultVersionNumber
      MaxSize: !Ref MaxInstances
      MinSize: !Ref MinInstances
      TargetGroupARNs: [ !Ref MyAppTargetGroup ]
      VPCZoneIdentifier: !Split [ ",", !Ref Subnets ]
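Adding a CreationPolicy with a timeout to the ASG resource in CF produces the desired behavior. With a ResourceSignal creation policy, CF keeps the ASG in CREATE_IN_PROGRESS until it receives the specified number of success signals, and fails the resource if they do not arrive before the timeout. Not only does this prevent the indefinite CREATE_IN_PROGRESS from the first scenario where EC2 health checks fail, it also prevents the misleading CREATE_COMPLETE in the second scenario where EC2 health checks pass but ELB health checks fail.
Here is the updated template snippet with a 10-minute resource timeout applied. Note that CreationPolicy is a resource attribute, so it sits at the same level as Type and Properties: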
Resources:
  MyAppScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      AutoScalingGroupName: !Ref InstanceName
      CapacityRebalance: true
      HealthCheckGracePeriod: 300 # seconds
      HealthCheckType: ELB
      LaunchTemplate:
        LaunchTemplateId: !Ref MyAppLaunchTemplate
        Version: !GetAtt MyAppLaunchTemplate.DefaultVersionNumber
      MaxSize: !Ref MaxInstances
      MinSize: !Ref MinInstances
      TargetGroupARNs: [ !Ref MyAppTargetGroup ]
      VPCZoneIdentifier: !Split [ ",", !Ref Subnets ]
    CreationPolicy:
      ResourceSignal:
        Count: !Ref MinInstances
        Timeout: PT10M
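A ResourceSignal creation policy only succeeds if something actually sends the signals, so each instance needs to report in once it considers itself ready. The snippet above does not show how that happens, so the following is only a rough sketch of one way it might be done from the launch template's user data. It assumes an Amazon Linux style AMI with the cfn-signal helper at /opt/aws/bin/cfn-signal, and a hypothetical application health endpoint at http://localhost:8080/health. The contents of MyAppLaunchTemplate are not part of the original template, so everything under LaunchTemplateData here is illustrative.

```yaml
Resources:
  MyAppLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        # ImageId, InstanceType, and other settings omitted; fill in your own.
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            # Assumption: the app serves a health endpoint on localhost:8080/health.
            # Poll it for up to about 5 minutes (30 attempts, 10 seconds apart).
            healthy=1
            for attempt in $(seq 1 30); do
              if curl --fail --silent --output /dev/null http://localhost:8080/health; then
                healthy=0
                break
              fi
              sleep 10
            done
            # Exit code 0 signals success to CF; any other value signals failure.
            /opt/aws/bin/cfn-signal --exit-code "$healthy" \
              --stack ${AWS::StackName} \
              --resource MyAppScalingGroup \
              --region ${AWS::Region}
```

The local check here is only a stand-in for the ELB health check, not the same thing. If the target group probes a different path or port, point the curl call at that instead, so that a success signal is a reasonable predictor of the instance passing the ELB check.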