amazon-web-services amazon-sqs autoscaling aws-auto-scaling

AWS Autoscaling on CloudWatch SQS metric problem

In my AWS account I have an ASG set up for my SQS consumer. It has a min capacity of 3 and a max capacity of 8. The termination policy is set to "default". It has 2 simple scaling policies which are attached to a CloudWatch alarm that monitors the size of the SQS queue.

Here is the threshold for the CloudWatch alarm: ApproximateNumberOfMessagesVisible >= 10 for 1 consecutive period of 300 seconds.

When the CloudWatch alarm state is "alarming" after 300 seconds, the ASG adds 1 instance, and keeps doing so until it hits the max capacity. Likewise, when the CloudWatch alarm state is "ok" after 300 seconds, the ASG removes 1 instance, and keeps doing so until it hits the min capacity.

The ASG seems to scale up to max capacity with no issues. The problem I'm running into, however, occurs when the ASG scales back down. When the alarm state goes from "alarming" back to "ok", the ASG just seems to randomly pick an instance to shut down. This is a problem if the instance it is shutting down is currently processing an SQS message.

For example, if my SQS queue has 20 visible messages then my ASG will scale up, let's say to 8 instances. Once the number of visible messages drops below 10, the alarm goes back to "ok" and the ASG starts to terminate instances. But it might pick an instance which is processing an SQS message. If it does, then that SQS message goes into my DLQ.

Has anyone run into this issue before?

Is there a way to configure the ASG to monitor the SQS queue length and only terminate instances which have finished processing a message? Maybe when the SQS alarm is "ok" and the instance has low CPU? Or should I be setting the threshold in my CloudWatch alarm to something like 2?

Solution

Your app needs to explicitly tell the ASG that an instance cannot currently be killed. Check out the docs for instance scale-in protection.

You need to do something like this before starting to process the message:

aws autoscaling set-instance-protection --instance-ids i-5f2e8a0d --auto-scaling-group-name my-asg --protected-from-scale-in

Then process your message on the protected instance i-5f2e8a0d in Auto Scaling group my-asg. Finally, deactivate instance protection when you're done processing with:

aws autoscaling set-instance-protection --instance-ids i-5f2e8a0d --auto-scaling-group-name my-asg --no-protected-from-scale-in

Once a machine is protected, the ASG will not terminate it when scaling in. Once the protection is turned off, the instance is available to be terminated again and autoscaling will continue to scale as expected. If all the instances are protected, autoscaling will not terminate any of them on scale-in (so be careful that you always turn off instance protection, or you might get stuck fully scaled up).
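To make sure protection always gets turned off again, even when processing fails, the two commands can be wrapped around the handler. This is only a rough sketch: the function names set_protection and run_protected are made up for illustration, and the instance ID and group name are just the example values from above (in practice each instance would have to supply its own ID).

```shell
#!/usr/bin/env bash
# Sketch only. ASG_NAME, INSTANCE_ID and both function names are
# illustrative assumptions, not part of the AWS CLI.

ASG_NAME="${ASG_NAME:-my-asg}"
INSTANCE_ID="${INSTANCE_ID:-i-5f2e8a0d}"

# Turn scale-in protection on or off for this instance.
# $1 is --protected-from-scale-in or --no-protected-from-scale-in.
set_protection() {
  aws autoscaling set-instance-protection \
    --instance-ids "$INSTANCE_ID" \
    --auto-scaling-group-name "$ASG_NAME" \
    "$1"
}

# Run the given command under scale-in protection, then remove the
# protection again whether the command succeeded or failed.
run_protected() {
  # If protection cannot be enabled, do not start processing at all.
  set_protection --protected-from-scale-in || return 1

  "$@"
  local status=$?

  # Always attempt to unprotect; warn if that call itself fails.
  set_protection --no-protected-from-scale-in \
    || echo "warning: could not remove scale-in protection" >&2

  # Report the handler's own exit status to the caller.
  return "$status"
}
```

You would then call something like run_protected ./process-message.sh "$message_body" for each message you receive, where process-message.sh stands in for whatever your consumer actually runs. Note that this does not cover the instance being killed hard in the middle of processing; it only covers the handler exiting with an error.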