amazon-ec2 autoscaling amazon-eks amazon-ec2-spot-market eksctl

Autoscaling group unable to allocate spot instances

I have an EKS cluster with a node group based on a launch template that uses a mixed instances distribution, with the following configuration:

region: us-west-2  
instance_distribution: [p2.xlarge, p3.2xlarge, p2.8xlarge]  
max_price: 0.9  
on_demand_percentage_above_base_capacity: 0  
on_demand_base_capacity: 0  
spot_instance_pools: 2

When the cluster autoscaler tried to scale the Auto Scaling group from 0 to 1, it hit the following error:

Launching a new EC2 instance. Status Reason: Could not launch Spot Instances. SpotMaxPriceTooLow - Your Spot request price of 0.9 is lower than the minimum required Spot request fulfillment price of 0.918. Launching EC2 instance failed. 
At the time, the spot price of p3.2xlarge happened to be 0.918.

It appears that a request for a spot p3.2xlarge was made instead of a request for an on-demand p2.xlarge (even though the on-demand price of 0.9 for p2.xlarge was less than the spot price of 0.918 for p3.2xlarge). I expected an on-demand p2.xlarge instance to be allocated instead of requesting a p3.2xlarge spot instance. Is it because I configured on_demand_percentage_above_base_capacity: 0?

More generally, I want to be able to configure the cluster to get spot instances of p2.xlarge and when not possible, request for on-demand. What is the best configuration to achieve my desired functionality?

Is the configuration of on_demand_percentage_above_base_capacity strictly enforced? If on_demand_percentage_above_base_capacity is set to 1 and my first instance is an on-demand instance, will my next few scaling requests be forced to yield spot instances only or is it more like a weightage with guidelines (e.g. if there are no spots available it will still fall back to on-demand rather than failing to fulfill the request)?

Solution

TL;DR: Auto Scaling will not fail over to On-Demand when no Spot capacity is available, but it will try to fail over the Spot capacity to other Spot instance types and Availability Zones.

I expected an on-demand p2.xlarge instance to be allocated instead of requesting a p3.2xlarge spot instance. Is it because I configured on_demand_percentage_above_base_capacity: 0?

You're correct. When you set both On-Demand settings to 0, the Auto Scaling group (ASG) will never try to launch an On-Demand instance. With a mixed instances policy, the ASG first works out how many Spot and how many On-Demand instances to launch, before it makes any other decision. See the AWS documentation for detailed examples of how this works: https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-purchase-options.html#asg-instances-distribution

More generally, I want to be able to configure the cluster to get spot instances of p2.xlarge and when not possible, request for on-demand. What is the best configuration to achieve my desired functionality?

There's no way to do that with an ASG. If you configure the ASG to launch only Spot instances, it won't fail over to On-Demand. Similarly, if you set it up to launch 50% Spot and 50% On-Demand, it still wouldn't fail over to On-Demand when no Spot capacity is available. It would launch the On-Demand half, then keep retrying the Spot half.

The best way to keep capacity issues from happening is to:

1) Enable more instance types and Availability Zones, since there is a separate capacity pool per zone per instance type.
2) Don't set the maximum Spot price. The Spot price will never exceed the On-Demand price, and when the maximum is left unset the ASG uses the On-Demand price as the cap. In your case the 0.9 cap was below the 0.918 Spot price of p3.2xlarge, which is why the request failed.

You may also want to look at the new feature that lets you add weights to the instance types you select: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ec2-auto-scaling-supports-instance-weighting/
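As a rough sketch of those two recommendations, here is one way the policy could be expressed as the `MixedInstancesPolicy` argument that boto3's `create_auto_scaling_group` accepts. The helper name `build_mixed_instances_policy`, the launch template name `gpu-nodes`, and the extra `p3.8xlarge` type are assumptions for illustration only; they are not from the question. The code only builds the dictionary and does not call AWS.

```python
def build_mixed_instances_policy(launch_template_name, instance_types, spot_pools=2):
    """Build an all-Spot MixedInstancesPolicy dict with no Spot price cap.

    Hypothetical helper: the dict keys follow the boto3
    create_auto_scaling_group request shape, the rest is illustrative.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type; more types means more Spot pools.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": "lowest-price",
            "SpotInstancePools": spot_pools,
            # "SpotMaxPrice" is deliberately left out, so the On-Demand
            # price is used as the maximum.
        },
    }


policy = build_mixed_instances_policy(
    "gpu-nodes",  # assumed launch template name
    ["p2.xlarge", "p3.2xlarge", "p2.8xlarge", "p3.8xlarge"],
)
```

Passing `MixedInstancesPolicy=policy` together with several subnets in different zones would widen the set of Spot pools the group can draw from. It still would not make the group fall back to On-Demand.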

Is the configuration of on_demand_percentage_above_base_capacity strictly enforced?

Yes, it is strictly enforced.

If on_demand_percentage_above_base_capacity is set to 1 and my first instance is an on-demand instance, will my next few scaling requests be forced to yield spot instances only

Assuming you set on_demand_base_capacity: 0, if you set on_demand_percentage_above_base_capacity to 1, the ASG will make the first instance On-Demand (the ASG always rounds up towards more On-Demand), the next 99 would be Spot, followed by 1 On-Demand, and so on. It will not fail over to On-Demand when there is no Spot capacity.
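To make the arithmetic concrete, here is a minimal sketch of that split. It assumes the rounding-up rule described above and is a simplification written for this answer, not the ASG's actual implementation; the function name `split_capacity` is hypothetical.

```python
def split_capacity(desired, on_demand_base=0, on_demand_pct_above_base=0):
    """Return (on_demand, spot) counts for a desired capacity.

    Simplified model: fill the On-Demand base first, then take the given
    percentage of the remainder as On-Demand, rounding up.
    """
    from_base = min(desired, on_demand_base)
    above_base = desired - from_base
    # Integer ceiling of above_base * pct / 100, avoiding float rounding.
    on_demand_above = (above_base * on_demand_pct_above_base + 99) // 100
    on_demand = from_base + on_demand_above
    return on_demand, desired - on_demand


for desired in (1, 100, 101):
    print(desired, split_capacity(desired, 0, 1))
```

With a base of 0 and 1% above base, this gives 1 On-Demand and 0 Spot at a desired capacity of 1, 1 On-Demand and 99 Spot at 100, and 2 On-Demand and 99 Spot at 101. The Spot share is a fixed target in each case. If the Spot instances cannot be launched, that part of the capacity stays unfilled.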