amazon-web-services amazon-ec2 amazon-ecs aws-fargate

AWS ECS task still running despite no spot instances available

While learning about AWS ECS I created a Fargate Spot cluster and defined a single service with a single task on it. I can see that a Spot request for EC2 instances was created automatically, according to the configuration I provided when spinning up the Fargate cluster.

Right now I can still access the application, even though the Spot request's history has shown MaxSpotInstanceCountExceeded since 1 second after my instance was activated. The error has been recurring every few minutes for the last 6 hours.

I read that Fargate Spot will try to spin up an alternative spot instance if a termination notification is sent to the one I'm currently using, but I don't understand how my application can still be running while the Spot request is in an error status.

Does Fargate use a different strategy for keeping the spot instances running than I thought? I didn't provide any additional capacity providers to my cluster.
In addition to the first question, are there any availability guarantees when using the Fargate Spot launch type?
How can I verify that my cluster is in fact using spot instances? I wasn't able to find this information in the AWS console or with aws-cli. The only indicator I could find was a common subnet used by the Spot request and my ECS cluster.
Why am I getting the MaxSpotInstanceCountExceeded error? I didn't spin any other spot instances so I'm surprised that Fargate exhausted the spot instance limit. Or maybe there is a different cause to this issue?
Can I modify my Fargate Spot cluster to extend the desirable EC2 types to minimize the spot-instance-unavailability issue?

Solution

It would be helpful if you could share a little bit more data about what commands you are running or where you are seeing this information, but just to clarify a few things:

Does Fargate use a different strategy for keeping the spot instances running than I thought? I didn't provide any additional capacity providers to my cluster

Fargate maintains pools of spot capacity to run Fargate Spot tasks. These pools are maintained by the Fargate service and are not something you see in your account. When a customer wants to run a Spot task, an instance is allocated from the pool to run the task in question.

This instance behaves like any other Spot instance and can be reclaimed by EC2 Spot at any time with a 2-minute warning provided before the task is terminated and the instance is returned to Spot. Of course, if the task completes before the instance is reclaimed by EC2 Spot, the task will run to completion as usual.
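To make use of that 2-minute warning, the application has to react to it. When a Fargate Spot task is interrupted, the containers in the task receive a SIGTERM, so a rough sketch of a worker that stops picking up new work could look like the following. The names `run_worker`, `process_item` and `shutdown_requested` are made up for illustration, and how much time you actually get after the signal depends on the container's stop timeout setting, so treat this as a starting point rather than a complete solution.

```python
import signal
import threading

# Set once ECS asks the container to stop (for example on a Spot interruption).
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record that a shutdown was requested."""
    shutdown_requested.set()


def install_handler():
    """Register the SIGTERM handler (must be called from the main thread)."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(process_item, items):
    """Process items one by one and stop early once shutdown is requested.

    Returns the number of items that were processed.
    """
    processed = 0
    for item in items:
        if shutdown_requested.is_set():
            break  # leave remaining work for a replacement task
        process_item(item)
        processed += 1
    return processed


if __name__ == "__main__":
    install_handler()
    count = run_worker(lambda item: print("processing", item), range(3))
    print("processed", count, "items")
```

The handler only sets a flag, and the loop checks it between items, so an item that is already in progress is finished before the worker exits.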

In addition to the first question, are there any availability guarantees when using the Fargate Spot launch type?

No, there are no availability guarantees when using Spot in any form (EC2 or Fargate). The whole point with Spot is that you get access to otherwise unused compute capacity if there is any available and that the compute capacity can be reclaimed at any time with a 2-minute warning. This is the reason why Spot is so much cheaper than regular on-demand usage.

Why am I getting the MaxSpotInstanceCountExceeded error? I didn't spin any other spot instances so I'm surprised that Fargate exhausted the spot instance limit. Or maybe there is a different cause to this issue?

Where are you getting this error? As mentioned above, Fargate manages the compute capacity used to run all Fargate tasks, on-demand and Spot alike, so the number of Fargate Spot tasks you run has no impact on the number of EC2 Spot instances you can run outside of Fargate. You will also not see any EC2 Spot instances in your account when running Fargate Spot tasks, as the corresponding spot instances live in the Fargate service accounts.
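Since the underlying instances are not visible in your account, the place to check whether a task runs on Fargate Spot is the task itself: the ECS DescribeTasks response includes a `capacityProviderName` field, which should read `FARGATE_SPOT` for such tasks. Below is a minimal sketch using boto3. The helper names and the cluster name are placeholders of mine, and the sketch assumes fewer than 100 running tasks (no pagination) and credentials already configured.

```python
def capacity_providers(describe_tasks_response):
    """Map each task ARN to its capacity provider name.

    Falls back to the launch type when the task was not started
    through a capacity provider strategy.
    """
    result = {}
    for task in describe_tasks_response.get("tasks", []):
        provider = task.get("capacityProviderName") or task.get("launchType", "UNKNOWN")
        result[task["taskArn"]] = provider
    return result


def fetch_capacity_providers(cluster):
    """Query ECS for the tasks in `cluster` (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the helper above works without it

    ecs = boto3.client("ecs")
    task_arns = ecs.list_tasks(cluster=cluster)["taskArns"]
    if not task_arns:
        return {}
    response = ecs.describe_tasks(cluster=cluster, tasks=task_arns)
    return capacity_providers(response)


if __name__ == "__main__":
    # Offline example with a hand-written response; the ARN is made up.
    sample = {
        "tasks": [
            {"taskArn": "arn:aws:ecs:example:task/abc", "capacityProviderName": "FARGATE_SPOT"},
        ]
    }
    for arn, provider in capacity_providers(sample).items():
        print(arn, "->", provider)
    # Against a real cluster (placeholder name): fetch_capacity_providers("my-cluster")
```

The same field is visible with `aws ecs describe-tasks` if you prefer the CLI.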

Can I modify my Fargate Spot cluster to extend the desirable EC2 types to minimize the spot-instance-unavailability issue?

No, you have no ability to influence what instance types are used when using Fargate.

Fargate Spot tasks failing to launch due to Spot capacity not being available is very rare (looking at the service metrics). If you can send me the corresponding task ID and the region where you saw this, I can ask the team to look at it.