AWS EC2 spot concurrent interruption notices


Context

I am hosting a computer vision API on AWS EC2. Every month, my API returns some 502/504/521 errors, and after some investigation, it turns out these are caused by spot interruptions.

My API is hosted on EC2 instances, using a mix of on-demand and spot instances to reduce costs (I'm not using a spot fleet). I implemented a Lambda function to handle spot interruption notices gracefully. This works well, except when I receive two spot interruption notices for two different instances in a short time span.

Issue Description

When there is a single interruption notice, I have 2 minutes to handle the instance termination and shut down the API on the instance gracefully. However, when I receive two interruption notices in a row:

  • The first interruption notice is handled as expected.
  • The second instance is terminated 2 minutes after the FIRST interruption notice, not 2 minutes after its own notice.

As a result, the second instance sometimes shuts down as little as 10 seconds after its own interruption notice, which doesn't leave enough time to shut down gracefully.

Steps Taken

  • I have a Lambda function that handles interruption notices and initiates a graceful shutdown.
  • Each instance should handle its interruption notice independently, but it seems the timing is shared or incorrectly handled.

Example Scenario

  1. Instance A receives an interruption notice at 12:00:00 and is scheduled for termination at 12:02:00.
  2. Instance B receives an interruption notice at 12:01:50 but is terminated at 12:02:00, just 10 seconds later instead of at 12:03:50.

Question

Is this a known issue, or am I missing something key in handling multiple interruption notices? How can I ensure each instance has its full 2-minute window for a graceful shutdown?


Solution

  • You have a couple of options here.

    1. Spot interruptions are visible in the local EC2 instance metadata, so you could rework this to have a poller local to each machine that checks that value every X seconds and initiates a clean shutdown. With this approach there is no central mechanism that could introduce a delay. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html mentions this.

    2. It sounds like your Lambda is somehow waiting for the target instance to shut down. If you have any sort of orchestration happening in response to a spot notice, you could instead have the Lambda kick off a Step Function and then return, so that it can immediately service the next notice.
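For option 1, a minimal sketch of a per-instance poller might look like the following. It assumes IMDSv2 is enabled and reads the `spot/instance-action` metadata path, which returns 404 until an interruption is scheduled. The function names, the 5-second interval, and the `on_interrupt` callback are illustrative assumptions, not something from the question; the callback is where your own graceful-shutdown logic would go.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=2.0):
    """Return the spot instance-action document, or None if no notice is pending."""
    # IMDSv2 requires a short-lived session token before reading metadata.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            # Example body: {"action": "terminate", "time": "2017-09-18T08:22:00Z"}
            return json.loads(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled for this instance
            return None
        raise


def poll_until_interrupted(on_interrupt, fetch=fetch_instance_action,
                           interval=5.0, sleep=time.sleep):
    """Poll the local metadata and call on_interrupt(action) once a notice appears."""
    while True:
        try:
            action = fetch()
        except OSError:
            # Transient metadata/network error: treat as "no notice yet" and retry.
            action = None
        if action is not None:
            on_interrupt(action)
            return action
        sleep(interval)


# Hypothetical usage on the instance (graceful_shutdown is your own function):
# poll_until_interrupted(on_interrupt=graceful_shutdown)
```

Because each instance watches its own metadata, two notices arriving close together are handled independently, and the `time` field in the document tells each instance how long it actually has.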

    Logs from your Lambda might help troubleshoot the issue.
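For option 2, a rough sketch of a Lambda that only starts a Step Functions execution and returns immediately could look like this. It assumes the function is triggered by the EventBridge "EC2 Spot Instance Interruption Warning" event (which carries the instance ID under `detail["instance-id"]`) and that the state machine ARN is supplied through a hypothetical `STATE_MACHINE_ARN` environment variable. The `make_handler` helper and the execution naming scheme are illustrative choices, and the state machine that performs the actual shutdown is not shown.

```python
import json
import os


def make_handler(sfn_client, state_machine_arn):
    """Build a handler that hands each notice to Step Functions and returns at once."""
    def handler(event, context):
        instance_id = event["detail"]["instance-id"]
        # start_execution returns as soon as the execution is accepted,
        # so the Lambda never waits for the instance to drain.
        sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            # Including the event ID keeps the name unique per notice.
            name=f"spot-{instance_id}-{event['id']}",
            input=json.dumps({
                "instance_id": instance_id,
                "notice_time": event["time"],
            }),
        )
        return {"started_for": instance_id}
    return handler


def lambda_handler(event, context):
    # boto3 is available in the Lambda Python runtime; imported here so the
    # helper above can be exercised locally with a fake client.
    import boto3
    handler = make_handler(
        boto3.client("stepfunctions"),
        os.environ["STATE_MACHINE_ARN"],  # hypothetical variable name
    )
    return handler(event, context)
```

With this shape, each notice gets its own execution, so a second notice arriving 10 seconds before the first instance terminates is not queued behind the first one.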