I am hosting a computer vision API on AWS EC2. Every month, my API returns some 502/504/521 errors, and after some investigation, it turns out these are caused by spot interruptions.
My API is hosted on EC2 instances, using a mix of on-demand and spot instances to reduce costs (I'm not using a spot fleet). I implemented a Lambda function to handle spot interruption notices gracefully. This works well, except when I receive two spot interruption notices for two different instances in a short time span.
When there is a single interruption notice, I have 2 minutes to handle the instance termination and shut down the API on the instance gracefully. However, when I receive two interruption notices in a row:
The first instance receives its interruption notice at 12:00:00 and is scheduled for termination at 12:02:00.
The second instance receives its interruption notice at 12:01:50 but is terminated at 12:02:00, just 10 seconds later, instead of at 12:03:50.
This results in the second instance sometimes shutting down as soon as 10 seconds after its own interruption notice, which doesn't leave enough time to shut down gracefully.
Is this a known issue, or am I missing something key in handling multiple interruption notices? How can I ensure each instance has its full 2-minute window for a graceful shutdown?
You have a couple of options here.
Spot interruptions are visible in the local EC2 instance metadata, so you could rework this to have a poller local to each machine that checks that value every X seconds and initiates a clean shutdown. With that approach there is no central mechanism that could introduce a delay. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-instance-termination-notices.html mentions this.
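To make the local poller idea concrete, here is a minimal sketch in Python using only the standard library. It assumes IMDSv2 is enabled on the instance and reads the `spot/instance-action` metadata path, which returns a 404 until an interruption is scheduled. The function names, the 5-second default interval, and the `on_notice` callback are my own placeholders, not anything from your setup.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_token(ttl_seconds=21600, timeout=2):
    """Request an IMDSv2 session token from the instance metadata service."""
    req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": str(ttl_seconds)},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode()


def parse_instance_action(body):
    """Parse the JSON document served at spot/instance-action."""
    return json.loads(body)


def fetch_instance_action(token, timeout=2):
    """Return the pending interruption notice, or None if there is none."""
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled yet
            return None
        raise


def poll_until_notice(on_notice, interval_seconds=5, fetch=None, sleep=time.sleep):
    """Poll until a notice appears, then call on_notice(notice) once."""
    if fetch is None:
        # Fetch a fresh token on every poll to keep the sketch simple.
        fetch = lambda: fetch_instance_action(fetch_token())
    while True:
        notice = fetch()
        if notice is not None:
            on_notice(notice)
            return notice
        sleep(interval_seconds)
```

On each instance you would run something like `poll_until_notice(start_graceful_shutdown)` as a background service, where `start_graceful_shutdown` is whatever (hypothetical) routine drains your API. Because every instance watches its own metadata, two notices arriving close together cannot hold each other up.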
It sounds like your Lambda is somehow waiting for the target instance to shut down. If you have any sort of orchestration happening in response to a spot notice, you could instead have the Lambda kick off a Step Function and then return, so that it can immediately service the next notice.
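A rough sketch of that second option, assuming the Lambda is triggered by the EventBridge "EC2 Spot Instance Interruption Warning" event and that a state machine (not shown) does the actual draining. The `DRAIN_STATE_MACHINE_ARN` environment variable, the execution name format, and the input field names are all made up for illustration.

```python
import json
import os


def build_execution_request(event, state_machine_arn):
    """Turn a spot interruption event into start_execution arguments."""
    detail = event["detail"]
    return {
        "stateMachineArn": state_machine_arn,
        # One execution per notice; the event id keeps the name unique.
        "name": f"spot-{detail['instance-id']}-{event['id']}",
        "input": json.dumps(
            {
                "instance_id": detail["instance-id"],
                "action": detail["instance-action"],
                "notice_time": event["time"],
            }
        ),
    }


def handler(event, context, sfn_client=None):
    """Start the drain workflow and return without waiting for it."""
    if sfn_client is None:
        import boto3  # available in the Lambda Python runtime

        sfn_client = boto3.client("stepfunctions")
    request = build_execution_request(event, os.environ["DRAIN_STATE_MACHINE_ARN"])
    response = sfn_client.start_execution(**request)
    return {"executionArn": response["executionArn"]}
```

Since `start_execution` only queues the workflow, the handler returns almost immediately instead of staying busy for the duration of the first instance's shutdown. The optional `sfn_client` argument is only there so the handler can be exercised with a stub; in real code you would probably create the client once at module level.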
Logs from your Lambda might help troubleshoot the issue.