amazon-web-services amazon-ec2 cloud

How much time does it take to finish a job on AWS spot instances?


This question might seem very general, but it might be helpful when we want to choose between spot and on-demand instances.

I'm new to AWS.

Let's say I have a job that I want to run on a spot instance with an expected interruption rate of alpha (5 percent, for example). On an on-demand instance of the same type, it takes t minutes to complete the job. Now I have the following questions:

  1. What does the interruption rate exactly mean? It must have something to do with the duration of using the instance. For example, if you use the spot instance for 1 hour, there is less chance to be interrupted compared to a situation where you plan to use it for 2 hours. But interruption rate is expressed as a fixed number. What does it exactly represent?
  2. If the instance gets interrupted, is there an option for automatically submitting a request for a new spot instance of the same type in the same availability zone? In that case how much time does it take on average to get a new instance? Can it be estimated using the interruption rate (In other words, is the interruption period somehow correlated to interruption rate)?
  3. If I restrict myself to only one type of instance in one availability zone, can I somehow estimate the amount of time it takes to finish the job on a spot instance? I suppose that I know the amount of time it takes to finish the job on an on-demand instance of the same type, plus, I save the job's status very frequently.

Solution

  • To answer your question in detail, let's go through it piece by piece:

    1. What does the interruption rate exactly mean? It must have something to do with the duration of using the instance. For example, if you use the spot instance for 1 hour, there is less chance to be interrupted compared to a situation where you plan to use it for 2 hours. But interruption rate is expressed as a fixed number. What does it exactly represent?

    The page at https://aws.amazon.com/ec2/spot/instance-advisor/ shows details about each instance type and its interruption rate. The interruption rate, or "Frequency of interruption", means that of all the spot instances of that type requested in the last 30 days (regardless of how long each one ran), around x% had to be reclaimed.

    From: https://aws.amazon.com/ec2/spot/instance-advisor/ Frequency of interruption represents the rate at which Spot has reclaimed capacity during the trailing month. They are in ranges of < 5%, 5-10%, 10-15%, 15-20% and >20%.

    To picture this, imagine you take a train with a 50% ticket discount, with the one caveat that if the train fills up and someone who paid full price wants to board, you have to get off (so that person can get on without the train adding capacity). If you had to get off on 4 out of 20 trips (regardless of distance traveled), then 20% of your trips were "reclaimed".

    Conclusion: it doesn't matter how long your job needs to run. Each EC2 instance type and size you choose for spot has its own rate at which you might get interrupted.

    • In us-east-1, i3.large spot instances have a >20% chance of getting reclaimed
    • In us-east-1, i4g.large spot instances have a <5% chance of getting reclaimed

    Given the similarities in specs, you should choose i4g.large.
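The ratio from the train analogy, and the way a rate maps onto the Advisor's published ranges, can be sketched in Python as follows. The function names are hypothetical, and the handling of exact boundary values (for example, which range exactly 20% belongs to) is an assumption, since the quoted ranges do not say.

```python
def interruption_rate(reclaimed: int, total: int) -> float:
    """Fraction of trips (or spot instances) that were reclaimed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return reclaimed / total


def advisor_range(rate: float) -> str:
    """Map a rate to one of the ranges quoted from the Instance Advisor.

    Assumption: boundary values are assigned as coded below; the
    quoted ranges do not define this.
    """
    if rate < 0.05:
        return "<5%"
    if rate < 0.10:
        return "5-10%"
    if rate < 0.15:
        return "10-15%"
    if rate <= 0.20:
        return "15-20%"
    return ">20%"


# 4 reclaimed trips out of 20, as in the train analogy.
train_rate = interruption_rate(4, 20)
print(train_rate, advisor_range(train_rate))
```

Note that neither function takes the job's duration as an input, which matches the conclusion above: the published figure is a property of the instance type over the trailing month, not of your particular run.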

    2. If the instance gets interrupted, is there an option for automatically submitting a request for a new spot instance of the same type in the same availability zone? In that case how much time does it take on average to get a new instance? Can it be estimated using the interruption rate (In other words, is the interruption period somehow correlated to interruption rate)?

    You can have the request resubmitted automatically (a persistent Spot request does this), but there is no evidence or documentation on how long the replacement takes. In theory, if your instance was reclaimed, more on-demand instances have been allocated and there are no free instances of that type left. You can request a new spot instance of that exact same type, but the time to get it depends entirely on some other instance (on-demand or spot) finishing its job and freeing up a slot for you. There is no way to calculate that time at that specific moment.

    To picture that too, imagine the train we talked about is running at peak hour. You have no way to tell when the peak will be over, so if you want that exact same train, you may have to wait indefinitely.

    3. If I restrict myself to only one type of instance in one availability zone, can I somehow estimate the amount of time it takes to finish the job on a spot instance? I suppose that I know the amount of time it takes to finish the job on an on-demand instance of the same type, plus, I save the job's status very frequently.

    Restricting your job to one type of instance works against you: you have fewer options, because you can't fall back to other instance types that may still have spot capacity available.

    Picturing this, it is better to have different train options whenever you get kicked off one specific train. You might take a longer route to your destination, but at least you keep moving instead of being stuck at the station waiting for that exact same train.

    This applies not only to instance types but also to regions. Whenever you change regions, though, you might also need to consider data transfer costs, so keep that in mind.
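As a rough illustration of keeping several "train options" open, the sketch below walks an ordered list of candidates and takes the first one with capacity. Everything here is hypothetical: `first_available`, the `try_request` callback (a stand-in for whatever call actually requests a spot instance), and the availability zone names are assumptions made for the example, not AWS APIs.

```python
from typing import Callable, Iterable, Optional, Tuple

# (instance_type, availability_zone); the AZ names used below are assumed.
Candidate = Tuple[str, str]


def first_available(
    candidates: Iterable[Candidate],
    try_request: Callable[[Candidate], Optional[str]],
) -> Optional[Tuple[Candidate, str]]:
    """Try each candidate in preference order.

    try_request is a hypothetical callback that returns an instance id
    when spot capacity was granted, or None when it was not.
    """
    for candidate in candidates:
        instance_id = try_request(candidate)
        if instance_id is not None:
            return candidate, instance_id
    return None  # no candidate had capacity


# Usage with a fake request function: only i3.large has capacity.
def fake_request(candidate: Candidate) -> Optional[str]:
    return "i-fake123" if candidate[0] == "i3.large" else None


preferences = [("i4g.large", "us-east-1a"), ("i3.large", "us-east-1a")]
print(first_available(preferences, fake_request))
```

Putting the instance type with the lowest interruption range first keeps the preferred option preferred, while the later entries stop the job from stalling when that type has no capacity.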

    One more thing: you might also want to consider switching from spot to on-demand instances whenever no spot capacity is available to you (for example, if you've been waiting for 5 minutes). This should let you predict your job duration better, since the job then benefits from spot only when it is available.
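The "wait up to 5 minutes, then go on-demand" idea can be sketched like this. It is a minimal sketch under stated assumptions: `acquire_with_fallback`, `request_spot`, and `request_on_demand` are hypothetical names standing in for whatever actually launches instances, and the 15-second polling interval is an arbitrary choice.

```python
import time
from typing import Callable, Optional, Tuple


def acquire_with_fallback(
    request_spot: Callable[[], Optional[str]],
    request_on_demand: Callable[[], str],
    spot_timeout_s: float = 300.0,   # the 5-minute wait from the text
    poll_interval_s: float = 15.0,   # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, str]:
    """Return ("spot", id) if spot capacity appears before the timeout,
    otherwise ("on-demand", id).

    request_spot and request_on_demand are hypothetical callbacks;
    request_spot returns None when no spot capacity was granted.
    """
    deadline = clock() + spot_timeout_s
    while True:
        instance_id = request_spot()
        if instance_id is not None:
            return "spot", instance_id
        if clock() >= deadline:
            # Waited long enough: pay full price to keep the job moving.
            return "on-demand", request_on_demand()
        sleep(poll_interval_s)


# Usage with stand-in callbacks and no real waiting:
result = acquire_with_fallback(
    request_spot=lambda: None,          # pretend spot is never available
    request_on_demand=lambda: "i-od",
    spot_timeout_s=0.0,
    sleep=lambda s: None,
)
print(result)
```

With this policy the wait for a replacement is capped at roughly `spot_timeout_s` per interruption, which is what makes the total job duration easier to bound than with an open-ended wait for spot capacity.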