Massive spike in stolen CPU

One of our t2.medium servers suddenly started showing massive stolen CPU spikes. As far as I can tell, there was no sudden increase in traffic... the spikes just started randomly on a server that had been running happily for several weeks.

At exactly the same time, our app (based on Spring Boot) started reporting errors about our thread pool executor being full. Now, it could be that our thread pool had slowly been degrading and then suddenly reached the point where it started erroring (which would explain the errors we were seeing in our logs). However, I don't understand why Datadog would show this as "stolen CPU".

Two interesting things to note:

This is 1 of 4 servers behind a load balancer. The other servers were just fine and even after a reboot, this particular server continued to show the stolen CPU. However, when I stopped the server and allowed AWS to create a new instance, then the stolen CPU stopped.
The spikes here are at exactly 5-minute intervals. I'm still trying to debug, but as far as I can tell, there's nothing in our app (or OS config) that triggers every 5 minutes... but even if there were, why would it show up as stolen CPU?

Anyone seen anything like this before?

Thanks.

Solution

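Looks to me like you're running that t2.medium instance pretty hot. You accrue 24 CPU credits per hour on that machine. Under "Burstable Instances" later on the same page there's an equation that essentially says one CPU credit is a "CPU minute": one vCPU at 100% for one minute. So with 60 minutes in an hour, you get 24/60, or 40% of one vCPU. A t2.medium has 2 vCPUs, so that works out to a 20% baseline when averaged across both. Your CPU usage is sitting at 25-30% and spikes up to 50% every 30 minutes. If that graph is averaged across both vCPUs, you're spending credits faster than you earn them.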
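To make the credit arithmetic concrete, here is a minimal Python sketch. It assumes the t2.medium figures of 24 credits per hour and 2 vCPUs, and that utilization is reported as an average across both vCPUs. The function names are made up for illustration.

```python
# Rough sketch of T2 CPU credit arithmetic (assumed t2.medium figures).
CREDITS_PER_HOUR = 24  # credits earned per hour on a t2.medium
VCPUS = 2              # vCPU count of a t2.medium


def baseline_percent(credits_per_hour, vcpus):
    """Average utilization (across all vCPUs) that earned credits can sustain."""
    # One credit = one vCPU at 100% for one minute, and there are
    # 60 * vcpus vCPU-minutes available in an hour.
    return credits_per_hour * 100 / (60 * vcpus)


def net_credits_per_hour(avg_util_percent, credits_per_hour, vcpus):
    """Credits earned minus credits spent per hour at a given average utilization."""
    spent = avg_util_percent * vcpus * 60 / 100
    return credits_per_hour - spent


if __name__ == "__main__":
    print("baseline %:", baseline_percent(CREDITS_PER_HOUR, VCPUS))
    for util in (20, 25, 30, 50):
        net = net_credits_per_hour(util, CREDITS_PER_HOUR, VCPUS)
        print(f"{util}% average utilization -> {net:+.1f} credits/hour")
```

Any utilization with a negative net figure drains the credit balance over time, and once the balance reaches 0 the instance is held to its baseline.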

Check how many CPU credits that instance had available by the time you shut it off. Even if the instance is gone from the EC2 Instances list, its data will still be in CloudWatch under the instance ID. Graph all metrics for CPU credits and you should be able to spot the one that drops off when you terminated the instance. I bet it was at 0.
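If you'd rather pull the numbers than eyeball a graph, something like the following sketch should work. It assumes boto3 is installed and AWS credentials and a region are configured. The helper names are hypothetical and the instance ID in the usage comment is a placeholder. It asks CloudWatch for the `CPUCreditBalance` metric in the `AWS/EC2` namespace.

```python
from datetime import datetime, timedelta, timezone


def credit_balance_query(instance_id, end, hours=24):
    """Build get_metric_statistics arguments for the CPUCreditBalance metric."""
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUCreditBalance",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,  # credit metrics are reported at 5-minute granularity
        "Statistics": ["Minimum"],
    }


def lowest_credit_balance(instance_id, end, hours=24):
    """Return the lowest credit balance seen in the window, or None if no data."""
    import boto3  # third-party; imported here so the query helper works without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **credit_balance_query(instance_id, end, hours)
    )
    return min((p["Minimum"] for p in response["Datapoints"]), default=None)


# Example (placeholder instance ID; set `end` to roughly when the instance was stopped):
# lowest_credit_balance("i-0123456789abcdef0", datetime.now(timezone.utc))
```

A result at or near 0 around the time the spikes started would support the credit-exhaustion theory.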

You should upgrade to t3 instances; generally, each new generation of AWS instances has virtualization and thermal improvements that make them more cost effective for you. Even better, consider t3a instances; powered by AMD, they give you a little more bang for your buck compared to Intel. If you really want to go all out, check out t4g instances. They're quite a bit more cost effective, but since they use Arm-based processors you'll have to recompile any natively compiled code.