java, memory, jvm, heap-memory, datadog

What would be optimal thresholds for JVM Heap Memory alerts in DataDog?


I'm working on creating Datadog alerts to notify teams when their services reach a certain percentage of JVM heap memory usage. However, I'm having a hard time figuring out what thresholds would be reasonably accurate, and I can't seem to find reliable guidance in the Oracle docs or various other resources. The thresholds I have now are loosely based on an older health monitoring service we used, which is now deprecated; that service assumed a constant heap size of 1024 MB.

Our services have different heap memory sizes, so there is no constant value. This is one of the queries I'm using; this specific one is for over-allocation of heap memory for a service.

  query = "avg(last_1m):max:jvm.heap_memory{service:*-service ,env:production} by {service} / max:jvm.heap_memory_max{service:*-service, env:production} by {service} * 100 < 40"

My idea was something like this (pseudocode):

if heap in use % < 40% {
    message: "Over-allocation; GCs can kill the app"
    type: "critical"
}
if heap in use % > 70% AND heap in use % < 85% {
    message: "Monitor for potential memory issues"
    type: "warning"
}
if heap in use % > 85% {
    message: "Risk of performance issues due to longer GC pauses; possible OutOfMemoryErrors"
    type: "critical"
}
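Translated into the same form as the over-allocation query above, the warning and high-usage conditions would look roughly like this (same evaluation window and filters as above; only the comparison changes, and whether 70/85 are the right numbers is exactly what I'm unsure about):

    # warning: heap usage creeping up (70-85% band)
    query = "avg(last_1m):max:jvm.heap_memory{service:*-service, env:production} by {service} / max:jvm.heap_memory_max{service:*-service, env:production} by {service} * 100 > 70"

    # critical: heap usage above 85%
    query = "avg(last_1m):max:jvm.heap_memory{service:*-service, env:production} by {service} / max:jvm.heap_memory_max{service:*-service, env:production} by {service} * 100 > 85"

As far as I understand, Datadog metric monitors can also carry a warning threshold alongside the critical one, so the 70% warning and 85% critical could live in a single monitor instead of two.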

Please excuse any mistakes; I'm still learning and trying to understand GC and JVM heap memory.


Solution

  • Based on my experience, the thresholds you are using are fine as general settings. If you want to be a little more conservative, I would suggest setting the high-usage critical alarm at a value above 80% (but this is just me being cautious); a sketch of what that could look like follows below.

    Having said that, these values are generally good, but there can be small differences from one application to another. In some cases you may want to check the application's current behaviour and speak with the specialists on the team before configuring the alarm.
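    For example, reusing your query, the more conservative high-usage critical monitor could look like this (just a sketch; the exact cut-off is whatever you settle on with the team):

      query = "avg(last_1m):max:jvm.heap_memory{service:*-service, env:production} by {service} / max:jvm.heap_memory_max{service:*-service, env:production} by {service} * 100 > 80"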