Tags: amazon-web-services, memory-leaks, jvm, heap-memory, amazon-ecs

Inconsistent readings in AWS CloudWatch vs. JVM logs, and high memory usage


I'm having trouble monitoring memory usage in my AWS ECS (Fargate) cluster. The CloudWatch dashboard makes it look like memory is not being garbage collected correctly, so the autoscaling policy keeps adding new instances.

So I added the following logging, which runs every time I press a button in my web application (the one deployed on AWS ECS):

public static void memoryCheck() {
    // All of the figures below describe the JVM heap only, not the whole process.
    final long maxHeapSize = Runtime.getRuntime().maxMemory();       // upper bound (-Xmx)
    final long currentHeapSize = Runtime.getRuntime().totalMemory(); // heap currently committed
    final long freeMemory = Runtime.getRuntime().freeMemory();       // free space within the committed heap
    final long usedMemory = currentHeapSize - freeMemory;
    final float usagePercentage = 100f * ((float) usedMemory / (float) maxHeapSize);

    log.info("max heap size: {}", maxHeapSize);
    log.info("current heap size: {}", currentHeapSize);
    log.info("free memory: {}", freeMemory);
    log.info("used memory: {}", usedMemory);
    log.info("usage percentage: {}%", usagePercentage);
}
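
For reference, the same figures (plus the committed heap size, i.e. how much heap the JVM has actually claimed, as opposed to the part currently in use) are also available through MemoryMXBean. A minimal, self-contained sketch; it is not part of my actual logging:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapSnapshot {
    public static void main(String[] args) {
        // Heap usage as the JVM reports it over JMX: used, committed and max.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        System.out.printf("heap used:      %d MiB%n", heap.getUsed() / (1024 * 1024));
        System.out.printf("heap committed: %d MiB%n", heap.getCommitted() / (1024 * 1024));
        System.out.printf("heap max:       %d MiB%n", heap.getMax() / (1024 * 1024));
    }
}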

I ran some computations and, in the meantime, used this CloudWatch Logs Insights query:

fields MemoryUtilized, TaskId
| stats avg(100 * (MemoryUtilized / MemoryReserved)) as usedMemoryPercentage by bin(1m)
| limit 200

[ECS CloudWatch memory-utilization graph]

I compared these results with what I see in my application logs:

[2022-07-31 15:05:45,409] usage percentage: 4.680836%   
[2022-07-31 15:07:15,255] usage percentage: 28.25417%   
[2022-07-31 15:10:35,612] usage percentage: 9.401683%   
[2022-07-31 15:10:37,186] usage percentage: 12.7543%
[2022-07-31 15:11:45,724] usage percentage: 13.398229%   
[2022-07-31 15:11:46,983] usage percentage: 17.017136%   
[2022-07-31 15:12:02,926] usage percentage: 17.581753%   
[2022-07-31 15:12:04,195] usage percentage: 20.910671%   
[2022-07-31 15:12:05,142] usage percentage: 21.470863%   
[2022-07-31 15:12:06,411] usage percentage: 25.008553%   
[2022-07-31 15:12:07,426] usage percentage: 25.564808%   
[2022-07-31 15:12:08,733] usage percentage: 28.876656%   
[2022-07-31 15:12:09,682] usage percentage: 29.434296%   
[2022-07-31 15:12:11,009] usage percentage: 32.670258%   
[2022-07-31 15:12:11,884] usage percentage: 33.084824%   
[2022-07-31 15:12:13,361] usage percentage: 5.8098726%   
[2022-07-31 15:12:14,170] usage percentage: 6.2343636%   
[2022-07-31 15:12:15,369] usage percentage: 9.761842%   
[2022-07-31 15:12:16,261] usage percentage: 10.529714%   
[2022-07-31 15:12:17,516] usage percentage: 13.84009%   
[2022-07-31 15:12:18,319] usage percentage: 14.395365%   
[2022-07-31 15:12:19,648] usage percentage: 17.555946%   
[2022-07-31 15:12:38,523] usage percentage: 18.108969%   
[2022-07-31 15:12:39,829] usage percentage: 21.39381%   
[2022-07-31 15:15:00,594] usage percentage: 22.144833%   
[2022-07-31 15:15:02,086] usage percentage: 25.362867%   
[2022-07-31 15:20:40,696] usage percentage: 26.10088%   
[2022-07-31 15:20:41,987] usage percentage: 29.37069%   
[2022-07-31 15:20:53,282] usage percentage: 29.936934%   
[2022-07-31 15:20:54,626] usage percentage: 5.3704033%   
[2022-07-31 15:20:55,472] usage percentage: 5.906469%   
[2022-07-31 15:20:56,626] usage percentage: 9.137836%   
[2022-07-31 15:20:57,673] usage percentage: 9.693291%   
[2022-07-31 15:20:58,844] usage percentage: 13.019199%   
[2022-07-31 15:20:59,950] usage percentage: 13.584228%   
[2022-07-31 15:21:01,196] usage percentage: 16.747267%   
[2022-07-31 15:21:02,111] usage percentage: 17.29407%   
[2022-07-31 15:21:03,343] usage percentage: 20.61859%   
[2022-07-31 15:21:04,242] usage percentage: 21.172577%   
[2022-07-31 15:21:05,502] usage percentage: 24.336498%   
[2022-07-31 15:27:46,785] usage percentage: 25.272879%   
[2022-07-31 15:27:48,120] usage percentage: 28.3611%
[2022-07-31 15:27:50,839] usage percentage: 28.914537%   
[2022-07-31 15:27:52,085] usage percentage: 32.04773%   
[2022-07-31 15:27:54,059] usage percentage: 4.9811816%   
[2022-07-31 15:27:55,317] usage percentage: 8.339294%

As you can see, the situation looks fine in the application logs, whereas in CloudWatch the memory appears not to be garbage collected correctly, and in general there are mismatches (for example, look at 15:12:13).

What do you think could be the issue? Am I missing something?

P.S. The "experiment" was done in ECS with 1 service and 1 task, with autoscaling disabled. The Java process is started with these parameters: "-XX:+UseParallelOldGC", "-Xms11G", "-Xmx11G".


Solution

  • First, you are comparing apples and oranges: the used-heap vs. max-heap ratio from your logs doesn't really tell you anything about the MemoryUtilized/MemoryReserved ratio CloudWatch reports, even if the Java process's memory consumption is mostly due to the heap.

    See apangin's comment - his answer in particular is great: Java using much more memory than heap size (or size correctly Docker memory limit)
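
    To get a feeling for how much memory lives outside the heap in the same process, the JVM also exposes its non-heap pools and NIO buffer pools over JMX. A minimal sketch of that idea, just as an illustration (pool names vary by JVM version and GC, and it still won't show thread stacks or other native allocations - that's what NMT is for):

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;

    public class NonHeapSnapshot {
        public static void main(String[] args) {
            // Non-heap pools: Metaspace, Compressed Class Space, CodeHeap segments, ...
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getType() == MemoryType.NON_HEAP) {
                    System.out.printf("%-30s used=%d KiB committed=%d KiB%n",
                            pool.getName(),
                            pool.getUsage().getUsed() / 1024,
                            pool.getUsage().getCommitted() / 1024);
                }
            }
            // Direct and mapped NIO byte buffers are allocated outside the heap as well.
            for (BufferPoolMXBean buffers : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("buffer pool %-8s used=%d KiB%n",
                        buffers.getName(), buffers.getMemoryUsed() / 1024);
            }
        }
    }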

    UPDATE: speaking now about the heap itself, here are some things to try (if the heap is not the problem at all, stick to NMT (Native Memory Tracking), pmap, etc., but the jcmd and JFR instructions below might still be useful):

    • Get the class histogram to understand the distribution of the objects consuming heap: jcmd <pid> GC.class_histogram
    • Take a heap dump (might be really big!) - this is what I use: jcmd <pid> GC.heap_dump -all heapdump."$(date '+%F_%H-%M-%S')".hprof
      • you can then analyze it with popular tools like Java Mission Control (JMC), YourKit, etc.
    • Start a JFR recording, dump it, and analyze it with JMC. You can, again, start it with jcmd: jcmd <pid> JFR.start (a programmatic alternative is sketched just below).
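
    If you would rather start and dump the recording from inside the application instead of via jcmd, the jdk.jfr API (OpenJDK 11+) can do the same thing. A minimal sketch; the recording name, the sleep standing in for your workload, and the output path are placeholders:

    import java.nio.file.Path;
    import jdk.jfr.Configuration;
    import jdk.jfr.Recording;

    public class FlightRecorderExample {
        public static void main(String[] args) throws Exception {
            // Use the JDK's built-in "profile" settings ("default" is a lighter alternative).
            Configuration config = Configuration.getConfiguration("profile");
            try (Recording recording = new Recording(config)) {
                recording.setName("troubleshooting");
                recording.start();

                // ... run the workload you want to inspect; here we just wait a minute ...
                Thread.sleep(60_000);

                recording.stop();
                // Dump the recording to disk and open the file in JMC.
                recording.dump(Path.of("/tmp/troubleshooting.jfr"));
            }
        }
    }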

    jcmd is your friend - check jcmd <pid> help.

    Gotcha: you might not have direct access to the AWS machine to interact with the process. In that case you can tweak the Java process's startup arguments and/or use JMX. You can even do that programmatically (I have a Clojure example of that).
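
    For the programmatic route over JMX, one option is to call the remote JVM's HotSpotDiagnostic MBean, e.g. to trigger a heap dump. Below is a minimal Java sketch of that idea (not the Clojure example I mentioned); the host, port and output path are placeholders, and the target JVM must be started with the usual com.sun.management.jmxremote.* options so the JMX port is reachable:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class RemoteHeapDump {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint: the task must expose a JMX port
            // (e.g. -Dcom.sun.management.jmxremote.port=9010 and friends) and be reachable from here.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://my-ecs-task-host:9010/jmxrmi");

            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection connection = connector.getMBeanServerConnection();

                // Proxy for the HotSpot diagnostic MBean exposed by the remote JVM.
                HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                        connection, "com.sun.management:type=HotSpotDiagnostic",
                        HotSpotDiagnosticMXBean.class);

                // The path is resolved on the *remote* JVM; true = dump only live objects.
                diagnostic.dumpHeap("/tmp/heapdump.hprof", true);
            }
        }
    }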