Tags: amazon-web-services, memory-leaks, jvm, heap-memory, amazon-ecs

Inconsistent readings in AWS CloudWatch vs. JVM logs, and high memory usage


I'm having trouble monitoring memory usage in my AWS ECS (Fargate) cluster. The CloudWatch dashboard makes it look like memory is not being garbage collected correctly, so the autoscaling policy keeps adding new instances.

So I added the following logging, which runs every time I press a button in my web application (the one deployed on AWS ECS):

public static void memoryCheck() {
    // All of the figures below describe the JVM heap only, not the whole process.
    final long maxHeapSize = Runtime.getRuntime().maxMemory();       // upper bound (-Xmx)
    final long currentHeapSize = Runtime.getRuntime().totalMemory(); // heap currently committed
    final long freeMemory = Runtime.getRuntime().freeMemory();       // free space within the committed heap
    final long usedMemory = currentHeapSize - freeMemory;
    final float usagePercentage = 100f * ((float) usedMemory / (float) maxHeapSize);

    log.info("max heap size: {}", maxHeapSize);
    log.info("current heap size: {}", currentHeapSize);
    log.info("free memory: {}", freeMemory);
    log.info("used memory: {}", usedMemory);
    log.info("usage percentage: {}%", usagePercentage);
}
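
For reference, the same figures (plus the committed heap size, i.e. how much heap the JVM has actually claimed, as opposed to the part currently in use) are also available through MemoryMXBean. A minimal, self-contained sketch; it is not part of my actual logging:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapSnapshot {
    public static void main(String[] args) {
        // Heap usage as the JVM reports it over JMX: used, committed and max.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        System.out.printf("heap used:      %d MiB%n", heap.getUsed() / (1024 * 1024));
        System.out.printf("heap committed: %d MiB%n", heap.getCommitted() / (1024 * 1024));
        System.out.printf("heap max:       %d MiB%n", heap.getMax() / (1024 * 1024));
    }
}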

I ran some computations and, in the meantime, used this CloudWatch Logs Insights query:

fields MemoryUtilized, TaskId
| stats avg(100 * (MemoryUtilized / MemoryReserved)) as usedMemoryPercentage by bin(1m)
| limit 200

[ECS CloudWatch memory-utilization graph]

I compared these results with what I see in my application logs:

[2022-07-31 15:05:45,409] usage percentage: 4.680836%   
[2022-07-31 15:07:15,255] usage percentage: 28.25417%   
[2022-07-31 15:10:35,612] usage percentage: 9.401683%   
[2022-07-31 15:10:37,186] usage percentage: 12.7543%
[2022-07-31 15:11:45,724] usage percentage: 13.398229%   
[2022-07-31 15:11:46,983] usage percentage: 17.017136%   
[2022-07-31 15:12:02,926] usage percentage: 17.581753%   
[2022-07-31 15:12:04,195] usage percentage: 20.910671%   
[2022-07-31 15:12:05,142] usage percentage: 21.470863%   
[2022-07-31 15:12:06,411] usage percentage: 25.008553%   
[2022-07-31 15:12:07,426] usage percentage: 25.564808%   
[2022-07-31 15:12:08,733] usage percentage: 28.876656%   
[2022-07-31 15:12:09,682] usage percentage: 29.434296%   
[2022-07-31 15:12:11,009] usage percentage: 32.670258%   
[2022-07-31 15:12:11,884] usage percentage: 33.084824%   
[2022-07-31 15:12:13,361] usage percentage: 5.8098726%   
[2022-07-31 15:12:14,170] usage percentage: 6.2343636%   
[2022-07-31 15:12:15,369] usage percentage: 9.761842%   
[2022-07-31 15:12:16,261] usage percentage: 10.529714%   
[2022-07-31 15:12:17,516] usage percentage: 13.84009%   
[2022-07-31 15:12:18,319] usage percentage: 14.395365%   
[2022-07-31 15:12:19,648] usage percentage: 17.555946%   
[2022-07-31 15:12:38,523] usage percentage: 18.108969%   
[2022-07-31 15:12:39,829] usage percentage: 21.39381%   
[2022-07-31 15:15:00,594] usage percentage: 22.144833%   
[2022-07-31 15:15:02,086] usage percentage: 25.362867%   
[2022-07-31 15:20:40,696] usage percentage: 26.10088%   
[2022-07-31 15:20:41,987] usage percentage: 29.37069%   
[2022-07-31 15:20:53,282] usage percentage: 29.936934%   
[2022-07-31 15:20:54,626] usage percentage: 5.3704033%   
[2022-07-31 15:20:55,472] usage percentage: 5.906469%   
[2022-07-31 15:20:56,626] usage percentage: 9.137836%   
[2022-07-31 15:20:57,673] usage percentage: 9.693291%   
[2022-07-31 15:20:58,844] usage percentage: 13.019199%   
[2022-07-31 15:20:59,950] usage percentage: 13.584228%   
[2022-07-31 15:21:01,196] usage percentage: 16.747267%   
[2022-07-31 15:21:02,111] usage percentage: 17.29407%   
[2022-07-31 15:21:03,343] usage percentage: 20.61859%   
[2022-07-31 15:21:04,242] usage percentage: 21.172577%   
[2022-07-31 15:21:05,502] usage percentage: 24.336498%   
[2022-07-31 15:27:46,785] usage percentage: 25.272879%   
[2022-07-31 15:27:48,120] usage percentage: 28.3611%
[2022-07-31 15:27:50,839] usage percentage: 28.914537%   
[2022-07-31 15:27:52,085] usage percentage: 32.04773%   
[2022-07-31 15:27:54,059] usage percentage: 4.9811816%   
[2022-07-31 15:27:55,317] usage percentage: 8.339294%

As you can see, the situation looks fine in the application logs, whereas in CloudWatch the memory appears not to be garbage collected correctly, and in general there are mismatches (for example, look at 15:12:13).

What do you think could be the issue? Am I missing something?

P.S. The "experiment" was done in ECS with 1 service and 1 task, with autoscaling disabled. The Java process is started with these parameters: "-XX:+UseParallelOldGC", "-Xms11G", "-Xmx11G".


Solution

  • First, you are comparing apples and oranges: the used-heap vs. max-heap ratio from your logs doesn't really tell you anything about the MemoryUtilized/MemoryReserved ratio CloudWatch reports, even if the Java process's memory consumption is mostly due to the heap.

    See apangin's comment - his answer in particular is great: Java using much more memory than heap size (or size correctly Docker memory limit)
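
    To get a feeling for how much memory lives outside the heap in the same process, the JVM also exposes its non-heap pools and NIO buffer pools over JMX. A minimal sketch of that idea, just as an illustration (pool names vary by JVM version and GC, and it still won't show thread stacks or other native allocations - that's what NMT is for):

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryPoolMXBean;
    import java.lang.management.MemoryType;

    public class NonHeapSnapshot {
        public static void main(String[] args) {
            // Non-heap pools: Metaspace, Compressed Class Space, CodeHeap segments, ...
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getType() == MemoryType.NON_HEAP) {
                    System.out.printf("%-30s used=%d KiB committed=%d KiB%n",
                            pool.getName(),
                            pool.getUsage().getUsed() / 1024,
                            pool.getUsage().getCommitted() / 1024);
                }
            }
            // Direct and mapped NIO byte buffers are allocated outside the heap as well.
            for (BufferPoolMXBean buffers : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
                System.out.printf("buffer pool %-8s used=%d KiB%n",
                        buffers.getName(), buffers.getMemoryUsed() / 1024);
            }
        }
    }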

    UPDATE: speaking now about the heap itself, here are some things to try (if the heap is not the problem at all, stick to NMT (Native Memory Tracking), pmap, etc., but the jcmd and JFR instructions below might still be useful):

    • Get the class histogram to understand the distribution of the objects consuming heap: jcmd <pid> GC.class_histogram
    • Take a heap dump (might be really big!) - this is what I use: jcmd <pid> GC.heap_dump -all heapdump."$(date '+%F_%H-%M-%S')".hprof
      • you can then analyze it with popular tools like Java Mission Control (JMC), YourKit, etc.
    • Start a JFR recording, dump it, and analyze it with JMC. You can, again, start it with jcmd: jcmd <pid> JFR.start (a programmatic alternative is sketched just below).
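
    If you would rather start and dump the recording from inside the application instead of via jcmd, the jdk.jfr API (OpenJDK 11+) can do the same thing. A minimal sketch; the recording name, the sleep standing in for your workload, and the output path are placeholders:

    import java.nio.file.Path;
    import jdk.jfr.Configuration;
    import jdk.jfr.Recording;

    public class FlightRecorderExample {
        public static void main(String[] args) throws Exception {
            // Use the JDK's built-in "profile" settings ("default" is a lighter alternative).
            Configuration config = Configuration.getConfiguration("profile");
            try (Recording recording = new Recording(config)) {
                recording.setName("troubleshooting");
                recording.start();

                // ... run the workload you want to inspect; here we just wait a minute ...
                Thread.sleep(60_000);

                recording.stop();
                // Dump the recording to disk and open the file in JMC.
                recording.dump(Path.of("/tmp/troubleshooting.jfr"));
            }
        }
    }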

    jcmd is your friend - check jcmd <pid> help.

    Gotcha: you might not have direct access to the AWS machine to interact with the process. In that case you can tweak the Java process's startup arguments and/or use JMX. You can even do that programmatically (I have a Clojure example of that).
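
    For the programmatic route over JMX, one option is to call the remote JVM's HotSpotDiagnostic MBean, e.g. to trigger a heap dump. Below is a minimal Java sketch of that idea (not the Clojure example I mentioned); the host, port and output path are placeholders, and the target JVM must be started with the usual com.sun.management.jmxremote.* options so the JMX port is reachable:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class RemoteHeapDump {
        public static void main(String[] args) throws Exception {
            // Placeholder endpoint: the task must expose a JMX port
            // (e.g. -Dcom.sun.management.jmxremote.port=9010 and friends) and be reachable from here.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://my-ecs-task-host:9010/jmxrmi");

            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection connection = connector.getMBeanServerConnection();

                // Proxy for the HotSpot diagnostic MBean exposed by the remote JVM.
                HotSpotDiagnosticMXBean diagnostic = ManagementFactory.newPlatformMXBeanProxy(
                        connection, "com.sun.management:type=HotSpotDiagnostic",
                        HotSpotDiagnosticMXBean.class);

                // The path is resolved on the *remote* JVM; true = dump only live objects.
                diagnostic.dumpHeap("/tmp/heapdump.hprof", true);
            }
        }
    }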