Tags: asp.net-core, .net-core, garbage-collection, aws-fargate

.NET Core application running on Fargate with memory issues


We are running a .NET application in Fargate via Terraform, where we specify CPU and memory in the aws_ecs_task_definition resource.

The service has just one task, e.g.:

 resource "aws_ecs_task_definition" "test" {
   ....
   cpu                      = 256
   memory                   = 512
   ....

According to the documentation, this is required for Fargate.

You can also specify cpu and memory in the container_definitions, but the documentation states that those fields are optional, and since we were already setting values at the task level we did not set them there.

We observed that memory grew after the tasks started; depending on the application, this was sometimes quite fast and sometimes happened over a longer period of time.

So we started thinking we had a memory leak and went to profile the application using the dotnet-monitor tool as a sidecar.

As part of introducing the sidecar, we set cpu and memory values for our .NET application at the container_definitions level (a rough sketch is shown below).
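
This is roughly what the updated task definition looked like; the container names, image references, and the exact cpu/memory split are illustrative rather than our real values:

 resource "aws_ecs_task_definition" "test" {
   ....
   cpu    = 256
   memory = 512

   container_definitions = jsonencode([
     {
       name      = "app"                                # our .NET application
       image     = "my-app:latest"                      # illustrative image
       cpu       = 128
       memory    = 384                                  # container-level limit that .NET picks up
       essential = true
     },
     {
       name      = "dotnet-monitor"                     # profiling sidecar
       image     = "mcr.microsoft.com/dotnet/monitor:8"
       cpu       = 128
       memory    = 128
       essential = false
     }
   ])
   ....
 }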

After we did this, we observed that the memory in our applications behaved much better.

From dotnet-monitor traces we see that when we set memory at the container_definitions level:

  1. The working set is much smaller
  2. Gen 0/1/2 GC counts are above 1 (GC is occurring earlier)
  3. Gen 0/1/2 heap sizes are smaller
  4. GC committed bytes are smaller

To summarize: when we do not set memory at the container_definitions level, memory continues to grow and no GC occurs until we are almost out of memory.

When we do set memory at the container_definitions level, GC occurs regularly and memory does not spike.

So we have a solution, but we do not understand why this is the case. We would like to know why.


Solution

  • This might be useful for future reference, since we spent a bit of time figuring it out.

    The described behavior happens because .NET doesn't (yet) understand all possible cgroups settings.

    When you set the memory limit at the task level in ECS, AWS uses something called hierarchical_memory_limit, which .NET doesn't know about - hence the incorrect estimation of the available heap size. When you set it at the container level, AWS uses (different) cgroups knobs that .NET does understand correctly.

    If you don't want to specify a memory limit at the container level, another workaround is to use the GCHeapHardLimit configuration setting to tell .NET how much memory is available (set it to something like 80% of the container memory limit to account for other memory usage); a sketch of this is shown below.
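
    For example, the limit can be passed as an environment variable in the container definition. The value is a hexadecimal byte count; 0x19000000 is 400 MiB, roughly 80% of a 512 MiB task limit. The container name and exact value here are illustrative:

     container_definitions = jsonencode([
       {
         name  = "app"   # illustrative container name
         ....
         environment = [
           {
             name  = "DOTNET_GCHeapHardLimit"   # hex byte count; the older COMPlus_ prefix also works
             value = "0x19000000"               # ~400 MiB, leaving headroom for non-GC memory
           }
         ]
       }
     ])

    The same limit can alternatively be set in runtimeconfig.json via System.GC.HeapHardLimit.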

    A nice blog post about it: https://aws.amazon.com/blogs/developer/configuring-net-garbage-collection-for-amazon-ecs-and-aws-lambda/

    Some links to related issues:
    https://github.com/dotnet/runtime/issues/83563
    https://github.com/dotnet/runtime/issues/82815