Tags: docker, asp.net-core-2.0, amazon-ecs

AWS ECS docker containers crash randomly with Error code 137 for netcoreapp2.0


We have recently migrated our .NET 4.6 Web API to netcoreapp2.0. We are using AWS ECS Docker containers to deploy our services.

A short load test works fine, but a long-running load test shows that the Docker containers recycle with error code 137.

During the entire load test, memory and CPU utilization are both normal, around 30%.

Since exit code 137 (128 + 9, i.e. the process received SIGKILL) is commonly memory related, the following fixes have been tried.

  1. Changed the garbage collection mode:

<ServerGarbageCollection>true</ServerGarbageCollection>
<ConcurrentGarbageCollection>true</ConcurrentGarbageCollection>

  2. Migrated to .NET Core 2.0.3, as it has some fixes for memory management.

FROM microsoft/dotnet:2.0.3-runtime

  3. Configured cgroup, as the Docker logs contained errors like the following:

cgroup: docker-runc (3365) created nested cgroup for controller "memory" which has incomplete hierarchy support. Nested cgroups may change behavior in the future. [ 23.104548] cgroup: "memory" requires setting use_hierarchy to 1 on the root

Our ECS task configuration is below:

  1. Number of running tasks: 2, on 2 c4.xlarge EC2 instances behind ECS.
  2. Memory soft limit: 2 GB
  3. Have also validated our healthcheck endpoint, which does not have any issues and responds fast. We even tried hard-coding the healthcheck to return 200 OK (a rough sketch of that is shown after this list).

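For reference, a hard-coded healthcheck like the one described in point 3 might look roughly like the sketch below. This is a minimal sketch only: the "/health" path and the placement inside Startup.Configure are assumptions, since the question does not show the actual endpoint.

    // Minimal sketch of a hard-coded healthcheck in ASP.NET Core 2.0.
    // The "/health" path is an assumption; the question does not name the endpoint.
    using Microsoft.AspNetCore.Builder;
    using Microsoft.AspNetCore.Http;

    public class Startup
    {
        public void Configure(IApplicationBuilder app)
        {
            // Short-circuit the healthcheck so it never touches MVC, the
            // database, or other dependencies and always answers 200 OK.
            app.Map("/health", health =>
            {
                health.Run(async context =>
                {
                    context.Response.StatusCode = 200;
                    await context.Response.WriteAsync("OK");
                });
            });

            // ... rest of the request pipeline goes here.
        }
    }
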
Some Docker logs (notice OOMKilled is false, and there are no kernel-level logs either):

 "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 137,
        "Error": "",
        "StartedAt": "2018-02-12T06:15:00.481719209Z",
        "FinishedAt": "2018-02-12T07:13:02.962733905Z"
    },

One odd observation: if we run the load test directly against the Docker container's IP and port, it works just fine. If we run it through the ALB, the crash behavior is observed.

Please let me know any other Linux command that can give me the actual reason for the process termination, or any possible fixes for the above case.


Solution

  • The actual issue was thread starvation. Our application uses threads intensively, and as the load-test load increased, the containers were starved of threads and new requests were being queued, including the healthcheck. The threads that were already executing could not finish because they were also blocked for some reason, so Docker was killing our containers due to healthcheck timeouts.

    We increased the minimum thread count to 100 in Startup.cs to fix this issue; a sketch is below.
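
    A minimal sketch of that change, assuming the call sits in the Startup constructor (the answer only says it was done in Startup.cs, not exactly where). ThreadPool.SetMinThreads raises the floor below which the thread pool will not throttle the creation of new threads:

        // Sketch of the fix; placement in the Startup constructor is an assumption.
        using System.Threading;

        public class Startup
        {
            public Startup()
            {
                // Raise the minimum worker and IOCP thread counts so the pool
                // injects threads immediately under load instead of throttling,
                // which was starving the queued healthcheck requests.
                ThreadPool.SetMinThreads(workerThreads: 100, completionPortThreads: 100);
            }
        }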