amazon-web-services, amazon-ec2, amazon-ecs, aws-step-functions

ECS / EC2 auto scaling doesn't handle two tasks run one after the other


I'm currently at my wit's end trying to figure this out.

We have a Step Functions pipeline that runs ECS tasks on a mixture of Fargate and EC2 capacity. They are all in the same cluster.

If we run a task that requires EC2 and then want to run another task that also uses EC2, we have to put a 20-minute Wait state in between for the second task to run successfully.

The cluster doesn't seem to reuse the existing EC2 instances or scale out any further when we run the second task; it fails with the error RESOURCE:MEMORY. I would expect it either to scale out more EC2 instances to match the demand, or to place the tasks on the existing instances.

The ECS cluster has a capacity provider with managed scaling on, managed termination protection on and target capacity at 100%.

The ASG has a minimum capacity of 0 and a maximum capacity of 8, with managed scaling on. The instance type is r5.4xlarge.
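
For reference, the capacity provider is set up roughly like this (a minimal boto3 sketch; the provider, cluster, and ASG names are placeholders rather than our real values):

import boto3

ecs = boto3.client("ecs")

# Capacity provider with managed scaling and managed termination protection,
# targeting 100% utilization of the ASG.
ecs.create_capacity_provider(
    name="processing-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": "arn:aws:autoscaling:REGION:ACCOUNT_ID:autoScalingGroup:UUID:autoScalingGroupName/processing-asg",
        "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
        "managedTerminationProtection": "ENABLED",
    },
)

# Attach it to the cluster and make it the default placement strategy.
ecs.put_cluster_capacity_providers(
    cluster="processing-cluster",
    capacityProviders=["processing-capacity-provider"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "processing-capacity-provider", "weight": 1}
    ],
)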

Example step function that recreates the problem:

{
  "StartAt": "Set up variables",
  "States": {
    "Set up variables": {
      "Type": "Pass",
      "Next": "Map1",
      "Result": [
        1,
        2,
        3
      ],
      "ResultPath": "$.input"
    },
    "Map1": {
      "Type": "Map",
      "Next": "Map2",
      "ItemsPath": "$.input",
      "ResultPath": null,
      "Iterator": {
        "StartAt": "Inner1",
        "States": {
          "Inner1": {
            "ResultPath": null,
            "Type": "Task",
            "TimeoutSeconds": 2000,
            "End": true,
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
              "Cluster": "arn:aws:ecs:CLUSTER_ID",
              "TaskDefinition": "processing-task",
              "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                  "Subnets": [
                    "subnet-111"
                  ]
                }
              },
              "Overrides": {
                "Memory": "110000",
                "Cpu": "4096",
                "ContainerOverrides": [
                  {
                    "Command": [
                      "sh",
                      "-c",
                      "sleep 600"
                    ],
                    "Name": "processing-task"
                  }
                ]
              }
            }
          }
        }
      }
    },
    "Map2": {
      "Type": "Map",
      "End": true,
      "ItemsPath": "$.input",
      "Iterator": {
        "StartAt": "Inner2",
        "States": {
          "Inner2": {
            "ResultPath": null,
            "Type": "Task",
            "TimeoutSeconds": 2000,
            "End": true,
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
              "Cluster": "arn:aws:ecs:CLUSTER_ID",
              "TaskDefinition": "processing-task",
              "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                  "Subnets": [
                    "subnet-111"
                  ]
                }
              },
              "Overrides": {
                "Memory": "110000",
                "Cpu": "4096",
                "ContainerOverrides": [
                  {
                    "Command": [
                      "sh",
                      "-c",
                      "sleep 600"
                    ],
                    "Name": "processing-task"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}

What I've tried so far:

I've tried changing the cooldown period for the EC2 instances, with a small amount of success. The only problem is that it now scales up too fast, and we still have to wait before running more tasks, just for a shorter time. Roughly what I changed is sketched below.
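
For clarity, the change was along these lines (a minimal boto3 sketch; the ASG name and the 60-second cooldown are placeholder values rather than our exact settings):

import boto3

autoscaling = boto3.client("autoscaling")

# Shorten the ASG's default cooldown so scaling activities can follow
# each other more quickly.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="processing-asg",
    DefaultCooldown=60,
)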

Please let me know if what we want is possible and, if so, how to do it. Thank you.


Solution

  • I very recently ran into a similar scenario with a Capacity Provider. Bursts of concurrent task placements via ECS run-task (invoked from a Lambda) were not returning task information in the response. Despite this, a task was queued in the PROVISIONING state on the cluster, where it would sit for some time and then eventually fail to start with the error RESOURCE:MEMORY.

    Speculation: It seems that the problem is related to the capacity provider's refresh interval of CapacityProviderReservation: https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/.

    CapacityProviderReservation needs to change in order for your cluster to scale out (or in) via its CloudWatch alarm (the metric is roughly the number of instances the cluster needs divided by the number it is currently running, expressed as a percentage), but bursts of task placements which exceed your total current capacity don't always seem to satisfy this requirement.

    We were able to overcome these placement failures by exponentially backing off and retrying the call to ECS run-task whenever the response contains an empty tasks[] collection; a minimal sketch of that loop follows below. This has had only a minor impact on our task placement throughput, and we haven't seen the problem recur since.
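
    Something along these lines (a boto3 sketch; the cluster, task definition, subnet, and retry limits are placeholder values rather than the exact ones we use):

    import time
    import boto3

    ecs = boto3.client("ecs")

    def run_task_with_backoff(max_attempts=6, base_delay=5):
        # Call run-task and retry with exponential backoff while ECS returns
        # an empty tasks[] collection (i.e. the task could not be placed yet).
        for attempt in range(max_attempts):
            response = ecs.run_task(
                cluster="CLUSTER_ID",
                taskDefinition="processing-task",
                networkConfiguration={
                    "awsvpcConfiguration": {"subnets": ["subnet-111"]}
                },
            )
            if response["tasks"]:
                return response["tasks"][0]
            # Nothing was placed; back off exponentially and try again.
            time.sleep(base_delay * (2 ** attempt))
        raise RuntimeError("run-task kept returning an empty tasks[] collection")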