amazon-web-services, terraform, amazon-ecs, google-tag-manager, autoscaling

Auto scaling for an ECS service


I have set up server-side GTM following this guide: https://aws-solutions-library-samples.github.io/advertising-marketing/using-google-tag-manager-for-server-side-website-analytics-on-aws.html

I am using AWS ECS task definitions and services. Downstream, I use Snowbridge to send data from AWS Kinesis to GTM (Snowplow client) via HTTP POST requests.

When the data volume is high, I occasionally get 502 errors from GTM. If I filter the data and reduce the amount being forwarded to GTM, the errors stop. What can I change on the GTM side so that high data volumes can be handled? Can I use automatic scaling in ECS?

I have already used parameters like

deployment_maximum_percent = 200

deployment_minimum_healthy_percent = 50

but the problem persists.

This is roughly what my GTM configuration looks like:

resource "aws_ecs_cluster" "gtm" {
  name = "gtm"
  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

resource "aws_ecs_task_definition" "PrimaryServerSideContainer" {
  family                   = "PrimaryServerSideContainer"
  network_mode             = "awsvpc"
  requires_compatibilities = ["FARGATE"]
  cpu                      = 2048
  memory                   = 4096
  execution_role_arn       = aws_iam_role.gtm_container_exec_role.arn
  task_role_arn            = aws_iam_role.gtm_container_role.arn
  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = "X86_64"
  }
  container_definitions = <<TASK_DEFINITION
  [
  {
    "name": "primary",
    "image": "gcr.io/cloud-tagging-10302018/gtm-cloud-image",
    "environment": [
      {
        "name": "PORT",
        "value": "80"
      },
      {
        "name": "PREVIEW_SERVER_URL",
        "value": "${var.PREVIEW_SERVER_URL}"
      },
      {
        "name": "CONTAINER_CONFIG",
        "value": "${var.CONTAINER_CONFIG}"
      }
    ],
    "cpu": 1024,
    "memory": 2048,
    "essential": true,
    "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "gtm-primary",
            "awslogs-create-group": "true",
            "awslogs-region": "eu-central-1",
            "awslogs-stream-prefix": "ecs"
          }
        },
    "portMappings" : [
        {
          "containerPort" : 80,
          "hostPort"      : 80
        }
      ]
  }
]
TASK_DEFINITION
}


resource "aws_ecs_service" "PrimaryServerSideService" {
  name             = var.primary_service_name
  cluster          = aws_ecs_cluster.gtm.id
  task_definition  = aws_ecs_task_definition.PrimaryServerSideContainer.id
  desired_count    = var.primary_service_desired_count
  launch_type      = "FARGATE"
  platform_version = "LATEST"

  scheduling_strategy = "REPLICA"

  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 50

  network_configuration {
    assign_public_ip = true
    security_groups  = [aws_security_group.gtm-security-group.id]
    subnets          = data.aws_subnets.private.ids
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.PrimaryServerSideTarget.arn
    container_name   = "primary"
    container_port   = 80
  }

  lifecycle {
    ignore_changes = [task_definition]
  }
}

resource "aws_lb" "PrimaryServerSideLoadBalancer" {
  name               = "PrimaryServerSideLoadBalancer"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.gtm-security-group.id]
  subnets            = data.aws_subnets.public.ids

  enable_deletion_protection = false
}
....


I also tried adding these:

resource "aws_appautoscaling_target" "ecs_target" {
  max_capacity       = 4
  min_capacity       = 1
  resource_id        = "service/${aws_ecs_cluster.gtm.name}/${aws_ecs_service.PrimaryServerSideService.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "ecs_policy" {
  name               = "scale-down"
  policy_type        = "StepScaling"
  resource_id        = aws_appautoscaling_target.ecs_target.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

  step_scaling_policy_configuration {
    adjustment_type         = "ChangeInCapacity"
    cooldown                = 60
    metric_aggregation_type = "Maximum"

    step_adjustment {
      metric_interval_upper_bound = 0
      scaling_adjustment          = -1
    }
  }
}

but the 502 errors persist.


Solution

  • You are looking in the right direction, and there are only two things left to do:

    1. You need to identify the metric that indicates when the service needs to scale up (most likely CPU usage)
    2. Update your resource "aws_appautoscaling_policy" "ecs_policy" so that it scales based on the metric from point 1

    Right now, your ecs_policy doesn't have any metric to scale on: it is a StepScaling policy with no CloudWatch alarm wired to it, and its single step adjustment only scales down (see the alarm sketch after the example below).

    Here is an example:

    resource "aws_appautoscaling_policy" "ecs_target_cpu" {
      name               = "application-scaling-policy-cpu"
      policy_type        = "TargetTrackingScaling"
      resource_id        = aws_appautoscaling_target.ecs_service_target.resource_id
      scalable_dimension = aws_appautoscaling_target.ecs_service_target.scalable_dimension
      service_namespace  = aws_appautoscaling_target.ecs_service_target.service_namespace
      target_tracking_scaling_policy_configuration {
        predefined_metric_specification {
          predefined_metric_type = "ECSServiceAverageCPUUtilization"
        }
        target_value = 80
      }
      depends_on = [aws_appautoscaling_target.ecs_service_target]
    }
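
    For completeness: if you want to keep the StepScaling route instead, note that a step-scaling policy only acts when a CloudWatch alarm invokes it, which is why your current ecs_policy never fires. Below is a minimal sketch of the missing wiring, assuming you keep your existing scale-down policy; the alarm name, threshold, and periods are illustrative values, not taken from your setup:

    resource "aws_cloudwatch_metric_alarm" "gtm_cpu_low" {
      alarm_name          = "gtm-primary-cpu-low"
      comparison_operator = "LessThanOrEqualToThreshold"
      evaluation_periods  = 3
      metric_name         = "CPUUtilization"
      namespace           = "AWS/ECS"
      period              = 60
      statistic           = "Average"
      threshold           = 20

      dimensions = {
        ClusterName = aws_ecs_cluster.gtm.name
        ServiceName = aws_ecs_service.PrimaryServerSideService.name
      }

      # The step-scaling policy only runs when this alarm goes into ALARM state.
      alarm_actions = [aws_appautoscaling_policy.ecs_policy.arn]
    }

    You would also need a mirrored scale-up policy and high-CPU alarm; target tracking (as in the example above) creates and manages those alarms for you, which is why it is usually the simpler choice here.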
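
    Since your 502s correlate with request volume rather than necessarily CPU, another target-tracking option is to scale on requests per task as seen by the load balancer. This is a sketch only, assuming the aws_lb_target_group.PrimaryServerSideTarget referenced by your service exists as shown earlier; the policy name and target_value are placeholders you would tune for your traffic:

    resource "aws_appautoscaling_policy" "ecs_target_requests" {
      name               = "application-scaling-policy-requests"
      policy_type        = "TargetTrackingScaling"
      resource_id        = aws_appautoscaling_target.ecs_target.resource_id
      scalable_dimension = aws_appautoscaling_target.ecs_target.scalable_dimension
      service_namespace  = aws_appautoscaling_target.ecs_target.service_namespace

      target_tracking_scaling_policy_configuration {
        predefined_metric_specification {
          predefined_metric_type = "ALBRequestCountPerTarget"
          # Label format: <alb-arn-suffix>/<target-group-arn-suffix>
          resource_label = "${aws_lb.PrimaryServerSideLoadBalancer.arn_suffix}/${aws_lb_target_group.PrimaryServerSideTarget.arn_suffix}"
        }
        # Average number of requests each task should serve per minute before adding capacity.
        target_value = 1000
      }
    }

    With ALBRequestCountPerTarget, Application Auto Scaling adds tasks as soon as the per-task request rate climbs, which can react to traffic spikes faster than CPU-based scaling.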