Search code examples
amazon-web-servicesaws-cloudformationamazon-iamamazon-ecs

ECS tasks stuck in PENDING when updating via cloudformation


I have issue with deploying ECS cluster while the the build is fine but when updating task in cloudformation. the ECSSerivce spins up 6 PENDING new task. But 6 old tasks are still RUNNING, sometimes it will start draining olds tasks and the deployment will work, but other times all the old tasks are never drained and ECSService just stuck in UPDATE_IN_PROGRESS. How do I trouble something like this ?

below is my template for the stack.

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  ElasticLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      SecurityGroups:
      - !Ref 'ELBSecurityGroup'
      Subnets:
      - !Ref 'InstanceSubnet'
      - !Ref 'SecondarySubnet'
      Scheme: internet-facing
  RedirectLoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    DependsOn: ECSServiceRole
    Properties:
      DefaultActions:
      - Type: forward
        TargetGroupArn: !Ref 'ECSTG'
      LoadBalancerArn: !Ref 'ElasticLoadBalancer'
      Port: '80'
      Protocol: HTTP
  RedirectLoadBalancerListenerRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    DependsOn: RedirectLoadBalancerListener
    Properties:
      Actions:
      - Type: forward
        TargetGroupArn: !Ref 'ECSTG'
      Conditions:
      - Field: path-pattern
        Values:
        - /
      ListenerArn: !Ref 'RedirectLoadBalancerListener'
      Priority: '1'
  LoadBalancerListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    DependsOn: ECSServiceRole
    Properties:
      Certificates:
      - CertificateArn: !Ref 'SSLCertificateId'
      DefaultActions:
      - Type: forward
        TargetGroupArn: !Ref 'ECSTG'
      LoadBalancerArn: !Ref 'ElasticLoadBalancer'
      Port: '443'
      Protocol: HTTPS
  LoadBalancerListenerRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    DependsOn: LoadBalancerListener
    Properties:
      Actions:
      - Type: forward
        TargetGroupArn: !Ref 'ECSTG'
      Conditions:
      - Field: path-pattern
        Values:
        - /
      ListenerArn: !Ref 'LoadBalancerListener'
      Priority: '1'
  ECSTG:
    DependsOn: ElasticLoadBalancer
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckIntervalSeconds: 6
      HealthCheckPath: /api/ping
      HealthCheckProtocol: HTTP
      HealthCheckTimeoutSeconds: 5
      HealthyThresholdCount: 2
      Port: 80
      Protocol: HTTP
      UnhealthyThresholdCount: 5
      VpcId: !Ref 'VPCId'
      TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: '20'
  AppSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: AppSecurityGroup
      SecurityGroupIngress:
      - IpProtocol: '-1'
        FromPort: '-1'
        ToPort: '-1'
        SourceSecurityGroupId: !Ref 'ELBSecurityGroup'
      VpcId: !Ref 'VPCId'
  Route53Entry:
    Type: AWS::Route53::RecordSetGroup
    Properties:
      HostedZoneName: !Join ['', [!Ref 'Route53HostedZone', .]]
      Comment: Zone apex alias targeted to myELB LoadBalancer.
      RecordSets:
      - Name: !Join [., [!Ref 'ApplicationHost', !Ref 'Route53HostedZone']]
        Type: A
        AliasTarget:
          HostedZoneId: !GetAtt [ElasticLoadBalancer, CanonicalHostedZoneID]
          DNSName: !GetAtt [ElasticLoadBalancer, DNSName]
  ELBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: ELBSecurityGroup
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: '443'
        ToPort: '443'
        CidrIp: 0.0.0.0/0
      - IpProtocol: tcp
        FromPort: '80'
        ToPort: '80'
        CidrIp: 0.0.0.0/0
      VpcId: !Ref 'VPCId'
  CloudWatchAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      ActionsEnabled: true
      AlarmActions:
      - arn:aws:sns:us-east-1:6xxxxxxx:instance-alarm
      ComparisonOperator: LessThanOrEqualToThreshold
      Dimensions:
      - Name: LoadBalancer
        Value: !GetAtt [ElasticLoadBalancer, LoadBalancerFullName]
      - Name: TargetGroup
        Value: !GetAtt [ECSTG, TargetGroupFullName]
      EvaluationPeriods: 5
      MetricName: HealthyHostCount
      Namespace: AWS/ApplicationELB
      Period: 60
      Statistic: Maximum
      Threshold: 0
  LowOnCreditAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      ActionsEnabled: true
      AlarmActions:
      - arn:aws:sns:us-east-1:6xxxxxx:instance-alarm
      ComparisonOperator: LessThanThreshold
      Dimensions:
      - Name: AutoScalingGroupName
        Value: !Ref 'AutoScalingGroup'
      EvaluationPeriods: 1
      MetricName: CPUCreditBalance
      Namespace: AWS/EC2
      Period: 300
      Statistic: Average
      Threshold: 15
  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      AllocatedStorage: '5'
      DBInstanceClass: db.t2.micro
      Engine: postgres
      BackupRetentionPeriod: 35
      EngineVersion: 9.5.2
      DBName: !If [RestoreDB, '', ekdb]
      MasterUsername: !Ref 'DBUser'
      MasterUserPassword: !Ref 'DBPassword'
      DBSecurityGroups:
      - !Ref 'DatabaseSecurityGroup'
      DBSubnetGroupName: !Ref 'DatabaseSubnetGroup'
      DBSnapshotIdentifier: !Ref 'DBSnapshot'
    DeletionPolicy: Snapshot
  DatabaseSecurityGroup:
    Type: AWS::RDS::DBSecurityGroup
    Properties:
      GroupDescription: DatabaseSecurityGroup
      DBSecurityGroupIngress:
      - EC2SecurityGroupId: !Ref 'AppSecurityGroup'
      EC2VpcId: !Ref 'VPCId'
  Redis:
    Type: AWS::ElastiCache::CacheCluster
    Properties:
      CacheNodeType: cache.t2.micro
      Engine: redis
      EngineVersion: 2.8.24
      NumCacheNodes: 1
      VpcSecurityGroupIds:
      - !Ref 'RedisSecurityGroup'
      CacheSubnetGroupName: !Ref 'RedisSubnetGroup'
  RedisSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: RedisSecurityGroup
      SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: '6379'
        ToPort: '6379'
        SourceSecurityGroupId: !Ref 'AppSecurityGroup'
      VpcId: !Ref 'VPCId'
  FrontendUser:
    Type: AWS::IAM::User
    Properties:
      Groups:
      - SynapseAppUsers
  BackendUser:
    Type: AWS::IAM::User
    Properties:
      Groups:
      - SynapseAppUsers
  FrontendUserAccessKey:
    Type: AWS::IAM::AccessKey
    Properties:
      UserName: !Ref 'FrontendUser'
  BackendUserAccessKey:
    Type: AWS::IAM::AccessKey
    Properties:
      UserName: !Ref 'BackendUser'
  S3BucketPolicy:
    Type: AWS::S3::BucketPolicy
    Properties:
      Bucket: !Ref 'S3Bucket'
      PolicyDocument:
        Statement:
        - Action: s3:GetObject
          Effect: Allow
          Resource: !Sub 'arn:aws:s3:::${S3Bucket}/*'
          Principal:
            AWS:
            - !GetAtt 'FrontendUser.Arn'
            - !GetAtt 'BackendUser.Arn'
        - Action: s3:PutObject
          Effect: Allow
          Resource: !Sub 'arn:aws:s3:::${S3Bucket}/*'
          Principal:
            AWS:
            - !GetAtt 'BackendUser.Arn'
        - Action: s3:PutObjectAcl
          Effect: Allow
          Resource: !Sub 'arn:aws:s3:::${S3Bucket}/*'
          Principal:
            AWS:
            - !GetAtt 'BackendUser.Arn'
        - Action:
          - s3:PutObjectAcl
          - s3:PutObject
          - s3:GetObject
          - s3:DeleteObject
          Effect: Allow
          Resource: !Sub 'arn:aws:s3:::${S3Bucket}/*'
          Principal:
            AWS:
            - arn:aws:iam::6xxxxxxx:user/filestack-v3-policy
  S3Bucket:
    Type: AWS::S3::Bucket
    Properties:
      AccessControl: AuthenticatedRead
      CorsConfiguration:
        CorsRules:
        - AllowedHeaders:
          - '*'
          AllowedMethods:
          - GET
          - PUT
          - POST
          AllowedOrigins:
          - '*'
          ExposedHeaders:
          - ETag
          MaxAge: 3000
    DeletionPolicy: Retain
  AppIamRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - ec2.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: /
      Policies:
      - PolicyName: app-iam-role
        PolicyDocument:
          Statement:
          - Effect: Allow
            Action:
            - ecs:*
            - ecr:*
            - sns:*
            - logs:*
            Resource: '*'
          - Effect: Allow
            Action:
            - s3:PutObject
            - s3:GetObject
            - s3:PutObjectAcl
            - s3:DeleteObject
            Resource: !GetAtt [S3Bucket, Arn]
  AppInstanceProfile:
    Type: AWS::IAM::InstanceProfile
    Properties:
      Path: /
      Roles:
      - !Ref 'AppIamRole'
  LaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      AssociatePublicIpAddress: true
      ImageId: !FindInMap [AWSRegionToAMI, !Ref 'AWS::Region', AMIID]
      InstanceType: !If [IsExclusive, t2.medium, m4.large]
      IamInstanceProfile: !Ref 'AppInstanceProfile'
      SecurityGroups:
      - !Ref 'AppSecurityGroup'
      UserData: !Base64
        Fn::Join:
        - ''
        - - '#!/bin/bash -xe

            '
          - echo ECS_CLUSTER=
          - !Ref 'ECSCluster'
          - ' >> /etc/ecs/ecs.config

            '
          - 'yum install -y aws-cfn-bootstrap

            '
          - '/opt/aws/bin/cfn-signal -e $? '
          - '         --stack '
          - !Ref 'AWS::StackName'
          - '         --resource AutoScalingGroup '
          - '         --region '
          - !Ref 'AWS::Region'
  AutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      LaunchConfigurationName: !Ref 'LaunchConfig'
      MinSize: 1
      MaxSize: 2
      DesiredCapacity: !If [IsExclusive, 1, 2]
      VPCZoneIdentifier:
      - !Ref 'InstanceSubnet'
      HealthCheckGracePeriod: 600
      HealthCheckType: ELB
    CreationPolicy:
      ResourceSignal:
        Timeout: PT15M
    UpdatePolicy:
      AutoScalingReplacingUpdate:
        WillReplace: 'true'
  DatabaseSubnetGroup:
    Type: AWS::RDS::DBSubnetGroup
    Properties:
      DBSubnetGroupDescription: Subnet Group for database
      SubnetIds:
      - !Ref 'SecondarySubnet'
      - !Ref 'InstanceSubnet'
  RedisSubnetGroup:
    Type: AWS::ElastiCache::SubnetGroup
    Properties:
      Description: Subnet Group for Redis
      SubnetIds:
      - !Ref 'SecondarySubnet'
      - !Ref 'InstanceSubnet'
  ECSCluster:
    Type: AWS::ECS::Cluster
  ECSService:
    DependsOn:
    - RedirectLoadBalancerListener
    - LoadBalancerListener
    - AutoScalingGroup
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref 'ECSCluster'
      DesiredCount: !If [IsExclusive, 2, 6]
      Role: !Ref 'ECSServiceRole'
      TaskDefinition: !Ref 'TaskDefinition'
      LoadBalancers:
      - ContainerName: nginx
        ContainerPort: '80'
        TargetGroupArn: !Ref 'ECSTG'
  ECSServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
        - Effect: Allow
          Principal:
            Service:
            - ecs.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: /
      Policies:
      - PolicyName: ecs-service
        PolicyDocument:
          Statement:
          - Effect: Allow
            Action:
            - elasticloadbalancing:DeregisterInstancesFromLoadBalancer
            - elasticloadbalancing:DeregisterTargets
            - elasticloadbalancing:Describe*
            - elasticloadbalancing:RegisterInstancesWithLoadBalancer
            - elasticloadbalancing:RegisterTargets
            - ec2:Describe*
            - ec2:AuthorizeSecurityGroupIngress
            Resource: '*'
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ContainerDefinitions:
      - Name: frontend
        Memory: '256'
        MemoryReservation: '32'
        Image: !Sub '6xxxxxxx0.dkr.ecr.us-east-1.amazonaws.com/frontend:${ImageTag}'
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref 'ECSLogGroup'
            awslogs-region: !Ref 'AWS::Region'
            awslogs-stream-prefix: '[frontend]'
      - Name: backend
        Memory: '1024'
        MemoryReservation: '256'
        Links:
        - xray-daemon
        Environment:
        - Name: NODE_ENV
          Value: prod
        - Name: AWS_XRAY_DAEMON_ADDRESS
          Value: "xray-daemon:2000"
        - Name: APPLICATION_URL
          Value: !Sub 'https://${ApplicationHost}.${Route53HostedZone}'
        - Name: ACCOUNTS_TOKEN
          Value: !Ref AccountsToken
        - Name: ACCOUNTS_URL
          Value: !Ref 'AccountsUrl'
        - Name: HEAP_APPLICATION_ID
          Value: '3901275559'
        - Name: HUBSPOT_API_KEY
          Value: !Ref 'HubspotApiKey'
        - Name: USER_POOL
          Value: !Ref 'UserPool'
        - Name: POOL_CLIENTS
          Value: !Ref 'PoolClients'
        - Name: JWKS
          Value: !Ref 'JWKS'
        - Name: DATABASE_URL
          Value: !Sub ['postgresql://${DBUser}:${DBPassword}@${Address}:${Port}/ekdb',
            {Address: !GetAtt [Database, Endpoint.Address], Port: !GetAtt [Database,
                Endpoint.Port]}]
        - Name: REDIS_URL
          Value: !Sub ['redis://${Address}:${Port}/', {Address: !GetAtt [Redis, RedisEndpoint.Address],
              Port: !GetAtt [Redis, RedisEndpoint.Port]}]
        - Name: S3_FRONTEND_USER_ACCESS_KEY_ID
          Value: !Ref 'FrontendUserAccessKey'
        - Name: S3_FRONTEND_USER_SECRET
          Value: !GetAtt [FrontendUserAccessKey, SecretAccessKey]
        - Name: S3_BACKEND_USER_ACCESS_KEY_ID
          Value: !Ref 'BackendUserAccessKey'
        - Name: S3_BACKEND_USER_SECRET
          Value: !GetAtt [BackendUserAccessKey, SecretAccessKey]
        - Name: S3_BUCKET_NAME
          Value: !Ref 'S3Bucket'
        - Name: UPLOAD_STRATEGY
          Value: S3
        - Name: ACCOUNT_ID
          Value: !Ref 'AccountId'
        - Name: CHECK_ACCOUNT_ID
          Value: !Ref 'CheckAccountId'
        - Name: SNS_TOPIC_ARN
          Value: !Ref 'SNSTopicArn'
        Image: !Sub '6xxxxxx.dkr.ecr.us-east-1.amazonaws.com/backend:${ImageTag}'
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref 'ECSLogGroup'
            awslogs-region: !Ref 'AWS::Region'
            awslogs-stream-prefix: '[backend]'
      - Name: nginx
        Memory: '256'
        MemoryReservation: '32'
        Links:
        - frontend
        - backend
        - pdf_viewer
        - preview
        Image: !Sub '67xxxxxx.dkr.ecr.us-east-1.amazonaws.com/nginx:${ImageTag}'
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref 'ECSLogGroup'
            awslogs-region: !Ref 'AWS::Region'
            awslogs-stream-prefix: '[nginx]'
        PortMappings:
        - ContainerPort: 80
      - Name: pdf_viewer
        Memory: '256'
        MemoryReservation: '32'
        Image: !Sub '6xxxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/pdf_viewer:${ImageTag}'
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref 'ECSLogGroup'
            awslogs-region: !Ref 'AWS::Region'
            awslogs-stream-prefix: '[pdf_viewer]'
      - Name: preview
        Memory: '256'
        MemoryReservation: '32'
        Image: !Sub '6xxxxxxxx.dkr.ecr.us-east-1.amazonaws.com/preview:${ImageTag}'
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref 'ECSLogGroup'
            awslogs-region: !Ref 'AWS::Region'
            awslogs-stream-prefix: '[preview]'
      - Name: xray-daemon
        Memory: '256'
        MemoryReservation: '32'
        Image: 'amazon/aws-xray-daemon'
        PortMappings:
        - ContainerPort: 2000
          HostPort: 0
          Protocol: "udp"
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: !Ref 'ECSLogGroup'
            awslogs-region: !Ref 'AWS::Region'
            awslogs-stream-prefix: '[xray-daemon]'
  ECSLogGroup:
    Type: AWS::Logs::LogGroup
Parameters:
  CheckAccountId:
    Type: String
    Description: Should user's account id be checked while logging in to the instance?
    Default: 'yes'
  Route53HostedZone:
    Type: String
  SSLCertificateId:
    Type: String
    Description: Pass SSL id from AWS Certificate Manager to pass to ELB
  ApplicationHost:
    Type: String
    Description: 'Host to be applied as follows: {host}.{Route53HostedZone}'
  DBUser:
    Type: String
    Description: Username that the database should be accessible with
  DBPassword:
    Type: String
    Description: Password that the database user should have
  HtpasswdEntry:
    Type: String
    Description: This is the file that should be htpasswd entry file
  DBSnapshot:
    Type: String
    Description: Database Snapshot ID if you want to restore DB from snapshot
    Default: ''
  VPCId:
    Type: String
    Description: VPC Id to assosiate instance to. Pass this if you want to hide the
      instances behind pre-existing VPC
    Default: vpc-355a6b51
  InstanceSubnet:
    Type: String
    Description: Subnet on which the instance should be set up. Required if VPCId
      is set
    Default: subnet-beb826c8
  SecondarySubnet:
    Type: String
    Description: Subnet on which the RDS and ElastiCache group will be set up as well.
      Required if VPCId is set
    Default: subnet-04e39239
  AccountId:
    Type: String
    Description: AccountId. used to filter out users from Auth0
  AccountsUrl:
    Type: String
    Description: Accounts url eg. https://app.getsynapse.com/
  SNSTopicArn:
    Type: String
    Description: ARN of SNS Topic that will be use to communicate between different
      parts of the infrastructure
  HubspotApiKey:
    Type: String
    Description: Hubspot api key
  UserPool:
    Type: String
    Description: Cognito UserPool
  PoolClients:
    Type: String
    Description: Cognito PoolClients
  JWKS:
    Type: String
    Description: Cognito JWKS
  ImageTag:
    Type: String
    Description: Tag of docker images
  AccountsToken:
    Type: String
    Description: Token used for authenticating with Accounts
Conditions:
  RestoreDB: !Not [!Equals [!Ref 'DBSnapshot', '']]
  IsExclusive: !Not [!Equals [!Ref 'AccountId', N/a]]
Outputs:
  InstanceURL:
    Value: !Join ['', ["https://", !Ref 'ApplicationHost', ., !Ref 'Route53HostedZone']]
Mappings:
  AWSRegionToAMI:
    us-east-1:
      AMIID: ami-a7a242da
    us-east-2:
      AMIID: ami-b86a5ddd
    us-west-1:
      AMIID: none
    us-west-2:
      AMIID: none
    eu-west-1:
      AMIID: none
    eu-central-1:
      AMIID: none
    ap-northeast-1:
      AMIID: none
    ap-southeast-1:
      AMIID: none
    ap-southeast-2:
      AMIID: none

Solution

  • Based on the comments the issue seems to be related to MaximumPercent and MinimumHealthyPercent parameters and their default values of 200 and 100 respectively:

    • MaximumPercent: If a service is using the rolling update (ECS) deployment type, the maximum percent parameter represents an upper limit on the number of tasks in a service that are allowed in the RUNNING or PENDING state during a deployment.

    • MinimumHealthyPercent: If a service is using the rolling update (ECS) deployment type, the minimum healthy percent represents a lower limit on the number of tasks in a service that must remain in the RUNNING state during a deployment.

    The default values of 200 and 100 mean that for a service of size of 6 tasks, during the deployment there will be 12 tasks running at one point. This seems too much for the container instances to accommodate.

    A proposed solution is to change the values to 150 and 50, resulting in total of 6 tasks running during deployment (3 new and 3 old) until deployment finishes.