Queue policy removed from SQS queue on CloudFormation stack update


I observed strange behaviour when updating a CloudFormation stack today and was wondering if I was doing something wrong. Simplified example below.

Repro Steps

Step 1: Create a stack with two queues and a single queue policy for both queues.

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  Cloudformation queue policy change bug repro step one: create stack with
  this template.
Resources:

  FirstQueue:
    Type: "AWS::SQS::Queue"
  
  SecondQueue:
    Type: "AWS::SQS::Queue"
  
  FirstPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties: 
      Queues: 
        - !Ref FirstQueue
        - !Ref SecondQueue
      PolicyDocument: 
        Statement: 
          - Action: 
              - "SQS:SendMessage" 
            Effect: "Deny"
            Principal: "*"

Step 2: Update the stack with a template where the original policy is only applied to the first queue and a new policy is applied to the second queue:

AWSTemplateFormatVersion: '2010-09-09'
Description: >
  Cloudformation queue policy change bug repro step two: update stack with
  this template.
Resources:

  FirstQueue:
    Type: "AWS::SQS::Queue"
  
  SecondQueue:
    Type: "AWS::SQS::Queue"
  
  FirstPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties: 
      Queues: 
        - !Ref FirstQueue
      PolicyDocument: 
        Statement: 
          - Action: 
              - "SQS:SendMessage" 
            Effect: "Deny"
            Principal: "*"
  
  SecondPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties: 
      Queues: 
        - !Ref SecondQueue
      PolicyDocument: 
        Statement: 
          - Action: 
              - "SQS:ReceiveMessage"
            Effect: "Deny"
            Principal: "*"

Outcome

The outcome I would expect after the stack update is that each queue would have its own queue policy. Specifically, I would expect the second queue to have the new second policy applied to it. What I observe instead is that the second queue has an empty policy.

[Screenshot: the second queue showing an empty queue policy]

If I look at the event history in CloudTrail, I see CloudFormation makes two SetQueueAttributes requests on the second queue during the stack update:

  1. one where it sets the "second policy" as I would expect; and
  2. a second one where it clears the policy:
{
    "eventVersion": "1.09",
    "eventSource": "sqs.amazonaws.com",
    "eventName": "SetQueueAttributes",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    
    /* ... */
    
    "requestParameters": {
        "queueUrl": "https://sqs.ca-central-1.amazonaws.com/XXXXXXXXXXXX/BugRepro-SecondQueue-hZyO53RvsmdK",
        "attributes": {
            "Policy": ""
        }
    },

    /* ... */
}

Question

To me, it seems like CloudFormation doesn't realize the policy is being replaced on the second queue so, instead of just setting the new policy, it both sets it (to the new one) and clears it (to remove the old one). Am I missing something here or doing something wrong? Is this behaviour expected?


Solution

  • I think this is caused by a simple race condition, and an unintuitive relationship between AWS::SQS::Queue and AWS::SQS::QueuePolicy.

    CloudFormation works by identifying changes to the resources defined in your template and building a directed acyclic graph (DAG) of those changes from their dependency relationships. If ResourceB depends on ResourceA, CloudFormation ensures that A is modified before B. If CloudFormation can't determine a dependency ordering between two resources, it is free to modify them in parallel, which is where the race condition comes in.
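
    As a rough illustration, here is the step-two template from the question annotated with the dependencies involved. The comments are my reading of how the references order things, not output from CloudFormation: each !Ref creates an implicit dependency, while the two policies never reference each other and so have no ordering between them.

    ```yaml
    AWSTemplateFormatVersion: '2010-09-09'
    Resources:

      FirstQueue:
        Type: "AWS::SQS::Queue"

      SecondQueue:
        Type: "AWS::SQS::Queue"

      FirstPolicy:
        Type: "AWS::SQS::QueuePolicy"
        Properties:
          Queues:
            - !Ref FirstQueue    # implicit dependency: FirstQueue before FirstPolicy
          PolicyDocument:
            Statement:
              - Action:
                  - "SQS:SendMessage"
                Effect: "Deny"
                Principal: "*"

      # Nothing here refers to FirstPolicy, and FirstPolicy does not refer to
      # SecondPolicy, so the two policies sit on separate branches of the graph.
      SecondPolicy:
        Type: "AWS::SQS::QueuePolicy"
        Properties:
          Queues:
            - !Ref SecondQueue   # implicit dependency: SecondQueue before SecondPolicy
          PolicyDocument:
            Statement:
              - Action:
                  - "SQS:ReceiveMessage"
                Effect: "Deny"
                Principal: "*"
    ```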

    But that's only part of the story. The more important part is that CloudFormation resources do not exactly match physical AWS resources.

    More often, a CloudFormation resource corresponds to one or a few AWS API calls. For example, AWS::SQS::Queue corresponds to the SQS CreateQueue API call, and the resource type has no property for the queue policy. Instead, AWS::SQS::QueuePolicy applies the policy by calling the SetQueueAttributes API on each queue it lists, which is the call that shows up in your CloudTrail history.

    I have no idea why the CloudFormation developers decided to create a separate QueuePolicy resource to do this rather than incorporating it into the Queue resource. I suspect it's because many/most of the CloudFormation resource types are auto-generated from a common API definition. That might also explain why QueuePolicy refers to Queue, rather than the other way around (it's how the API works).

    But the result is that when you removed the reference from FirstPolicy to SecondQueue, CloudFormation put the two policies on separate branches of the DAG, which allowed them to be processed in parallel. Updating FirstPolicy clears the policy on SecondQueue (the SetQueueAttributes call with an empty Policy in your CloudTrail history), while creating SecondPolicy sets the new one, so the final policy on SecondQueue depended on which of those two operations happened to run last.

    One way to solve this problem would be to update the stack twice: in the first update you'd detach SecondQueue from FirstPolicy, and in the second you'd attach it to SecondPolicy. This means that SecondQueue will not have a policy between those updates, which may be a problem if you've already deployed applications that expect that policy.
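
    A minimal sketch of the intermediate template for that first update, assuming the same resource names as in the question: SecondQueue is dropped from FirstPolicy and SecondPolicy is not declared yet. The second update would then use the step-two template from the question unchanged.

    ```yaml
    AWSTemplateFormatVersion: '2010-09-09'
    Description: >
      Intermediate update: detach SecondQueue from FirstPolicy before
      SecondPolicy is introduced.
    Resources:

      FirstQueue:
        Type: "AWS::SQS::Queue"

      SecondQueue:
        Type: "AWS::SQS::Queue"   # has no queue policy after this update

      FirstPolicy:
        Type: "AWS::SQS::QueuePolicy"
        Properties:
          Queues:
            - !Ref FirstQueue     # SecondQueue removed from this list
          PolicyDocument:
            Statement:
              - Action:
                  - "SQS:SendMessage"
                Effect: "Deny"
                Principal: "*"
    ```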

    The alternative is to declare an explicit dependency relationship between FirstPolicy and SecondPolicy using the DependsOn resource attribute. Specify that SecondPolicy depends on FirstPolicy, and CloudFormation will order the updates correctly. Note that there's still a (very small) window in which SecondQueue doesn't have a policy, because stack updates are not transactional.
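
    Sketched against the step-two template, only the SecondPolicy resource changes; the rest of the template stays as in the question. The comment wording is mine.

    ```yaml
    Resources:

      SecondPolicy:
        Type: "AWS::SQS::QueuePolicy"
        # Explicit ordering: process FirstPolicy (which clears the old policy
        # on SecondQueue) before SecondPolicy sets the new one.
        DependsOn: FirstPolicy
        Properties:
          Queues:
            - !Ref SecondQueue
          PolicyDocument:
            Statement:
              - Action:
                  - "SQS:ReceiveMessage"
                Effect: "Deny"
                Principal: "*"
    ```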

    In the long term, you don't need this dependency, so either (1) add a comment explaining that it was needed to ensure a correct stack update, or (2) remove it after the update completes successfully.