I observed strange behaviour when updating a CloudFormation stack today and was wondering if I was doing something wrong. Simplified example below.
Repro Steps
Step 1: Create a stack with two queues and a single queue policy for both queues.
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  Cloudformation queue policy change bug repro step one: create stack with
  this template.
Resources:
  FirstQueue:
    Type: "AWS::SQS::Queue"
  SecondQueue:
    Type: "AWS::SQS::Queue"
  FirstPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties:
      Queues:
        - !Ref FirstQueue
        - !Ref SecondQueue
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:SendMessage"
            Effect: "Deny"
            Principal: "*"
Step 2: Update the stack with a template where the original policy is only applied to the first queue and a new policy is applied to the second queue:
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  Cloudformation queue policy change bug repro step two: update stack with
  this template.
Resources:
  FirstQueue:
    Type: "AWS::SQS::Queue"
  SecondQueue:
    Type: "AWS::SQS::Queue"
  FirstPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties:
      Queues:
        - !Ref FirstQueue
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:SendMessage"
            Effect: "Deny"
            Principal: "*"
  SecondPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties:
      Queues:
        - !Ref SecondQueue
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:ReceiveMessage"
            Effect: "Deny"
            Principal: "*"
Outcome
The outcome I would expect after the stack update is that each queue would have its own queue policy. Specifically, I would expect the second queue to have the new second policy applied to it. What I observe instead is that the second queue has an empty policy.
If I look at the event history in CloudTrail, I see CloudFormation makes two SetQueueAttributes requests on the second queue during the stack update:
{
    "eventVersion": "1.09",
    "eventSource": "sqs.amazonaws.com",
    "eventName": "SetQueueAttributes",
    "sourceIPAddress": "cloudformation.amazonaws.com",
    "userAgent": "cloudformation.amazonaws.com",
    /* ... */
    "requestParameters": {
        "queueUrl": "https://sqs.ca-central-1.amazonaws.com/XXXXXXXXXXXX/BugRepro-SecondQueue-hZyO53RvsmdK",
        "attributes": {
            "Policy": ""
        }
    },
    /* ... */
}
Question
To me, it seems like CloudFormation doesn't realize the policy is being replaced on the second queue so, instead of just setting the new policy, it both sets it (to the new one) and clears it (to remove the old one). Am I missing something here or doing something wrong? Is this behaviour expected?
I think this is caused by a simple race condition, and an unintuitive relationship between AWS::SQS::Queue and AWS::SQS::QueuePolicy.
CloudFormation works by identifying changes to the defined resources in your template and building a directed acyclic graph (DAG) of these changes based on dependency relationships: if ResourceB depends on ResourceA, then CloudFormation will ensure that ResourceA is modified before ResourceB. If CloudFormation isn't able to determine a dependency ordering between two resources, then it is free to modify them in parallel, which is where the race condition comes in.
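As an illustration, here is a trimmed fragment of the step-two template from the question, annotated with the dependencies CloudFormation can infer from it. The comments are my reading of the template, not output from any tool.

```yaml
Resources:
  FirstQueue:
    Type: "AWS::SQS::Queue"
  SecondQueue:
    Type: "AWS::SQS::Queue"
  FirstPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties:
      Queues:
        - !Ref FirstQueue    # implicit dependency: FirstPolicy depends on FirstQueue
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:SendMessage"
            Effect: "Deny"
            Principal: "*"
  SecondPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties:
      Queues:
        - !Ref SecondQueue   # implicit dependency: SecondPolicy depends on SecondQueue
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:ReceiveMessage"
            Effect: "Deny"
            Principal: "*"
```

Nothing in this fragment links FirstPolicy and SecondPolicy to each other, so their relative order during an update is not constrained.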
But that's only part of the story. The more important part is that CloudFormation resources do not exactly match physical AWS resources.
More often, CloudFormation resources correspond to AWS API calls, often single API calls. For example, AWS::SQS::Queue corresponds to the SQS CreateQueue API call, but the AWS::SQS::Queue resource type has no property for the queue policy. Instead, the policy is applied separately through the SetQueueAttributes API.
I have no idea why the CloudFormation developers decided to create a separate QueuePolicy resource to do this rather than incorporating it into the Queue resource. I suspect it's because many/most of the CloudFormation resource types are auto-generated from a common API definition. That might also explain why QueuePolicy refers to Queue, rather than the other way around (it's how the API works).
But the result is that when you removed the reference from FirstPolicy to SecondQueue, you caused CloudFormation to put the two policies on separate branches of the DAG, which allowed their changes to be performed in parallel. That meant the final policy on SecondQueue depended on which of those two resources happened to be processed last.
One way to solve this problem would be to update the stack twice: in the first update you'd detach SecondQueue from FirstPolicy, and in the second you'd attach it to SecondPolicy. This means that SecondQueue will not have a policy between those updates, which may be a problem if you've already deployed applications that expect that policy.
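A sketch of what the template for the first of those two updates might look like, using the resource names from the question. It is the step-two template with SecondPolicy left out; the Description text is made up for illustration.

```yaml
AWSTemplateFormatVersion: '2010-09-09'
Description: >
  Intermediate update (hypothetical): detach SecondQueue from FirstPolicy
  without attaching a new policy yet.
Resources:
  FirstQueue:
    Type: "AWS::SQS::Queue"
  SecondQueue:
    Type: "AWS::SQS::Queue"
  FirstPolicy:
    Type: "AWS::SQS::QueuePolicy"
    Properties:
      Queues:
        - !Ref FirstQueue   # SecondQueue removed from this list
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:SendMessage"
            Effect: "Deny"
            Principal: "*"
```

Once that update completes, a second update with the question's step-two template adds SecondPolicy on its own, so only one resource touches the second queue in each update.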
The alternative is to declare an explicit dependency relationship between FirstPolicy and SecondPolicy using the DependsOn resource attribute. Specify that SecondPolicy depends on FirstPolicy, and CloudFormation will order the updates correctly. Note that there's still a (very small) window in which SecondQueue doesn't have a policy, because stack updates are not transactional.
In the long term, you don't need this relationship, so either (1) add a comment saying that it was needed to ensure a correct stack update, or (2) remove it after the update completes successfully.
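A minimal sketch of the explicit dependency, using the resource names from the question's step-two template. Only the second policy resource changes; the other resources stay as they are in that template. I haven't run this exact template, so treat it as a starting point.

```yaml
Resources:
  # FirstQueue, SecondQueue and FirstPolicy are unchanged from the step-two template.
  SecondPolicy:
    Type: "AWS::SQS::QueuePolicy"
    # Only needed for the update that moves SecondQueue from FirstPolicy to
    # SecondPolicy: makes CloudFormation finish changing FirstPolicy first.
    # Can be removed once that update has completed successfully.
    DependsOn: FirstPolicy
    Properties:
      Queues:
        - !Ref SecondQueue
      PolicyDocument:
        Statement:
          - Action:
              - "SQS:ReceiveMessage"
            Effect: "Deny"
            Principal: "*"
```

DependsOn sits at the same level as Type and Properties, not inside Properties.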