I need to roll back the changes from an unintentional Terraform deployment that altered three separate EKS clusters, all in the same AWS region. I want to investigate the impact thoroughly and make sure our infrastructure is restored to its intended state.
AWS CloudWatch logs are one useful source for understanding the impact, but I know a more comprehensive approach is needed. What additional avenues for investigation and remediation would the community suggest? Your expertise and assistance are greatly appreciated.
Thank you for your support.
Sorry this happened.
In my answer I will make a couple of assumptions:
In these cases, my workflow would be as follows:
cp terraform.tfstate terraform.tfstate.backup
git revert COMMIT_ID
# OR
git checkout <branch>
terraform plan
terraform apply
This works most of the time for me in AKS and EKS (if you haven't done anything too drastic).
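Here is a minimal sketch of that workflow as a script, split in two so there is a pause to read the plan before anything is applied. The `run` helper, the `DRY_RUN` variable, the function names and the `rollback.tfplan` file name are all my own placeholders, and it assumes local state (with a remote backend, `terraform state pull` would be the way to take the backup).

```shell
set -eu

# Hypothetical helper: with DRY_RUN=1 it only prints the command, otherwise it runs it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# prepare_rollback <commit-id>: back up state, revert the commit, save a plan.
prepare_rollback() {
  # Assumes local state; adjust if your state lives in a remote backend.
  run cp terraform.tfstate terraform.tfstate.backup
  run git revert --no-edit "$1"
  # Saving the plan means what gets applied later is exactly what was reviewed.
  run terraform plan -out=rollback.tfplan
  run terraform show rollback.tfplan
}

# apply_rollback: apply exactly the plan saved by prepare_rollback.
apply_rollback() {
  run terraform apply rollback.tfplan
}
```

Typical use would be `DRY_RUN=1 prepare_rollback COMMIT_ID` first to see the commands, then `prepare_rollback COMMIT_ID`, read the plan, and only then `apply_rollback`.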
In those extreme cases when it doesn't work, I take a bit more direct approach:
Back up the tfstate (always do this)
terraform import
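As a rough sketch of what that can look like, assuming the goal is to make the state match what actually exists in AWS: the `terraform state rm` step is my assumption (import refuses an address that is already in state), and the `run` helper, `DRY_RUN` variable, function name and the example address and cluster name are all hypothetical.

```shell
set -eu

# Hypothetical helper: with DRY_RUN=1 it only prints the command, otherwise it runs it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# resync_state <resource-address> <real-world-id>
resync_state() {
  # Always keep a copy of the state before touching it (assumes local state).
  run cp terraform.tfstate terraform.tfstate.backup
  # Assumption: drop the stale entry so the address is free to import again.
  run terraform state rm "$1"
  # Re-import the resource as it currently exists in the cloud account.
  run terraform import "$1" "$2"
  # The plan should now show only the differences that still need fixing.
  run terraform plan
}
```

For an EKS cluster the import ID is the cluster name, so a call might look like `resync_state aws_eks_cluster.main my-cluster`, where both names are placeholders for your own.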
That being said, always run terraform plan and evaluate what's going to change. [I once changed the k8s version, deployed the change, and it was irreversible without fully destroying and replacing the AKS cluster. It obviously involved some downtime, but the first approach I suggested still worked.]
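One rough way to make that evaluation harder to skip is to filter the plan text for resources that would be destroyed or replaced. This is only a sketch: the function name is made up, and it relies on the wording of Terraform's human-readable plan output, which is not a stable interface (the JSON output of `terraform show -json` is the machine-readable option).

```shell
# Reads human-readable plan text on stdin and prints the header lines of
# resources that would be destroyed or replaced.
# Exit status is 0 if at least one such line is found, 1 otherwise.
destructive_changes() {
  grep -E '^[[:space:]]*# .* (must be replaced|will be destroyed)'
}

# Example use against a saved plan file (file name is a placeholder):
#   terraform show -no-color rollback.tfplan | destructive_changes
```

If this prints anything for a cluster resource, that is the moment to stop and think about downtime before applying.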
To avoid this happening in the future, there are a couple of suggestions I would have:
Run terraform apply on production only from CI/CD, and use terraform plan alongside a requirement to get someone else's approval.
This approach is very effective if done correctly. Your CI/CD could be as follows:
terraform plan, and you place that as a comment on the PR (amazing example is here)
terraform plan & terraform apply.
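The two stages above could be sketched roughly like this. It assumes the PR lives on GitHub and that the GitHub CLI (`gh`) is available in the CI job; the `run` helper, `DRY_RUN` variable, function names and file names are placeholders of mine, not part of any particular CI system.

```shell
set -eu

# Hypothetical helper: with DRY_RUN=1 it only prints the command, otherwise it runs it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Stage 1, on every pull request: publish the plan so a reviewer can approve it.
# $1 is the pull request number.
plan_for_review() {
  run terraform init -input=false
  run terraform plan -input=false -no-color -out=tfplan
  run terraform show -no-color tfplan > plan.txt
  run gh pr comment "$1" --body-file plan.txt
}

# Stage 2, after approval and merge: plan again and apply that exact plan.
apply_after_approval() {
  run terraform init -input=false
  run terraform plan -input=false -out=tfplan
  run terraform apply -input=false tfplan
}
```

The approval requirement itself is not in the script; that part would come from branch protection or whatever review gate your CI/CD offers.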
It's a whole lot of process and (sometimes) a culture shock, but it significantly reduces the risk of something bad happening.
It's a lot, but I hope I helped! Always evaluate and test the suggestions, and choose the one that best suits you and your company's needs.
Update: I just did a quick Google search and found somebody combining both of these approaches (i.e. approval + canary/production).