Tags: terraform, amazon-cloudwatch, amazon-eks, kubectl

How to roll back changes on Amazon Elastic Kubernetes Service (EKS) using Terraform and mitigate the impact of an unintentional deployment


I need to roll back an unintentional Terraform deployment that altered three separate EKS clusters, all in the same AWS region. I want to investigate what changed and restore our infrastructure to its intended state.

AWS CloudWatch logs are one source of insight into the impact of this deployment, but I know a more comprehensive approach is needed. What other avenues for investigation and remediation would you recommend? Your expertise and assistance are greatly appreciated.

Thank you for your support.


Solution

  • Sorry this happened.

    In my answer I will make a couple of assumptions:

    1. You are using version control for your Terraform code (and have a trace of who changed what and when).
    2. You have access to the current Terraform state.

    In these cases, my workflow would be as follows:

    1. Back up the Terraform state. If it's local, just run the command below; if it's remote (e.g. blob storage), download a copy from the backend instead, for example with terraform state pull > terraform.tfstate.backup. This ensures that if things get any worse, you can at least roll back to the current state.
      cp terraform.tfstate terraform.tfstate.backup
      
    2. Roll back the Terraform changes OR check out the 'working' code. You may need to revert a branch, a commit, or a group of commits.
      git revert COMMIT_ID
      # OR
      git checkout <branch>
      
    3. Finally, run terraform apply on the 'working' code:
      terraform plan -out=tfplan   # review this output before applying
      terraform apply tfplan       # applies exactly the plan you reviewed
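
The three steps above can be sketched as one small script. This is only a sketch under a few assumptions: it uses terraform state pull for the backup so that it also covers remote backends, and the function name rollback_to_commit and the backup and plan file names are made up for illustration. It deliberately stops before apply so a human can read the plan first.

```shell
# Sketch only: back up state, revert a commit, then produce a plan for review.
# rollback_to_commit and the file names are illustrative, not a standard tool.
rollback_to_commit() {
  bad_commit="$1"
  backup_file="state-backup-$(date +%Y%m%d-%H%M%S).tfstate"

  # 1. Back up the current state (works for local and remote backends).
  terraform state pull > "$backup_file" || return 1

  # 2. Revert the offending commit without rewriting history.
  git revert --no-edit "$bad_commit" || return 1

  # 3. Save the plan to a file so that apply runs exactly what was reviewed.
  terraform plan -out=rollback.tfplan || return 1
  echo "Review the plan, then run: terraform apply rollback.tfplan"
}
```

Usage would be rollback_to_commit COMMIT_ID, run once in each affected cluster's Terraform directory.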
      

    This works most of the time for me in AKS and EKS (if you haven't done anything too drastic).

    In the extreme cases where it doesn't work, I take a more direct approach:

    1. Back up the tfstate (always do this)
    2. Connect to EKS (or AKS) and fix the problem manually (or, if it's related to the underlying AWS/Azure resources, for example the network configuration, go to the AWS console or Azure portal and fix it there)
    3. Then bring the changed resources back into Terraform by running terraform import
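
As a rough sketch of step 3 for EKS: the AWS provider imports a cluster by its name and a node group by cluster_name:node_group_name. The resource addresses (aws_eks_cluster.main, aws_eks_node_group.main) and the function name below are hypothetical and must match whatever your own .tf files declare. Note that terraform import only accepts addresses that are not already tracked in the state, so this applies to resources you recreated by hand (or removed from state with terraform state rm).

```shell
# Sketch only: resource addresses and names are hypothetical placeholders.
reimport_cluster() {
  cluster_name="$1"
  node_group="$2"

  # An EKS cluster is imported by name; a node group by "cluster:node_group".
  terraform import aws_eks_cluster.main "$cluster_name" || return 1
  terraform import aws_eks_node_group.main "${cluster_name}:${node_group}" || return 1

  # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending.
  rc=0
  terraform plan -detailed-exitcode || rc=$?
  case "$rc" in
    0) echo "state matches the manually fixed infrastructure" ;;
    2) echo "drift remains: update the .tf files until the plan is empty" ;;
    *) echo "terraform plan failed" >&2; return 1 ;;
  esac
}
```

The goal after importing is an empty plan: that is the sign that code, state, and the real cluster agree again.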

    That being said, always run terraform plan and evaluate what's going to change before you apply. [I once changed the k8s version, deployed the change, and it was irreversible without fully destroying and replacing the AKS cluster. It obviously involved some downtime, but the first approach I suggested still worked.]
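
One way to make that evaluation harder to skip is to scan a saved plan for destroy or replace actions before applying. The sketch below greps the human-readable plan for the "must be replaced" and "will be destroyed" lines; the helper names are hypothetical, and since that wording is not a stable interface, parsing terraform show -json is likely the sturdier route for anything long-lived.

```shell
# Sketch only: flags plans that would destroy or replace resources.
# find_destructive_changes and check_plan are hypothetical helper names.
find_destructive_changes() {
  plan_file="$1"
  # Human-readable plan lines look like:
  #   "# aws_eks_cluster.main must be replaced"
  #   "# aws_eks_node_group.main will be destroyed"
  terraform show -no-color "$plan_file" \
    | grep -E '^ *# .* (must be replaced|will be destroyed)'
}

# Fail loudly before apply if anything destructive is planned.
check_plan() {
  if find_destructive_changes "$1"; then
    echo "destructive changes found; do not apply without review" >&2
    return 1
  fi
  echo "no destroy or replace actions in $1"
}
```

A typical sequence would be terraform plan -out=tfplan, then check_plan tfplan, and only then terraform apply tfplan.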


    To avoid this happening in the future, there are a couple of suggestions I would make:

    1. Only run terraform apply on production from CI/CD, and pair terraform plan with a requirement to get someone else's approval. This approach is very effective if done correctly. Your CI/CD could work as follows:
      1. Somebody opens a PR, and your CI (or GitHub Action, or whatever else you have) gets triggered and starts running.
      2. The CI runs terraform plan and posts the output as a comment on the PR (an amazing example is here)
      3. Then you require at least one other person to approve it (and that person should look at the generated plan)
      4. Once approved and merged, another CI job runs terraform plan and terraform apply. It's a lot of process and sometimes a culture shock, but it greatly reduces the risk of something bad happening.
    2. Canary / Production releases. I haven't had much experience with this, but on paper it looks nice. The idea is that you have a 'Canary' release where you push your changes first. You test on it, and if everything goes well you release to production. It's a 'gate' that helps catch some risks and problems early. There are some pros and cons to this:
      • Pros:
        • A canary release can be used to find 'anomalies'
        • Promoting from 'Canary' to 'Production' reduces the risk of failure
      • Cons:
        • Pricing (you pay for a second environment, so roughly double)
        • Complexity (CI/CD, and other configuration)
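
The plan-then-approve flow from the first suggestion could look roughly like the two CI entry points below. This is a sketch, not a finished pipeline: ci_plan and ci_apply are hypothetical function names, and posting plan.txt as a PR comment is left out because it depends on the CI product you use.

```shell
# Sketch only: ci_plan / ci_apply are hypothetical entry points for two CI jobs.

# Job 1, runs on every pull request: produce a plan for reviewers.
ci_plan() {
  terraform init -input=false || return 1
  terraform plan -input=false -no-color -out=tfplan || return 1
  # plan.txt is what the CI system would post as a PR comment;
  # how it is posted depends on the CI product and is not shown here.
  terraform show -no-color tfplan > plan.txt
}

# Job 2, runs only after approval and merge: re-plan, then apply that exact plan.
ci_apply() {
  terraform init -input=false || return 1
  terraform plan -input=false -no-color -out=tfplan || return 1
  # Applying a saved plan file does not prompt for confirmation.
  terraform apply -input=false tfplan
}
```

Re-planning in the second job matters because the infrastructure may have changed between review and merge; applying a saved plan file keeps what runs identical to what that job just planned.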

    It's a lot, but I hope I helped! Always evaluate and test the suggestions and choose the one that best suits you and your company's needs.


    Update: I just did a quick Google search and found somebody combining both of these approaches (i.e. approval + canary/production)
