I've been having this issue with Terraform on Azure: when I build infrastructure that is a bit more involved (multiple resource groups, storage accounts, etc.), apply very often fails for any number of reasons (e.g. resources temporarily unavailable in the region).
The problem is that the abort usually happens only after TF has already created the Resource Group(s). On the next run they already exist, and Terraform fails, telling me that it tried to create a resource group that already exists.
I know I could just import the resource group into the TF state, but that is quite cumbersome and error-prone in its own right. Luckily, this usually happens during the initial run (on subsequent runs TF doesn't recreate the RG, after all), so my "fix" is just to delete the RG and let TF create it again.
Still, I think there must be a better solution. Why can't TF either roll back the changes when a failure occurs, or at least update the state file with whatever was deployed successfully?
There isn't any rollback functionality in Terraform. An automatic rollback could itself be error-prone: for example, if a new provider had just been added, the last working configuration would be missing that provider's configuration.
Terraform prefers to "roll forward" on errors, so you can fix the config and just rerun `terraform apply`.
In your case, as I understand it, some resources were created successfully before the error occurred but weren't saved to the state file. This is the real problem. The provider should reflect all successful changes in the state file; this is called partially updating state. The provider should implement "on-error" logic that saves every completed change to the state file.
I would suggest creating an issue for the provider describing missing error handling in a scenario like yours.
After the provider implements correct error handling, you could just rerun `terraform apply`, because there will be no drift between your code, state, and real infrastructure. No rollback will be necessary.
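In the meantime, if deleting the resource group is not an option, the import you mention can be made less manual. Here is a minimal sketch, assuming Terraform 1.5 or later (which supports `import` blocks) and the azurerm provider. The resource name `example`, the location, the group name, and the subscription ID are placeholders, not values from your setup:

```hcl
# Hypothetical resource group that the failed apply created in Azure
# but never recorded in the state file.
resource "azurerm_resource_group" "example" {
  name     = "example-rg"
  location = "westeurope"
}

# Asks Terraform to adopt the existing resource group on the next
# plan/apply instead of trying to create it again.
# The id is a placeholder; substitute your own subscription ID and name.
import {
  to = azurerm_resource_group.example
  id = "/subscriptions/00000000-0000-0000-0000-000000000000/resourceGroups/example-rg"
}
```

With this in place, `terraform plan` should list the resource group as an import rather than a create, and the `import` block can be removed once the apply has succeeded.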