How should Terraform provider handle resource error when it consists of multiple entities?

NOTE: I'm using the v2 SDK.

In my provider my 'resource' isn't a single API call.

My resource is actually multiple 'things'.

For example...

resource "my_resource" "example" {
    foo {
        ...
    }

    bar {
        ...
    }

    baz {
        ...
    }
}

The resource and each of the nested blocks are all separate 'things' that each have their own API calls.

So when 'creating' this resource I need to actually make multiple API calls. One API call to create the resource itself, then I need to make an API call to create a 'foo', then another API for 'bar', 'baz' etc. Finally, once those nested things are created I need to call my API one last time to activate my main resource.

The problem I've found is that if there's an error in the creation of one of the nested blocks, I'm finding the state is getting messed up and reflecting the 'planned' diff even though I return an error from the API call as part of the Create step.

I'm interested to know how other people are handling errors in a provider that has a structure like this?

I've tried using Partial(). I've also tried to trigger another Read of each 'thing' but although the final state data looks to be correct (when printing it as part of a debug run with trace logs), once I've done a read, because my 'Create' function has to return an error, the state data that's read is dropped and the original planned diff is persisted (I've even stopped returning an error altogether and tried to return just the result of the Read, which is successful, and STILL the state reflects the planned diff rather than the modified state after a Read).

Solution

Since you mentioned Partial I'm assuming for this answer that you are using the older SDKv2 rather than the modern Terraform provider framework.

The programming model for SDKv2 is for the action functions like Create to receive a mutable value representing the planned values, encapsulated in a schema.ResourceData object, and then the provider will modify that value through that wrapping object to make it describe the object that was actually created (or updated).

Terraform Core itself expects a provider to respond to the "apply" request by returning the closest possible representation of what was actually created in the remote system. If the value is returned without an error then Terraform will require that the object conforms to the plan and will raise an error saying that there's a bug in the provider if not. If the provider returns a value and an error then Terraform Core will propagate that error to the user and save whatever value was returned, as long as it matches the schema of the resource type.

Unfortunately this mismatch in models between Terraform Core and the SDK makes the situation you've described quite tricky: if you don't call d.Set at all in your Create function then by default the SDK will just return whatever values were in the plan, even if parts of it weren't actually created yet. To make your provider behave in the way that Terraform is expecting you'd need to do something like this:

At the beginning of Create, decode all of the nested block data into some local variables of data types that are useful for making the API calls you intend to make. For example, you might at this step decode the data from the ResourceData object into whatever struct types the underlying SDK expects.
Before you take any other actions, use d.Set to remove all of the blocks of the types that will require separate requests each. This means you'll need to pass an empty value of whatever type is appropriate for the type you chose for that block's value.
In your loop where you're gradually creating the separate objects that each block represents, gradually append the results into a growing set of objects representing the blocks you've already successfully created. Each time you add a new item to that set, call d.Set again to reset the attribute representing the appropriate block type to now include the object that you created.
If you get to the end without any errors then your attributes should now again describe all of the objects requested in the configuration and you can return without an error. If you encounter an error partway through then you can return that error and the SDK will automatically also return the partially-updated value encapsulated inside the ResourceData object.

If you return an accurate description of which of the blocks were created and exclude the ones that weren't then on the next plan the SDK logic should notice that some of the blocks declared in the configuration aren't present in the prior state and so it should propose to update the object to include those additional blocks. Your provider's Update function can then follow a similar principle as above to gradually append only the nested objects it successfully created, so that it'll once again return a complete set if successful or a partial set in case of any errors.

SDKv2 was optimized for the more common case where a single resource block represents a single remote API call, and so its default behavior deals with either fully-successful or fully-failed responses. Dealing with partial failure requires more subtlety that is difficult to represent in that SDK's API.

The newer Terraform Plugin Framework has a different design for these operations which separates the request data from the response data, thereby making it less confusing to return only a partial result. The Resource interface has a Create method which has a request object containing the config and the plan and a response object containing a representation of the final state.

It pre-populates the response state with the planned values similarly to SDKv2 to still handle that common case of entirely-failing vs. entirely-succeeding, but it does also allow totally overwriting that default with a locally-constructed object representing a partial result, to better support situations like yours where one resource in Terraform is representing a number of different fallible calls to the underlying API.