
Resource synchronization based on data from external source in Terraform


As part of my Terraform setup, I create users based on JSON objects, and I want to keep them synchronized.

The JSON object is the response from a custom Python script that fetches data from an external API. Based on this JSON I create users; I also need to remove them when they disappear from the source.

Python script (api.py):

import json

data = {
    "test_team": {
        "members": [
            {"email": "abc@gmail.com", "name": "abc"},
            {"email": "abcdef@gmail.com", "name": "abcdef"}
        ]
    }
}

output = json.dumps(data)
print(output)

Terraform:

data "external" "python_output" {
  program = ["python", "${path.module}/api.py"]
}

locals {
  json_data = jsondecode(data.external.python_output.result)
  
  unique_mails = distinct(flatten([
    for team_key, team_data in local.json_data : [
      for member in team_data["members"] : member["email"]
    ]
  ]))
}

resource "user" "user" {
  for_each = { for email in local.unique_mails : email => email }
  name     = each.key
  role     = "user"
}

If a user is no longer on the list (its key is missing from the JSON), Terraform should synchronize the change and remove the user (destroy the corresponding user resource).

How can I achieve this following Terraform best practices?


Solution

  • The main issue with the external data source is that the program output must be a JSON-encoded map of string keys and string values:

    The program must then produce a valid JSON object on stdout, which will be used to populate the result attribute exported to the rest of the Terraform configuration. This JSON object must again have all of its values as strings. On successful completion it must exit with status zero.

    In your case, your output is different, as it contains arrays and nested maps:

    {
      "test_team": {
        "members": [
          {
            "email": "abc@gmail.com",
            "name": "abc"
          },
          {
            "email": "abcdef@gmail.com",
            "name": "abcdef"
          }
        ]
      }
    }
    

    If you'd like to use only the external data source, you would have to format your script's output as a map of string keys and string values. I don't recommend doing that, as:

    1. it adds extra complexity to your script just to comply with Terraform internals
    2. you would still need extra decoding in Terraform anyway to turn the string/string map back into a nested structure of maps and arrays (a single jsondecode won't do that natively)
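    For completeness, here is a sketch of that workaround (which I advise against). The trick is to serialize the nested structure into a single string value, so the top-level object stays a valid string-to-string map; the `payload` key is a hypothetical name I chose for illustration:

```python
import json

data = {
    "test_team": {
        "members": [
            {"email": "abc@gmail.com", "name": "abc"},
            {"email": "abcdef@gmail.com", "name": "abcdef"}
        ]
    }
}

# the nested structure is itself JSON-encoded into a single string value,
# so the top-level object is a valid map of string keys to string values
encoded = json.dumps({"payload": json.dumps(data)})
print(encoded)
```

    On the Terraform side you would then need a double decode, e.g. `jsondecode(data.external.python_output.result.payload)` — which illustrates point 2 above.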

    In your case, I recommend using an intermediate file to store the result of your script:

    import json
    import os
    
    data = {
        "test_team": {
            "members": [
                {"email": "abc@gmail.com", "name": "abc"},
                {"email": "abcdef@gmail.com", "name": "abcdef"}
            ]
        }
    }
    
    output = json.dumps(data)
    
    # write next to the script itself so the file matches
    # "${path.module}/api_result.json" on the Terraform side,
    # even when terraform is run from another directory
    result_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'api_result.json')
    with open(result_path, 'w') as f:
        f.write(output)
    
    # the external data source still expects a valid JSON object on stdout
    print('{}')
    

    Here, there are 2 things to note:

    1. we write the JSON to a dedicated file (here api_result.json)
    2. we return a valid JSON ({}) if the script succeeds, as expected by external data source docs
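    The contract quoted above is also easy to check mechanically. A hypothetical helper (not part of the answer, just a sanity check) could validate a script's stdout before wiring it into Terraform:

```python
import json

def is_valid_external_output(stdout_text):
    """Check the external data source contract: stdout must be a
    JSON object whose values are all strings."""
    try:
        obj = json.loads(stdout_text)
    except ValueError:
        # not valid JSON at all
        return False
    return isinstance(obj, dict) and all(isinstance(v, str) for v in obj.values())
```

    With this check, `'{}'` and `'{"a": "b"}'` pass, while nested structures like the original script's output do not.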

    Once the script is modified, you can use the local_file data source to read that file. You need an explicit dependency on data.external.python_script to ensure the file is read only after the script has run. From there, you can load the contents of api_result.json into a local variable (json_data) with jsondecode and apply your own logic. Here's the final result (with explanatory comments):

    // python script api.py is writing the result to api_result.json
    data "external" "python_script" {
      program = ["python", "${path.module}/api.py"]
    }
    
    // this is the output of api.py
    data "local_file" "python_output" {
      filename = "${path.module}/api_result.json"
    
      // explicit dependency: the file exists and is up-to-date only after api.py has run
      depends_on = [data.external.python_script]
    }
    
    locals {
      // json_data is based on local_file content
      json_data = jsondecode(data.local_file.python_output.content)
    
      unique_mails = distinct(flatten([
        for team_key, team_data in local.json_data : [
          for member in team_data["members"] : member["email"]
        ]
      ]))
    }
    
    resource "user" "user" {
      for_each = { for email in local.unique_mails : email => email }
      name     = each.key
      role     = "user"
    }
    

    This will create one user resource per unique email. If your api.py script:

    • adds a new user: a new user resource will be created for that user
    • removes an existing user: the corresponding user resource will be destroyed

    If you'd like to test that behavior, you can define a local_file resource:

    // will create one file per email
    // filename = email, and content is empty as we don't care
    resource "local_file" "test" {
      for_each = { for email in local.unique_mails : email => email }
    
      filename = "${path.module}/${each.key}"
      content  = ""
    }
    

    And check that there are as many files created in your module directory as there are unique emails.

    Since api_result.json may contain sensitive data (emails), I strongly suggest adding it to .gitignore. terraform apply will recreate the file anyway, so as long as the user or CI/CD process running the terraform command can run the script (e.g. it has Python installed), the file will be regenerated.