CI cache unable to pass provider configurations through stages

I have the usual stages in a Terraform CI pipeline, i.e. init > validate > plan, etc. The first stage, init, always works. But when the pipeline reaches the next stage, e.g. validate, I get the following error:

$ terraform validate
╷
│ Error: Missing required provider
│ 
│ This configuration requires provider registry.terraform.io/datadog/datadog,
│ but that provider isn't available. You may be able to install it
│ automatically by running:
│ terraform init
╵

Now if I run init in the same stage as validate, it works fine. So a workaround is either to put all commands in one stage or to run init in every stage, neither of which is ideal.
If I log in to the runner server and browse the .terraform directory manually, the provider executable is there. But if I run terraform validate from the shell, it fails with the same error; if I run init and then validate, it works.
The .terraform directory and its contents are the same before and after init: same files, only updated creation timestamps.
If I go back to GitLab and re-run the validate stage, it fails. If I then return to the server shell and run terraform validate, it fails again, with no obvious changes in directory contents or permissions. Running init again makes it work again.

As far as I understand, the only difference between these stages is the cache zip/unzip, since the .terraform folder is passed along as a cache.

In the job console I can see the following message:

Checking cache for terraform...
Runtime platform arch=amd64 os=linux pid=3798191 revision=90daeee0 version=14.7.0
No URL provided, cache will not be downloaded from shared cache server. Instead a local version of cache will be extracted. 
Successfully extracted cache

Another thing to note: although the downloaded modules are also present in .terraform, the error is never about modules, only about providers. I guess it has something to do with the executable files?

config.toml:

[[runners]]
  name = "cicd_terraform"
  url = "***"
  token = "****"
  executor = "shell"
  [runners.custom_build_dir]

Earlier there was an empty runners.cache section, but the behavior was the same, so I removed it. I want the runner to use a local directory as the cache.

.gitlab-ci.yml:

cache:
  key: terraform
  paths:
    - .terraform


before_script:
  - echo -e "credentials \"$CI_SERVER_HOST\" {\n  token = \"$CI_JOB_TOKEN\"\n}" > $TF_CLI_CONFIG_FILE
  - cd ${TF_ROOT}
  - export TF_LOG_CORE=TRACE
  - export TF_LOG_PATH=${TF_ROOT}/terraform_logs.txt
  - ls -al
  - ls -al ${TF_ROOT}
  - echo "$TF_ROOT"


stages:
  - initialize
  - validate

init:
  stage: initialize
  script:
    - terraform -v
    - terraform init -backend-config="*****" -backend-config="*****.tfstate" -backend-config="*****-1" -backend-config="access_key=${AWS_ACCESS_KEY_ID}" -backend-config="secret_key=${AWS_SECRET_ACCESS_KEY}" -input=false -no-color

validate:
  stage: validate
  script:
    - terraform validate

ls -al ${TF_ROOT}/.terraform/providers/registry.terraform.io/datadog/datadog/2.24.0/linux_amd64

total 29256
drwxr-xr-x 2 gitlab-runner gitlab-runner     4096 Feb 19 01:36 .
drwxr-xr-x 3 gitlab-runner gitlab-runner     4096 Feb 19 01:36 ..
-rw-r--r-- 1 gitlab-runner gitlab-runner    48216 Feb 19 01:36 CHANGELOG.md
-rw-r--r-- 1 gitlab-runner gitlab-runner    16725 Feb 19 01:36 LICENSE
-rw-r--r-- 1 gitlab-runner gitlab-runner    12450 Feb 19 01:36 LICENSE-3rdparty.csv
-rw-r--r-- 1 gitlab-runner gitlab-runner     1524 Feb 19 01:36 README.md
-rwxr-xr-x 1 gitlab-runner gitlab-runner 29859840 Feb 19 01:36 terraform-provider-datadog_v2.24.0

Any idea what I am doing wrong?

Solution

Two things must be true in order for Terraform to be able to find a particular provider:

The .terraform.lock.hcl file must specify a selected version for that provider, and the allowed plugin checksums for that version.
There must be a package for that selected version in .terraform/providers -- the local plugin cache directory -- which matches one of the checksums.
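
A quick way to see which of the two conditions is missing on the runner is a small shell check like the sketch below. The function name check_tf_provider_inputs is made up for illustration. It only tests that the lock file and the provider cache directory exist; it does not compare checksums, which Terraform does itself.

```shell
# Hypothetical helper (not part of Terraform): reports whether the two
# inputs Terraform needs for provider lookup exist under a root directory.
# It checks presence only, not whether the checksums match.
check_tf_provider_inputs() {
  root="${1:-.}"
  missing=0
  if [ -f "$root/.terraform.lock.hcl" ]; then
    echo "lock file: present"
  else
    echo "lock file: MISSING"
    missing=1
  fi
  if [ -d "$root/.terraform/providers" ]; then
    echo "provider cache: present"
  else
    echo "provider cache: MISSING"
    missing=1
  fi
  # Non-zero exit status if either input is absent.
  return "$missing"
}
```

Calling check_tf_provider_inputs "${TF_ROOT}" at the start of the validate job should show whether the lock file survived the move between stages.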

From what you shared, it seems the second of these is already handled, because you pass the cache between steps using features of your CI system.

For the first to be true, though, you'll need to run terraform init on your development machine to generate the .terraform.lock.hcl file, and then check that file into version control as part of your configuration. That will hopefully make your CI system put it in the right place as a normal part of checking out the source code.

When running terraform init in a non-interactive environment like this, I would suggest adding the -lockfile=readonly option, which causes Terraform to fail with an error if the lock file has become inconsistent with the rest of the configuration. That allows your CI system to catch the problem early, in the first step, and return an explicit error about it. In your current workflow, terraform init can update the lock file itself, but that update doesn't carry forward to the other steps, which causes strange downstream errors.
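
As a rough sketch, the init step could be wrapped like this. The function name tf_init_ci is hypothetical; -input=false and -lockfile=readonly are existing terraform init options, and any backend options you already use are passed through unchanged.

```shell
# Hypothetical wrapper for the init stage (the name is an assumption).
# With -lockfile=readonly, terraform init reports an error instead of
# updating .terraform.lock.hcl when the lock file would need to change.
tf_init_ci() {
  terraform init -input=false -lockfile=readonly "$@"
}
```

In the init job's script this would be called with the existing options, for example tf_init_ci -backend-config="..." -no-color, so a missing or stale lock file fails the first stage rather than surfacing later in validate.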