Tags: amazon-web-services, terraform, databricks, terraform-provider-databricks

Error: Databricks API requires you to set `host` property


Related question: Terraform Databricks AWS instance profile - "authentication is not configured for provider"

After resolving the error in that question and proceeding, I started encountering the following error on several different operations (creating a Databricks instance profile, querying Terraform Databricks data sources such as databricks_current_user or databricks_spark_version, etc.):

Error: cannot create instance profile: Databricks API (/api/2.0/instance-profiles/add) requires you to set `host` property (or DATABRICKS_HOST env variable) to result of `databricks_mws_workspaces.this.workspace_url`. This error may happen if you're using provider in both normal and multiworkspace mode. Please refactor your code into different modules. Runnable example that we use for integration testing can be found in this repository at https://registry.terraform.io/providers/databrickslabs/databricks/latest/docs/guides/aws-workspace

I am able to create an instance profile manually in the Databricks workspace admin console, and I can create clusters and run notebooks in the workspace.

Relevant code:


main.tf:
module "create-workspace" {
  source = "./modules/create-workspace"

  env     = var.env
  region  = var.region
  databricks_host = var.databricks_host
  databricks_account_username = var.databricks_account_username
  databricks_account_password = var.databricks_account_password
  databricks_account_id = var.databricks_account_id
}

providers-main.tf:
terraform {
  required_version = ">= 1.1.0"

  required_providers {
    databricks = {
      source  = "databrickslabs/databricks"
      version = "0.4.4"
    }
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.49.0"
    }
  }
}

provider "aws" {
  region  = var.region
  profile = var.aws_profile
}

provider "databricks" {
  host  = var.databricks_host
  token = var.databricks_manually_created_workspace_token
}

modules/create-workspace/providers.tf:
terraform {
  required_version = ">= 1.1.0"

  required_providers {
    databricks = {
      source  = "databrickslabs/databricks"
      version = "0.4.4"
    }
    aws = {
      source  = "hashicorp/aws"
      version = ">= 3.49.0"
    }
  }
}

provider "aws" {
  region  = var.region
  profile = var.aws_profile
}

provider "databricks" {
  host  = var.databricks_host
  # token = var.databricks_manually_created_workspace_token - doesn't make a difference switching from username/password to token
  username = var.databricks_account_username
  password = var.databricks_account_password
  account_id = var.databricks_account_id
}

provider "databricks" {
  alias    = "mws"
  # host     = 
  username = var.databricks_account_username
  password = var.databricks_account_password
  account_id = var.databricks_account_id
}

modules/create-workspace/databricks-workspace.tf:
resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  account_id       = var.databricks_account_id
  role_arn         = aws_iam_role.cross_account_role.arn
  credentials_name = "${local.prefix}-creds"
  depends_on       = [aws_iam_role_policy.this]
}

resource "databricks_mws_workspaces" "this" {
  provider        = databricks.mws
  account_id      = var.databricks_account_id
  aws_region      = var.region
  workspace_name  = local.prefix
  deployment_name = local.prefix

  credentials_id           = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
  network_id               = databricks_mws_networks.this.network_id
}

modules/create-workspace/IAM.tf:
data "databricks_aws_assume_role_policy" "this" {
  external_id = var.databricks_account_id
}

resource "aws_iam_role" "cross_account_role" {
  name               = "${local.prefix}-crossaccount"
  assume_role_policy = data.databricks_aws_assume_role_policy.this.json
}

resource "time_sleep" "wait" {
  depends_on = [
  aws_iam_role.cross_account_role]
  create_duration = "10s"
}

data "databricks_aws_crossaccount_policy" "this" {}

resource "aws_iam_role_policy" "this" {
  name   = "${local.prefix}-policy"
  role   = aws_iam_role.cross_account_role.id
  policy = data.databricks_aws_crossaccount_policy.this.json
}

data "aws_iam_policy_document" "pass_role_for_s3_access" {
  statement {
    effect    = "Allow"
    actions   = ["iam:PassRole"]
    resources = [aws_iam_role.cross_account_role.arn]
  }
}

resource "aws_iam_policy" "pass_role_for_s3_access" {
  name   = "databricks-shared-pass-role-for-s3-access"
  path   = "/"
  policy = data.aws_iam_policy_document.pass_role_for_s3_access.json
}

resource "aws_iam_role_policy_attachment" "cross_account" {
  policy_arn = aws_iam_policy.pass_role_for_s3_access.arn
  role       = aws_iam_role.cross_account_role.name
}

resource "aws_iam_instance_profile" "shared" {
  name = "databricks-shared-instance-profile"
  role = aws_iam_role.cross_account_role.name
}

resource "databricks_instance_profile" "shared" {
  instance_profile_arn = aws_iam_instance_profile.shared.arn
  depends_on = [databricks_mws_workspaces.this]
}


Solution

  • In this case, the problem is that you need two Databricks providers:

    1. one for provisioning the Databricks workspace itself - it authenticates with account ID, username, and password
    2. one for provisioning resources inside the Databricks workspace - it authenticates with host & token

    One of these providers needs to be declared with an alias so Terraform can distinguish one from the other; the documentation for the Databricks provider shows how to do that. The problem is that Terraform applies changes in parallel as much as possible, because it doesn't know about dependencies between resources until you explicitly use depends_on, so it tries to create Databricks resources before it knows the host value of the Databricks workspace (even if the workspace has already been created).

    Unfortunately, it's not possible to put depends_on into a provider block, so the current recommendation for avoiding this problem is to split the code into several modules:

    1. a module that creates the Databricks workspace and returns host & token (see the output sketch just after this list)
    2. a module that creates Databricks objects, with a provider initialized from the received host/token
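
    As a sketch of what the first module's interface could look like (assuming the databricks_mws_workspaces resource is named this, as in the question, and a provider version whose databricks_mws_workspaces resource supports an inline token block; otherwise create a PAT manually and pass it in as a variable), its outputs might be:

    # hypothetical file: modules/create-workspace/outputs.tf
    output "databricks_host" {
      value = databricks_mws_workspaces.this.workspace_url
    }

    output "databricks_token" {
      # token[0].token_value assumes the workspace was created with an inline `token` block
      value     = databricks_mws_workspaces.this.token[0].token_value
      sensitive = true
    }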

    Also, the Terraform documentation recommends that providers not be configured inside modules - it's better to declare all providers with aliases in the top-level template and pass them to modules explicitly (see the example below). In that case, a module should contain only the declaration of its required providers, not their configuration.

    For example, top-level template could look like this:

    terraform {
      required_version = ">= 1.1.0"

      required_providers {
        databricks = {
          source  = "databrickslabs/databricks"
          version = "0.4.5"
        }
      }
    }
    
    provider "databricks" {
      host  = var.databricks_host
      token = var.token
    }
    
    provider "databricks" {
      alias    = "mws"
      host     = "https://accounts.cloud.databricks.com"
      username = var.databricks_account_username
      password = var.databricks_account_password
      account_id = var.databricks_account_id
    }
    
    module "workspace" {
      source    = "./workspace"
      providers = {
        databricks = databricks.workspace  
    }}
    
    
    module "databricks" {
      depends_on = [ module.workspace ]
      source    = "./databricks"
      # No provider block required as we're using default provider
    }
    

    and the module itself like this:

    terraform {
      required_version = ">= 1.1.0"

      required_providers {
        databricks = {
          source  = "databrickslabs/databricks"
          version = ">= 0.4.4"
        }
      }
    }
    
    resource "databricks_cluster" {
    ...
    }
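
    If the workspace module exposes host & token outputs as sketched earlier, the default provider in the top-level template can also read them directly from the module instead of from separate variables (a minimal variant, assuming the hypothetical databricks_host/databricks_token output names from the sketch above):

    provider "databricks" {
      host  = module.workspace.databricks_host
      token = module.workspace.databricks_token
    }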