azure-active-directory terraform azure-databricks azure-data-lake-gen2 terraform-provider-databricks

Mounting ADLS gen2 with AAD passthrough in Azure Databricks with Terraform


I am trying to mount my ADLS gen2 storage containers into DBFS, with Azure Active Directory passthrough, using the Databricks Terraform provider. I'm following the instructions here and here, but I'm getting the following error when Terraform attempts to deploy the mount resource:

Error: Could not find ADLS Gen2 Token

My Terraform code looks like the below (it's very similar to the example in the provider documentation) and I am deploying with an Azure Service Principal, which creates the Databricks workspace in the same module:

provider "databricks" {
  host                        = azurerm_databricks_workspace.this.workspace_url
  azure_workspace_resource_id = azurerm_databricks_workspace.this.id
}

data "databricks_node_type" "smallest" {
  local_disk = true

  depends_on = [azurerm_databricks_workspace.this]
}

data "databricks_spark_version" "latest" {
  depends_on = [azurerm_databricks_workspace.this]
}

resource "databricks_cluster" "passthrough" {
  cluster_name            = "terraform-mount"
  spark_version           = data.databricks_spark_version.latest.id
  node_type_id            = data.databricks_node_type.smallest.id
  autotermination_minutes = 10
  num_workers             = 1

  spark_conf = {
    "spark.databricks.cluster.profile"                = "serverless",
    "spark.databricks.repl.allowedLanguages"          = "python,sql",
    "spark.databricks.passthrough.enabled"            = "true",
    "spark.databricks.pyspark.enableProcessIsolation" = "true"
  }

  custom_tags = {
    "ResourceClass" = "Serverless"
  }
}

resource "databricks_mount" "mount" {
  for_each = toset(var.storage_containers)

  name       = each.value
  cluster_id = databricks_cluster.passthrough.id
  uri        = "abfss://${each.value}@${var.sa_name}.dfs.core.windows.net"

  extra_configs = {
    "fs.azure.account.auth.type"                   = "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class" = "{{sparkconf/spark.databricks.passthrough.adls.gen2.tokenProviderClassName}}",
  }

  depends_on = [
    azurerm_storage_container.data
  ]
}

(For clarity's sake, azurerm_storage_container.data is a set of storage containers with names from var.storage_containers, which are created in the azurerm_storage_account with name var.sa_name; hence the URI.)

I feel like this error is due to a fundamental misunderstanding on my part, rather than a simple omission. My underlying assumption is that I can mount storage containers for the workspace, with AAD passthrough, as a convenience when I deploy the infrastructure in its entirety. That is, whenever users come to use the workspace, any new passthrough cluster will be able to use these mounts with zero setup.

I can mount storage containers manually by following the AAD passthrough instructions: spin up a high-concurrency cluster with passthrough enabled, then mount with dbutils.fs.mount. I do this while logged in to the Databricks workspace with my user identity (rather than the Service Principal). Is this the root of the problem? Is a Service Principal not appropriate for this task?

(Interestingly, the Databricks runtime gives me exactly the same error if I try to access files on the manually created mount using a cluster without passthrough enabled.)


Solution

  • Yes, the problem arises from using a service principal for that operation. The Azure documentation for credential passthrough says:

    You cannot use a cluster configured with ADLS credentials, for example, service principal credentials, with credential passthrough.
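
If the deployment has to stay with the service principal, one possible alternative is to give up passthrough for the mount and use the service principal's own OAuth credentials through the `abfs` block of `databricks_mount`. The following is only a sketch under stated assumptions, not a drop-in fix: `var.tenant_id`, `var.client_id`, `var.client_secret`, the secret scope, and the secret key name are hypothetical and do not appear in the question, and the service principal is assumed to already hold a suitable role on the storage account (for example Storage Blob Data Contributor). Note that this changes the access model: everyone who uses the mount reads and writes with the service principal's permissions, not their own AAD identity.

```hcl
# Hypothetical inputs: identifiers and secret of the deploying service principal.
variable "tenant_id" {
  type = string
}

variable "client_id" {
  type = string
}

variable "client_secret" {
  type      = string
  sensitive = true
}

# Hypothetical secret scope holding the service principal's client secret.
resource "databricks_secret_scope" "mount" {
  name = "terraform-mount"
}

resource "databricks_secret" "sp_secret" {
  key          = "sp-client-secret"
  string_value = var.client_secret
  scope        = databricks_secret_scope.mount.name
}

# Mounts each container with the service principal's credentials.
# This is NOT a passthrough mount: access is evaluated against the
# service principal, not the calling user.
resource "databricks_mount" "sp_mount" {
  for_each = toset(var.storage_containers)

  name = each.value

  abfs {
    tenant_id              = var.tenant_id
    client_id              = var.client_id
    client_secret_scope    = databricks_secret_scope.mount.name
    client_secret_key      = databricks_secret.sp_secret.key
    container_name         = each.value
    storage_account_name   = var.sa_name
    initialize_file_system = false
  }

  depends_on = [
    azurerm_storage_container.data
  ]
}
```

If per-user access control through passthrough is the actual requirement, this sketch does not provide it. In that case the mount still has to be created under a user identity, as in the manual dbutils.fs.mount approach described in the question.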