pyspark · databricks · azure-databricks · databricks-sql · databricks-unity-catalog

Databricks shared access mode limitations


I have an interactive cluster where attached notebooks must be able to read/write data from Unity Catalog as well as from DBFS and ADLS. I've set up this cluster with USER_ISOLATION (shared mode). When reading data from dbfs or abfss with Spark, I get this error:

SparkConnectGrpcException: (org.apache.spark.SparkSecurityException) [INSUFFICIENT_PERMISSIONS]
Insufficient privileges: User does not have permission SELECT on any file.

I've granted permissions to my user, but I'm still getting this error:

# Table ACL grant: SELECT on ANY FILE for the given principal
resource "databricks_sql_permissions" "any_file" {
  any_file = true
  catalog  = true

  privilege_assignments {
    principal  = "..."
    privileges = ["SELECT"]
  }
}

The documentation says:

Cannot use R, RDD APIs, or clients that directly read the data from cloud storage, such as DBUtils.

Can I assume this is the root cause of the INSUFFICIENT_PERMISSIONS error?

Although I understand the reasons behind the decision to block direct file-system access, we still need access to both while the migration to Unity Catalog is in progress.

Cluster configuration:

{
    "num_workers": 1,
    "cluster_name": "...",
    "spark_version": "14.0.x-scala2.12",
    "spark_conf": {
        "spark.hadoop.fs.azure.account.oauth2.client.endpoint": "...",
        "spark.hadoop.fs.azure.account.auth.type": "...",
        "spark.hadoop.fs.azure.account.oauth.provider.type": "...",
        "spark.hadoop.fs.azure.account.oauth2.client.id": "...",
        "spark.hadoop.fs.azure.account.oauth2.client.secret": "..."
    },
    "node_type_id": "...",
    "driver_node_type_id": "...",
    "ssh_public_keys": [],
    "spark_env_vars": {
        "cluster_type": "all-purpose"
    },
    "init_scripts": [],
    "enable_local_disk_encryption": false,
    "data_security_mode": "USER_ISOLATION",
    "cluster_id": "..."
}

Is there any workaround, solution, or different approach? Changing the cluster configuration to SINGLE_USER is not something we want at this moment, as the same setup is shared by multiple users/notebooks.


Solution

  • You're correct about the listed limitations. But when you're using Unity Catalog, especially with shared clusters, you need to think a bit differently than before. UC on shared clusters provides strong user isolation, preventing access to data without the necessary access controls (DBFS has no access control at all, and ADLS provides access control only at the file level).

    You will need to change your approach:

    • For accessing data on ADLS via abfss, create external locations and grant the corresponding permissions to your users.
    • Instead of DBFS (which isn't recommended for non-temporary data anyway), give users Unity Catalog Volumes - they can hold unstructured data, config files, libraries, etc. You will need to migrate data from DBFS to UC Volumes using single-user clusters, but that should be a one-time activity. A Terraform sketch covering both steps follows below.
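
    Here's a minimal Terraform sketch of that setup, in the same style as the snippet in the question. All names are placeholders, and the access connector ID, location URL, and catalog/schema names are assumptions you'd replace with your own:

    # Storage credential backed by an Azure Databricks access connector
    # (managed identity); UC uses this instead of spark_conf OAuth secrets.
    resource "databricks_storage_credential" "adls" {
      name = "adls-credential"
      azure_managed_identity {
        access_connector_id = "..." # your Azure access connector resource ID
      }
    }

    # Register the ADLS path as a UC external location.
    resource "databricks_external_location" "data" {
      name            = "data"
      url             = "abfss://container@account.dfs.core.windows.net/path"
      credential_name = databricks_storage_credential.adls.name
    }

    # Grant UC file privileges on that location instead of ANY FILE.
    resource "databricks_grants" "data" {
      external_location = databricks_external_location.data.id
      grant {
        principal  = "..."
        privileges = ["READ_FILES", "WRITE_FILES"]
      }
    }

    # A managed UC volume as the DBFS replacement for unstructured files.
    resource "databricks_volume" "files" {
      name         = "files"
      catalog_name = "main"    # assumed catalog
      schema_name  = "default" # assumed schema
      volume_type  = "MANAGED"
    }

    resource "databricks_grants" "files" {
      volume = databricks_volume.files.id
      grant {
        principal  = "..."
        privileges = ["READ_VOLUME", "WRITE_VOLUME"]
      }
    }

    After applying, notebooks on the shared cluster can read the abfss URL directly with spark.read, and reference volume files via the /Volumes/main/default/files/... path instead of dbfs:/ paths.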

    You can read about the latest features of UC shared clusters in this blog post.

    P.S. You may also look at the UCX tool from Databricks Labs - it's intended for assisted, automated migration to Unity Catalog.