databricks, azure-databricks, databricks-unity-catalog

Show objects dependent on an external location in Unity Catalog


While working with Databricks, I needed to remove an external location.

I ran SHOW EXTERNAL LOCATIONS with this result:

[screenshot of the SHOW EXTERNAL LOCATIONS output]

After this I ran DROP EXTERNAL LOCATION gold_prd, receiving the error:

Storage Location has dependent catalogs, schemas, managed or external tables/volumes; You may use force option to delete it but the managed storage data under this location cannot be purged by Unity Catalog anymore.
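
For reference, the "force option" the error mentions would look roughly like this (a minimal sketch, assuming the FORCE keyword of DROP EXTERNAL LOCATION):

    %python
    # Force-drop the external location: the registration is removed even if
    # dependents exist, but managed data under this location can no longer be
    # purged by Unity Catalog (per the error above).
    spark.sql("DROP EXTERNAL LOCATION gold_prd FORCE")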

I have no catalogs, schemas, or other objects under this location, but before forcing the operation, is there a way to list all objects that depend on this location?


Solution

  • This is the ugly, slow, and dirty solution I wrote to check for dependent objects; I think it could be extended to also check for volumes (see the sketch after the code below).

    %python
    import pandas as pd
    
    check_locations = []
    
    query = "SHOW EXTERNAL LOCATIONS"
    result = spark.sql(query)
    for row in result.collect():
      check_locations.append(row["url"])
    
    #######################
    #      catalogs       #
    #######################
    
    query = "SHOW CATALOGS"
    result = spark.sql(query)
    for row in result.collect():
      catalog_name = row["catalog"]
      query = f"DESCRIBE CATALOG EXTENDED {catalog_name}"
      result = spark.sql(query)
      df = pd.DataFrame(result.collect()).transpose()
      if not df.empty:
        df.columns = df.iloc[0]
        df = df[1:]
        if 'Storage Location' in df.columns:
          location = df.iloc[0]["Storage Location"]
          # an object depends on the external location if its path contains the location URL
          for check_location in check_locations:
            if check_location in location:
              print(catalog_name)
              print(f"\t{location}")
    
      #######################
      #       schemas       #
      #######################
      # still inside the catalogs loop
    
      query = f"SHOW SCHEMAS IN {catalog_name}"
      result = spark.sql(query)
      for row in result.collect():
        schema_name = row["databaseName"]
        query = f"DESCRIBE SCHEMA {catalog_name}.{schema_name}"
        result = spark.sql(query)
        df = pd.DataFrame(result.collect()).transpose()
        if not df.empty:
          df.columns = df.iloc[0]
          df = df[1:]
          if 'Location' in df.columns:
            location = df.iloc[0]["Location"]
            for check_location in check_locations:
              if check_location in location:
                print(catalog_name)
                print(f'\t{schema_name}')
                print(f"\t\t{location}")
            
        #######################
        #       tables        #
        #######################
        # still inside the schemas loop
    
        # skip the system schema; note DESCRIBE DETAIL below only works for Delta tables
        if schema_name != 'information_schema':
          query = f"SHOW TABLES IN {catalog_name}.{schema_name}"
          result = spark.sql(query)
          for row in result.collect():
            table_name = row["tableName"]
            query = f"DESCRIBE detail  {catalog_name}.{schema_name}.{table_name}"
            result = spark.sql(query)
            for row in spark.sql(query).collect():
              print
              location = row["location"]
              for check_location in check_locations:
                if check_location in location:
                  print(catalog_name)
                  print(f'\t{schema_name}')
                  print(f'\t\t{table_name}')
                  print(f"\t\t\t{location}")