databricks, azure-databricks, databricks-unity-catalog

Show objects dependent on an external location in Unity Catalog


While working with Databricks, I needed to remove an external location.

I ran SHOW EXTERNAL LOCATIONS with this result:

[screenshot of the SHOW EXTERNAL LOCATIONS output]

After this I ran DROP EXTERNAL LOCATION gold_prd, receiving the error:

Storage Location has dependent catalogs, schemas, managed or external tables/volumes; You may use force option to delete it but the managed storage data under this location cannot be purged by Unity Catalog anymore.
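
For reference, the "force option" the error mentions would look roughly like this (a minimal sketch, assuming the FORCE keyword of DROP EXTERNAL LOCATION):

    %python
    # Force-drop the external location: the registration is removed even if
    # dependents exist, but managed data under this location can no longer be
    # purged by Unity Catalog (per the error above).
    spark.sql("DROP EXTERNAL LOCATION gold_prd FORCE")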

I have no catalogs, schemas, or other objects under this location, but before forcing the operation, is there a way to list all objects that depend on this location?


Solution

  • This is the ugly, slow, and dirty solution I wrote to check for dependent objects; I think it could be extended to also check for volumes (see the sketch after the code below).

    %python
    import pandas as pd
    
    check_locations = []
    
    query = "SHOW EXTERNAL LOCATIONS"
    result = spark.sql(query)
    for row in result.collect():
      check_locations.append(row["url"])
    
    #######################
    #      catalogs       #
    #######################
    
    query = "SHOW CATALOGS"
    result = spark.sql(query)
    for row in result.collect():
      catalog_name = row["catalog"]
      query = f"DESCRIBE CATALOG EXTENDED {catalog_name}"
      result = spark.sql(query)
      df = pd.DataFrame(result.collect()).transpose()
      if not df.empty:
        df.columns = df.iloc[0]
        df = df[1:]
        if 'Storage Location' in df.columns:
          location = df.iloc[0]["Storage Location"]
          # an object depends on the external location if its path contains the location URL
          for check_location in check_locations:
            if check_location in location:
              print(catalog_name)
              print(f"\t{location}")
    
      #######################
      #       schemas       #
      #######################
      # still inside the catalogs loop
    
      query = f"SHOW SCHEMAS IN {catalog_name}"
      result = spark.sql(query)
      for row in result.collect():
        schema_name = row["databaseName"]
        query = f"DESCRIBE SCHEMA {catalog_name}.{schema_name}"
        result = spark.sql(query)
        df = pd.DataFrame(result.collect()).transpose()
        if not df.empty:
          df.columns = df.iloc[0]
          df = df[1:]
          if 'Location' in df.columns:
            location = df.iloc[0]["Location"]
            for check_location in check_locations:
              if check_location in location:
                print(catalog_name)
                print(f'\t{schema_name}')
                print(f"\t\t{location}")
            
        #######################
        #       tables        #
        #######################
        # still inside the schemas loop
    
        # skip the system schema; note DESCRIBE DETAIL below only works for Delta tables
        if schema_name != 'information_schema':
          query = f"SHOW TABLES IN {catalog_name}.{schema_name}"
          result = spark.sql(query)
          for row in result.collect():
            table_name = row["tableName"]
            query = f"DESCRIBE detail  {catalog_name}.{schema_name}.{table_name}"
            result = spark.sql(query)
            for row in spark.sql(query).collect():
              print
              location = row["location"]
              for check_location in check_locations:
                if check_location in location:
                  print(catalog_name)
                  print(f'\t{schema_name}')
                  print(f'\t\t{table_name}')
                  print(f"\t\t\t{location}")