Tags: python, pyspark, hive, azure-databricks, hive-metastore

Create Hive Metastore Tables from Multiple Sub Folders on Blob storage


I have Azure Delta tables on blob storage with the folder structure below.

Lvl1/Lvl2/db1/Table1
Lvl1/Lvl2/db1/Table2
Lvl1/Lvl2/db1/Table3
Lvl1/Lvl2/db2/Table1
Lvl1/Lvl2/db2/Table2
Lvl1/Lvl2/db2/Table3
Lvl1/Lvl2/db3/Table1

I want to create Hive metastore table links for all of the above tables under a single database, so I created the database with the following command:

spark.sql(f'CREATE DATABASE IF NOT EXISTS parentdb')

I am currently linking the tables one at a time with the following command, where `tablename` is generated dynamically:

spark.sql(f'CREATE TABLE IF NOT EXISTS parent_db.{tablename} USING DELTA LOCATION \'{path}\'')

I want Spark to read all of the above table locations and create the tables, with those table names, within the single database created above, so that the Hive metastore, when browsed from the Databricks Data tab, looks like this:

parent_db --> db1_table1
              db1_table2
              db1_table3
              db2_table1
              db2_table2
              db3_table1
              ...

I can generate the dynamic table names with db1, db2, db3, …; the only issue is reading all the tables from the Delta location (i.e. traversing all subfolders within the root folder) and creating the tables.

So all I want is to loop through the folders and create links for all the tables under the single database. Any help with this would be appreciated.


Solution

  • I have reproduced the above and was able to get the tables registered in a Hive metastore database.

    First, I created the same Delta tables in my blob storage, with the same paths, under my mount location.

    Then use the code below to get the list of Delta table paths and build lists of database and table names:

    import glob

    # List every <db>/<table> folder two levels below the mount point and
    # strip the leading '/dbfs' (5 characters) so Spark sees the mount path.
    paths = [x[5:] for x in glob.iglob('/dbfs/mnt/data/Lvl1/Lvl2/*/*')]
    print("paths list : ", paths)

    # The second-to-last path component is the database folder, e.g. 'db1'.
    dbs_list = [x.split('/')[-2] for x in paths]
    print("dbs list : ", dbs_list)

    # The last path component is the table folder, e.g. 'Table1'.
    table_list = [x.split('/')[-1] for x in paths]
    print("table list : ", table_list)
    

    Then use the code below to create the tables in the Hive metastore:

    spark.sql('CREATE DATABASE IF NOT EXISTS parentdb2')

    # Register each Delta folder as an external table named <db>_<table>.
    for path, db, table in zip(paths, dbs_list, table_list):
        table_name = f'{db}_{table}'
        spark.sql(f"CREATE TABLE IF NOT EXISTS parentdb2.{table_name} USING DELTA LOCATION '{path}'")
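
    As a quick sanity check (a hedged sketch; parentdb2.db1_table1 is just one of the generated names, assuming Hive's usual lower-casing of identifiers), list what was registered and confirm a table's external location:

    # List the registered tables and spot-check one table's location.
    spark.sql('SHOW TABLES IN parentdb2').show()
    spark.sql('DESCRIBE DETAIL parentdb2.db1_table1').select('location').show(truncate=False)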
    

    My execution:

    [screenshots of the notebook cells running the code above]

    Result:

    [screenshot of the parentdb2 tables in the Data tab]