python, azure-blob-storage, azure-databricks, delta-lake

Convert an Azure storage account into Databricks Delta tables


I just linked an Azure storage account (ADLS Gen2) with its underlying containers to my Databricks environment. Inside the storage account are two containers, each with some subdirectories. Inside the folders are .csv files.

I have connected an Azure service principal with Storage Blob Data Contributor access to the storage account in Databricks, so I can read and write to the storage account.
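For reference, my auth setup looks roughly like this (a sketch: the storage account name, application ID, secret scope, and tenant ID below are placeholders, not the real values):

    # Configure Spark to authenticate to ADLS Gen2 via the service principal (OAuth).
    account = "<storage-account>"
    spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net",
                   "<application-id>")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net",
                   dbutils.secrets.get(scope="<scope>", key="<client-secret-key>"))
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")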

I am trying to figure out the best way to convert the existing storage account into a Delta Lake: register the tables in the metastore and convert the .csv files to Parquet-backed Delta tables.

What is the easiest way to do that?

My naive approach as a beginner might be (a consolidated sketch follows the list):

  1. Read the folder using spark.read.format("csv").load("{container}@{storage}..../directory")

  2. Write to a new folder with a similar name (so if the folder is directory, write it to directory_parquet) using df.write.format("delta").save("{container}@{storage}.../directory_parquet")

I'm not sure about the last steps. This would create a new folder with a new set of files, and I do get Parquet files, but it wouldn't be a table in Databricks that shows up in the Hive metastore.

Alternatively, I can use df.write.format("delta").saveAsTable("tablename"), but that doesn't create the table in the storage account; it creates it inside the Databricks file system (DBFS), although it does show up in the Hive metastore.

  3. Delete the existing .csv files if desired (or leave the data duplicated)
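Put together, my naive approach would look something like this (the abfss:// URI and the container/folder names are placeholders for the real paths):

    # Step 1: read the raw CSV files from a source folder (placeholder path).
    src = "abfss://mycontainer@mystorage.dfs.core.windows.net/directory"
    df = (spark.read.format("csv")
          .option("header", "true")        # assuming the files have a header row
          .option("inferSchema", "true")
          .load(src))

    # Step 2: write the same data back out in Delta format next to the source.
    (df.write.format("delta")
       .mode("overwrite")
       .save(src + "_parquet"))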

Preferably this can be done in a Databricks notebook, using Python if possible, or Scala/SQL if necessary.

As a fallback, if the effort to do this properly is monumental, I would settle for just converting to Parquet and registering each subfolder in the Hive metastore as database = containerName, table = subdirectoryName.

The folder structure is pretty flat at the moment, so only rootcontainer/subfolder deep.


Solution

  • Perhaps an external table is what you're looking for:

    df.write.format("delta").option("path", "some/external/path").saveAsTable("tablename") 
    

    This post has more info on external tables vs managed tables.
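  • Applied to the layout described in the question (database = containerName, table = subdirectoryName), a rough sketch might look like the following. This is only a sketch under assumptions: the account and container names are placeholders, the _delta suffix is arbitrary, subfolders are enumerated with dbutils.fs.ls since the structure is only one level deep, and the container/folder names must be valid SQL identifiers to serve as database/table names.

        # Register every first-level subfolder of each container as an
        # external Delta table (database = container, table = subfolder).
        storage = "mystorage"                    # placeholder account name
        containers = ["container1", "container2"]  # placeholder container names

        for container in containers:
            root = f"abfss://{container}@{storage}.dfs.core.windows.net/"
            spark.sql(f"CREATE DATABASE IF NOT EXISTS {container}")

            for entry in dbutils.fs.ls(root):
                if not entry.isDir():
                    continue
                table = entry.name.strip("/")
                df = (spark.read.format("csv")
                        .option("header", "true")       # assuming header rows
                        .option("inferSchema", "true")
                        .load(entry.path))
                # The Delta files land next to the source folder, and the
                # table is registered as external (it points at that path).
                (df.write.format("delta")
                   .mode("overwrite")
                   .option("path", root + table + "_delta")
                   .saveAsTable(f"{container}.{table}"))

    Once the tables are registered, dropping them removes only the metastore entries, not the Delta files in the storage account, which is the key difference from managed tables.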