Tags: zip, databricks, azure-databricks

Databricks reading from a zip file


I have mounted an Azure Blob Storage container in the Azure Databricks workspace filestore. The mounted container holds zipped files with CSV files in them. I mounted the data using dbutils:

dbutils.fs.mount(
    source = f"wasbs://{container}@{storage_account}.blob.core.windows.net",
    mount_point = mountPoint,
    extra_configs = {f"fs.azure.sas.{container}.{storage_account}.blob.core.windows.net": sasKey}
)

I then followed this tutorial: https://learn.microsoft.com/en-us/azure/databricks/_static/notebooks/zip-files-python.html

However, the shell command from that notebook does not work, probably because the data does not reside in DBFS but in the mounted blob storage. It gives the error:

unzip:  cannot find or open /mnt/azureblobstorage/file.zip, /mnt/azureblobstorage/Deed/file.zip.zip or /mnt/azureblobstorage/file.zip.ZIP.

What would be the best way to read the zipped files and write into a delta table?


Solution

  • The "unzip" utility on Linux does work. I will walk through the commands so that you can build a dynamic notebook to extract zip files. In this example, the zip file is in ADLS Gen2 and the extracted files are written there as well.

    Because we are using a shell command, it runs on a single JVM (the driver node), not across all the worker nodes. Therefore there is no parallelization.


    We can see that I have the storage mounted.


    The S&P 500 dataset contains the top 505 stocks and their price data for 2013. All of these files are packed into a single Windows zip file.


    Cell 2 defines widgets (parameters) and retrieves their values. This only needs to be done once. The calling program can pass the correct parameters to the notebook.
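    The widget pattern from cell 2 might look like the sketch below. The widget names (`file_path`, `file_name`) and their defaults are assumptions, and `dbutils` exists only on a Databricks cluster, so a fallback is included for running elsewhere.

```python
# Sketch of cell 2 (assumed widget names): define notebook parameters
# and read their values. dbutils is only defined inside Databricks,
# so we fall back to the defaults when running outside a cluster.
try:
    dbutils.widgets.text("file_path", "/mnt/azureblobstorage")
    dbutils.widgets.text("file_name", "archive.zip")
    file_path = dbutils.widgets.get("file_path")
    file_name = dbutils.widgets.get("file_name")
except NameError:
    # Not running on Databricks; use the defaults directly.
    file_path = "/mnt/azureblobstorage"
    file_name = "archive.zip"
```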

    Cell 3 creates OS (shell) environment variables for both the file path and the file name.
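    Cell 3 can export those values through `os.environ`, which makes them visible to a later `%sh` cell as `$FILE_PATH` and `$FILE_NAME`. The variable names here are assumptions:

```python
import os

# Sketch of cell 3 (assumed variable names): export the path and the
# file name so a subsequent %sh cell can reference $FILE_PATH/$FILE_NAME.
file_path = "/mnt/azureblobstorage"   # would come from the cell 2 widgets
file_name = "archive.zip"

os.environ["FILE_PATH"] = file_path
os.environ["FILE_NAME"] = file_name
```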


    In cell 4, we use a shell call to the unzip program to overwrite the existing directory/files with the contents of the zip file. If there is no existing directory, we simply get the uncompressed files.


    Last but not least, the files do appear in the sub-directory as instructed.
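    If the `unzip` binary is unavailable, the same overwrite-extract step can be done from Python with the standard library's `zipfile` module. This self-contained sketch builds a small zip and extracts it into a sub-directory, mirroring the cell 4 behavior (all paths and the sample CSV content are illustrative):

```python
import os
import tempfile
import zipfile

# Build a tiny zip with one CSV inside it (stand-in for the mounted archive).
work = tempfile.mkdtemp()
zip_path = os.path.join(work, "file.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("AAPL.csv", "date,close\n2013-01-02,78.43\n")

# Extract into a sub-directory, overwriting any existing files --
# the same effect as `unzip -o $FILE_PATH/$FILE_NAME -d <dir>`.
out_dir = os.path.join(work, "extracted")
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(out_dir)

extracted = os.listdir(out_dir)
```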

    To recap, it is possible to unzip files with Databricks (Spark) using either remote storage or already mounted (local) storage. Use the techniques above to accomplish this task in a notebook that can be called repeatedly.
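    To finish the original question (writing the extracted CSVs to a Delta table), a helper along these lines could run on the cluster. The directory, table name, and reader options are assumptions:

```python
# Sketch (assumed path and table name): read the extracted CSV files
# with Spark and write them out as a Delta table. Requires a Spark
# runtime with Delta Lake, e.g. a Databricks cluster.
def csvs_to_delta(spark, csv_dir="/mnt/azureblobstorage/extracted",
                  table_name="sp500_prices"):
    df = (spark.read
          .option("header", True)        # first row holds column names
          .option("inferSchema", True)   # let Spark guess column types
          .csv(csv_dir))
    (df.write
       .format("delta")
       .mode("overwrite")                # replace the table on re-runs
       .saveAsTable(table_name))
    return df.count()
```

    On Databricks, `spark` is predefined, so calling `csvs_to_delta(spark)` would load every extracted CSV in the directory and materialize the Delta table.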