python, zip, databricks, azure-databricks

Compress CSV to ZIP in DBFS (Databricks File System)


I'm trying to compress a CSV file, located in an Azure Data Lake, into a ZIP archive. The operation is done with Python code in Databricks, where I created a mount point that maps the Data Lake directly into DBFS.

This is my code:

import os
import zipfile 

csv_path = '/dbfs/mnt/<path>.csv'
zip_path = '/dbfs/mnt/<path>.zip'

with zipfile.ZipFile(zip_path, 'w') as zip:
    zip.write(csv_path)  # zipping the file

But I'm getting this error:

OSError: [Errno 95] Operation not supported

Is there any way to do this?

Thank you in advance.


Solution

  • No, it's not possible to do it the way you did. The main reason is that the local DBFS file API (the /dbfs FUSE mount) has limitations: it doesn't support the random writes that are required when creating a ZIP file (a short sketch after the code below reproduces this).

    The workaround is the following: write the ZIP file to the local disk of the driver node, and then use dbutils.fs.mv to move the file to DBFS, something like this:

    import os
    import zipfile

    csv_path = '/dbfs/mnt/<path>.csv'   # source CSV, read through the FUSE mount
    zip_path = 'dbfs:/mnt/<path>.zip'   # destination; dbutils.fs expects dbfs: URIs, not /dbfs paths
    local_path = '/tmp/my_file.zip'     # temporary ZIP on the driver's local disk

    # Create the archive on local disk, where random writes are supported.
    # ZIP_DEFLATED actually compresses; zipfile's default (ZIP_STORED) only stores.
    with zipfile.ZipFile(local_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname=os.path.basename(csv_path))  # don't embed the full /dbfs path
    dbutils.fs.mv(f"file:{local_path}", zip_path)  # move the finished file onto DBFS
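
    The limitation can be reproduced without zipfile: the FUSE mount accepts sequential writes but rejects seeking back and overwriting, which is exactly what zipfile does when it finalizes an entry's header. A minimal sketch, reusing the placeholder mount path (probe.bin is a made-up file name):

    # Sequential writes through /dbfs work, but rewinding and overwriting does not.
    with open('/dbfs/mnt/<path>/probe.bin', 'wb') as f:  # hypothetical probe file
        f.write(b'hello')  # sequential write: fine
        f.seek(0)          # rewind, so the next write is a random write
        f.write(b'H')      # fails with OSError: [Errno 95] Operation not supported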
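
    The same pattern extends to archiving several files at once: build the whole archive on the driver's local disk, then move it to DBFS in one step. A sketch, assuming a hypothetical folder of CSVs under the same mount (my_files.zip is a made-up name):

    import glob
    import os
    import zipfile

    local_path = '/tmp/my_files.zip'
    with zipfile.ZipFile(local_path, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
        for f in glob.glob('/dbfs/mnt/<path>/*.csv'):  # hypothetical folder of CSVs
            zf.write(f, arcname=os.path.basename(f))   # store bare file names in the archive
    dbutils.fs.mv(f"file:{local_path}", 'dbfs:/mnt/<path>/my_files.zip')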