python, file, zip, databricks, parquet

How to zip two files in Databricks?


I have two files stored in 'dbfs/tmp/folder/'. I am trying to zip them; the code runs without error, but the created .zip file cannot be seen in the folder. What is the best way to zip two files in Databricks?

Code:

import zipfile

file_paths = ['/dbfs/dbfs/tmp/folder1/test1.parquet',
              '/dbfs/dbfs/tmp/folder1/test2.parquet']
zip_name = 'myzip.zip'
zip_file = zipfile.ZipFile(zip_name, "w")
for file in file_paths:
  zip_file.write(file)
zip_file.close()

It executes with no error, but the zipped file cannot be seen under '/dbfs/dbfs/tmp/folder1/'.


Solution

  • By default the file will be created on the local disk of the driver node. You can't use /dbfs/... as the output destination because of the DBFS limitations described in this answer. What you'll need to do is:

    1. Write the file to the local disk in a known location, for example, zip_name = '/tmp/myzip.zip'
    2. When the file is written, copy it to DBFS with the dbutils.fs.cp command, using file: as the prefix for the local file name:
    zip_name = '/tmp/myzip.zip'
    zip_file = zipfile.ZipFile(zip_name, "w")
    for file in file_paths:
      zip_file.write(file)
    zip_file.close()
    # copy the file from the driver's local disk to DBFS
    # (the destination path is resolved in the DBFS namespace, i.e. dbfs:/tmp/myzip.zip)
    dbutils.fs.cp(f"file:{zip_name}", zip_name)
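
To confirm the copy worked, you can list the DBFS destination afterwards. This is a minimal check, assuming the same /tmp/myzip.zip path as above (dbutils is available in Databricks notebooks):

# list the DBFS directory; the copied zip should show up as dbfs:/tmp/myzip.zip
files = dbutils.fs.ls("dbfs:/tmp/")
print([f.name for f in files])

The same check can be done from a notebook cell with the %fs ls /tmp/ magic command.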