python python-3.x zip

Python - Writing zips of a specific size is very unreliable


I'm attempting to write a Python script that zips up a directory. I want to add files until the zip file is around 500MB, then start a new zip, and repeat until all the files have been zipped. The rules look like this:

1) Walk a directory, finding all files (using os.walk)
2) Write those files to a zip until the zip is ~500MB
3) Once that limit is reached, start a new zip following the same rules (~500MB limit)
4) The end result is N zips, each ~500MB or less

My code right now looks like:

#!/usr/bin/env python3

from zipfile import ZipFile
import os

current_dir = os.getcwd()
deal_name = current_dir.split("/")[-1:][0]
deal_folder = f'{current_dir}/{deal_name}'
deal_folder_exists = os.path.isdir(deal_folder)
file_paths = []
vol = 1
ZIP_MAX_SIZE = 500  # in MB (compared against a byte count divided by 1e6)

if not deal_folder_exists:
    print(f'Can not find deal folder: {deal_folder}')
    raise Exception('Missing Deal Folder')

# Generate a list of all files to be written to zip
for root, directories, files in os.walk(deal_folder):
    if files:
        # We have files to add to the zip
        for file in files:
            file_paths.append(f'.{root.replace(deal_folder, "")}/{file}')

# Change into the deal folder
os.chdir(deal_folder)

# writing files to a zipfile
deal_zip_path = f'../{deal_name}-vol{vol}.zip'
deal_zip = ZipFile(deal_zip_path, 'w')

# Just a dict for keeping track of the end size of each zip
zip_data = {deal_zip_path: 0}

# Begin looping over the files, and writing the files to the zip
for file in file_paths:
    deal_zip.write(file)
    size = round(sum([info.file_size for info in deal_zip.infolist()]) / 1e+6)

    # Track the current size
    zip_data[deal_zip_path] = size

    # If the current size exceeds the max, bump the vol var and start a new zip
    if size > ZIP_MAX_SIZE:
        deal_zip.close()
        vol += 1
        deal_zip_path = f'../{deal_name}-vol{vol}.zip'
        deal_zip = ZipFile(deal_zip_path, 'w')

# Close the final zip
deal_zip.close()

# Log the deets
print(zip_data)
print('All files zipped successfully!')

The zip_data print looks like this:

{
    '../magical-holiday-goodies-vol1.zip': 542, 
    '../magical-holiday-goodies-vol2.zip': 503, 
    '../magical-holiday-goodies-vol3.zip': 505, 
    '../magical-holiday-goodies-vol4.zip': 545, 
    # sometime later
    '../magical-holiday-goodies-vol15.zip': 309
}

So the script appears to be doing exactly what it should. However, the sizes of the resulting zips are very unpredictable. For instance, the log above says vol1.zip should be 542MB, when in reality I get:

Note the 12.2 MB zip...

Any idea why my logging shows the correct file sizes, when in reality the resulting zip sizes are all over the place?


Solution

  • It turns out that, by default, ZipFile only stores the files without compressing them (ZIP_STORED).

    Replace both constructor calls with: deal_zip = ZipFile(deal_zip_path, 'w', compression=ZIP_DEFLATED)

    Also update the import: from zipfile import ZipFile, ZIP_DEFLATED
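
    Once compression is on, the logged numbers will still not match the files on disk, because info.file_size is each member's uncompressed size while info.compress_size is what it occupies inside the archive. Below is a minimal sketch of the same loop that counts compressed bytes instead, assuming the ~500MB limit is meant to apply to the finished zip. The function name zip_in_volumes and its parameters are hypothetical, not part of the original script.

```python
import os
from zipfile import ZipFile, ZIP_DEFLATED

MAX_BYTES = 500 * 1000 * 1000  # ~500 MB, counted in compressed bytes


def zip_in_volumes(src_dir, out_prefix, max_bytes=MAX_BYTES):
    """Zip every file under src_dir into out_prefix-vol1.zip, -vol2.zip, ...

    Returns {zip path: size on disk in bytes}. A volume is closed once the
    compressed bytes written to it reach max_bytes, so it can overshoot the
    limit by up to one file (as in the question's loop).
    """
    sizes = {}
    zf = None
    zip_path = None
    written = 0
    vol = 0
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            if zf is None:
                # Open the next volume only when there is a file to put in it.
                vol += 1
                zip_path = f"{out_prefix}-vol{vol}.zip"
                zf = ZipFile(zip_path, "w", compression=ZIP_DEFLATED)
                written = 0
            full = os.path.join(root, name)
            zf.write(full, os.path.relpath(full, src_dir))
            # compress_size of the member just added, not file_size.
            written += zf.infolist()[-1].compress_size
            if written >= max_bytes:
                zf.close()
                sizes[zip_path] = os.path.getsize(zip_path)
                zf = None
    if zf is not None:
        zf.close()
        sizes[zip_path] = os.path.getsize(zip_path)
    return sizes
```

    out_prefix should point outside src_dir (as the '../' paths in the question do), otherwise os.walk may pick up the archives being written. The sizes it returns come from os.path.getsize, so they include the zip headers and should match what a file browser shows.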