Create big zip archives with lots of small files in memory (on the fly) with Python


The task is:

  1. read many files one by one from S3 storage
  2. add the files to big_archive.zip
  3. store big_archive.zip back to S3 storage

Problem:

When we append a new file to a zip archive, the zip library modifies the current archive (updating its meta-information) and only then adds the file contents (bytes). Because the archive is big, we need to upload it to S3 storage in chunks. BUT! Chunks that have already been uploaded cannot be rewritten, so we cannot update the meta-information afterwards.
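
For reference, this is roughly how the chunked upload side looks if S3 multipart upload is used (a minimal sketch, assuming boto3; the bucket/key names and the helper are placeholders). Every part is immutable once uploaded, which is why already stored chunks cannot be rewritten:

# Minimal sketch of the S3 side (assumes boto3; bucket/key names are placeholders).
# Every part of a multipart upload is immutable once uploaded, so already
# written parts cannot be patched afterwards.
import boto3


def upload_chunks(chunks, bucket='my-bucket', key='big_archive.zip'):
    s3 = boto3.client('s3')
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    # every part except the last one must be at least 5MB
    for number, chunk in enumerate(chunks, start=1):
        part = s3.upload_part(
            Bucket=bucket, Key=key,
            PartNumber=number, UploadId=upload['UploadId'],
            Body=chunk
        )
        parts.append({'PartNumber': number, 'ETag': part['ETag']})
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload['UploadId'],
        MultipartUpload={'Parts': parts}
    )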

This code illustrates the problem:

from io import BytesIO
import zipfile, sys, gc

files = (
    'input/i_1.docx',  # one file size is about ~500KB
    'input/i_2.docx',
    ...
    'input/i_11.docx',
    'input/i_12.docx',
    'input/i_13.docx',
    'input/i_14.docx'
)


# this function returns the size of an in-memory object
# thanks to
# https://towardsdatascience.com/the-strange-size-of-python-objects-in-memory-ce87bdfbb97f
def _get_size(input_obj):
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)
    return memory_size


# open in-memory object
with BytesIO() as zip_obj_in_memory:
    # open zip archive on disk
    with open('tmp.zip', 'wb') as resulted_file:
        # set chunk size to 1MB
        chunk_max_size = 1048576  # 1MB
        # iterate over files
        for f in files:
            # get size of in-memory object
            current_size = _get_size(zip_obj_in_memory)
            # if size of in-memory object is bigger than 1MB
            # we need to drop it to S3 storage
            if current_size > chunk_max_size:
                # write the data to disk (it doesn't matter whether the storage is S3 or disk)
                resulted_file.write(zip_obj_in_memory.getvalue())
                # remove current in-memory data
                zip_obj_in_memory.seek(0)
                # zip_obj_in_memory is empty after truncate, so we can add new files
                zip_obj_in_memory.truncate()

            # open zip_obj_in_memory in append mode and append the new file
            with zipfile.ZipFile(zip_obj_in_memory, 'a', compression=zipfile.ZIP_DEFLATED) as zf:
                # read file and write it to archive
                with open(f, 'rb') as o:
                    zf.writestr(
                        zinfo_or_arcname=f.replace('input/', 'output/'),
                        data=o.read()
                    )
        # write last chunk of data
        resulted_file.write(zip_obj_in_memory.getvalue())

Now let's try to list the files in the archive:

unzip -l tmp.zip
Archive:  tmp.zip
warning [tmp.zip]:  6987483 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
   583340  12-15-2021 18:43   output/i_13.docx
   583335  12-15-2021 18:43   output/i_14.docx
---------                     -------
  1166675                     2 files

As we can see, only the last ~1MB chunk is visible.

Let's fix this archive:

zip -FF tmp.zip --out fixed.zip
Fix archive (-FF) - salvage what can
 Found end record (EOCDR) - says expect single disk archive
Scanning for entries...
 copying: output/i_1.docx  (582169 bytes)
 copying: output/i_2.docx  (582152 bytes)
Central Directory found...
EOCDR found ( 1 1164533)...
 copying: output/i_3.docx  (582175 bytes)
Entry after central directory found ( 1 1164555)...
 copying: output/i_4.docx  (582175 bytes)
Central Directory found...
EOCDR found ( 1 2329117)...
 copying: output/i_5.docx  (582176 bytes)
Entry after central directory found ( 1 2329139)...
 copying: output/i_6.docx  (582180 bytes)
Central Directory found...
EOCDR found ( 1 3493707)...
 copying: output/i_7.docx  (582170 bytes)
Entry after central directory found ( 1 3493729)...
 copying: output/i_8.docx  (582174 bytes)
Central Directory found...
...

And after that:

unzip -l fixed.zip
Archive:  fixed.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   583344  12-15-2021 18:43   output/i_1.docx
   583337  12-15-2021 18:43   output/i_2.docx
   583346  12-15-2021 18:43   output/i_3.docx
   583352  12-15-2021 18:43   output/i_4.docx
   583361  12-15-2021 18:43   output/i_5.docx
   583368  12-15-2021 18:43   output/i_6.docx
   583356  12-15-2021 18:43   output/i_7.docx
   583362  12-15-2021 18:43   output/i_8.docx
   583337  12-15-2021 18:43   output/i_9.docx
   583352  12-15-2021 18:43   output/i_10.docx
   583363  12-15-2021 18:43   output/i_11.docx
   583368  12-15-2021 18:43   output/i_12.docx
   583340  12-15-2021 18:43   output/i_13.docx
   583335  12-15-2021 18:43   output/i_14.docx
---------                     -------
  8166921                     14 files

Extracting the files also works fine, and the file contents are correct.

According to Wikipedia:

The needed meta-information is stored in the Central directory (CD).

So we need to strip the Central directory info on every file append (before storing the chunk to disk or S3) and, at the very end, add correct info about all files manually.

Is this possible? And if so, how can it be done?

At the very least, is there any way to diff tmp.zip and fixed.zip in a human-readable binary mode, so I can check where the CD is stored and what its format is?
Any exact references to the ZIP format that could help with this problem are also welcome.
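
For the binary inspection part, besides a plain hexdump (xxd tmp.zip), one can scan both files for the well-known ZIP record signatures to see where the central directory records ended up. A minimal sketch (the helper name is made up; the signature values come from the ZIP application note):

# Scan an archive for the well-known ZIP record signatures:
# local file header, central directory file header, end of central directory.
SIGNATURES = {
    b'PK\x03\x04': 'local file header',
    b'PK\x01\x02': 'central directory file header',
    b'PK\x05\x06': 'end of central directory record',
}


def scan_signatures(path):
    with open(path, 'rb') as f:
        data = f.read()
    for signature, name in SIGNATURES.items():
        position = data.find(signature)
        while position != -1:
            print(f'{path}: {name} at offset {position}')
            position = data.find(signature, position + 1)


scan_signatures('tmp.zip')
scan_signatures('fixed.zip')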


Solution

  • OK, I finally created this zip Frankenstein:

    from io import BytesIO
    from zipfile import ZipFile, ZIP_DEFLATED
    import sys
    import gc
    
    files = (
        'input/i_1.docx',  # one file size is about ~580KB
        'input/i_2.docx',
        'input/i_3.docx',
        'input/i_4.docx',
        'input/i_5.docx',
        'input/i_6.docx',
        'input/i_7.docx',
        'input/i_8.docx',
        'input/i_9.docx',
        'input/i_10.docx',
        'input/i_11.docx',
        'input/i_12.docx',
        'input/i_13.docx',
        'input/i_14.docx',
        'input/i_21.docx'
    )
    
    
    # this function returns the size of an in-memory object
    # added for debug purposes only
    def _get_size(input_obj):
        memory_size = 0
        ids = set()
        objects = [input_obj]
        while objects:
            new = []
            for obj in objects:
                if id(obj) not in ids:
                    ids.add(id(obj))
                    memory_size += sys.getsizeof(obj)
                    new.append(obj)
            objects = gc.get_referents(*new)
        return memory_size
    
    
    class CustomizedZipFile(ZipFile):
    
        # customized BytesIO that can report a faked offset
        class _CustomizedBytesIO(BytesIO):
    
            def __init__(self, fake_offset: int):
                self.fake_offset = fake_offset
                self.temporary_switch_to_faked_offset = False
                super().__init__()
    
            def tell(self):
                if self.temporary_switch_to_faked_offset:
                    # revert tell method to normal mode to minimize faked behaviour
                    self.temporary_switch_to_faked_offset = False
                    return super().tell() + self.fake_offset
                else:
                    return super().tell()
    
        def __init__(self, *args, **kwargs):
            # create empty file to write if fake offset is set
            if 'fake_offset' in kwargs and kwargs['fake_offset'] is not None and kwargs['fake_offset'] > 0:
                self._fake_offset = kwargs['fake_offset']
                del kwargs['fake_offset']
                if 'file' in kwargs:
                    kwargs['file'] = self._CustomizedBytesIO(self._fake_offset)
                else:
                    args = list(args)
                    args[0] = self._CustomizedBytesIO(self._fake_offset)
            else:
                self._fake_offset = 0
            super().__init__(*args, **kwargs)
    
        # finalize zip (should be run only on last chunk)
        def force_write_end_record(self):
            self._write_end_record(False)
    
        # don't write the end record by default, so we can produce chunks without end records
        # (ZipFile writes the end meta-info on close by default)
        def _write_end_record(self, skip_write_end=True):
            if not skip_write_end:
                if self._fake_offset > 0:
                    self.start_dir = self._fake_offset
                    self.fp.temporary_switch_to_faked_offset = True
                super()._write_end_record()
    
    
    def archive(files):
    
        compression_type = ZIP_DEFLATED
        CHUNK_SIZE = 1048576  # 1MB
    
        with open('tmp.zip', 'wb') as resulted_file:
            offset = 0
            filelist = []
            with BytesIO() as chunk:
                for f in files:
                    with BytesIO() as tmp:
                        with CustomizedZipFile(tmp, 'w', compression=compression_type) as zf:
                            with open(f, 'rb') as b:
                                zf.writestr(
                                    zinfo_or_arcname=f.replace('input/', 'output/'),
                                    data=b.read()
                                )
                            zf.filelist[0].header_offset = offset
                            data = tmp.getvalue()
                            offset = offset + len(data)
                        filelist.append(zf.filelist[0])
                    chunk.write(data)
                    print('size of zipfile:', _get_size(zf))
                    print('size of chunk:', _get_size(chunk))
                    if len(chunk.getvalue()) > CHUNK_SIZE:
                        resulted_file.write(chunk.getvalue())
                        chunk.seek(0)
                        chunk.truncate()
                # write last chunk
                resulted_file.write(chunk.getvalue())
            # the file parameter can be None when we use fake_offset,
            # because an empty _CustomizedBytesIO will be initialized in the constructor
            with CustomizedZipFile(None, 'w', compression=compression_type, fake_offset=offset) as zf:
                zf.filelist = filelist
                zf.force_write_end_record()
                end_data = zf.fp.getvalue()
            resulted_file.write(end_data)
    
    
    archive(files)
    

    Output is:

    size of zipfile: 2182955
    size of chunk: 582336
    size of zipfile: 2182979
    size of chunk: 1164533
    size of zipfile: 2182983
    size of chunk: 582342
    size of zipfile: 2182979
    size of chunk: 1164562
    size of zipfile: 2182983
    size of chunk: 582343
    size of zipfile: 2182979
    size of chunk: 1164568
    size of zipfile: 2182983
    size of chunk: 582337
    size of zipfile: 2182983
    size of chunk: 1164556
    size of zipfile: 2182983
    size of chunk: 582329
    size of zipfile: 2182984
    size of chunk: 1164543
    size of zipfile: 2182984
    size of chunk: 582355
    size of zipfile: 2182984
    size of chunk: 1164586
    size of zipfile: 2182984
    size of chunk: 582338
    size of zipfile: 2182984
    size of chunk: 1164545
    size of zipfile: 2182980
    size of chunk: 582320
    

    So we can see that the chunk is dumped to storage and truncated whenever it reaches the maximum chunk size (1MB in my case).

    The resulting archive was tested with macOS The Unarchiver v4.2.4, the Windows 10 default unarchiver, and 7-Zip.
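
    The result can also be sanity-checked with the standard zipfile module; a minimal sketch:

    # sanity check of the chunked archive with the standard zipfile module
    from zipfile import ZipFile

    with ZipFile('tmp.zip') as zf:
        print(zf.namelist())  # should list all 15 output/*.docx entries
        print(zf.testzip())   # None means all CRCs are OK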

    Note!

    The archive created in chunks is 16 bytes bigger than an archive created by the plain zipfile library. Probably some extra zero bytes are written somewhere; I didn't check why that happens.
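
    One way to track it down would be to build the same archive with the plain ZipFile and compare the two files; a sketch, reusing the files tuple from above:

    # build a reference archive with the plain ZipFile and compare sizes
    import os
    from zipfile import ZipFile, ZIP_DEFLATED

    with ZipFile('reference.zip', 'w', compression=ZIP_DEFLATED) as zf:
        for f in files:
            with open(f, 'rb') as b:
                zf.writestr(f.replace('input/', 'output/'), b.read())

    print(os.path.getsize('tmp.zip') - os.path.getsize('reference.zip'))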

    zipfile is the worst Python library I've ever seen. It looks like it is meant to be used as a non-extensible, binary-like file.