Create big zip archives with lots of small files in memory (on the fly) with Python


The task is:

  1. read many files one by one from S3 storage
  2. add the files to big_archive.zip
  3. store big_archive.zip back to S3 storage

Problem:

When we append a new file to a zip archive, the zip library modifies the current archive (updating its meta-information) and only then adds the file contents (bytes). Because the archive is big, we need to upload it to S3 storage in chunks. BUT! Chunks that have already been uploaded cannot be rewritten, so we cannot update the meta-information afterwards.
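
For reference, this is roughly how the chunked upload side looks if S3 multipart upload is used (a minimal sketch, assuming boto3; the bucket/key names and the helper are placeholders). Every part is immutable once uploaded, which is why already stored chunks cannot be rewritten:

# Minimal sketch of the S3 side (assumes boto3; bucket/key names are placeholders).
# Every part of a multipart upload is immutable once uploaded, so already
# written parts cannot be patched afterwards.
import boto3


def upload_chunks(chunks, bucket='my-bucket', key='big_archive.zip'):
    s3 = boto3.client('s3')
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    # every part except the last one must be at least 5MB
    for number, chunk in enumerate(chunks, start=1):
        part = s3.upload_part(
            Bucket=bucket, Key=key,
            PartNumber=number, UploadId=upload['UploadId'],
            Body=chunk
        )
        parts.append({'PartNumber': number, 'ETag': part['ETag']})
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload['UploadId'],
        MultipartUpload={'Parts': parts}
    )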

This code illustrates the problem:

from io import BytesIO
import zipfile, sys, gc

files = (
    'input/i_1.docx',  # one file size is about ~500KB
    'input/i_2.docx',
    ...
    'input/i_11.docx',
    'input/i_12.docx',
    'input/i_13.docx',
    'input/i_14.docx'
)


# this function returns the size of an in-memory object
# thanks to
# https://towardsdatascience.com/the-strange-size-of-python-objects-in-memory-ce87bdfbb97f
def _get_size(input_obj):
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)
    return memory_size


# open in-memory object
with BytesIO() as zip_obj_in_memory:
    # open zip archive on disk
    with open('tmp.zip', 'wb') as resulted_file:
        # set chunk size to 1MB
        chunk_max_size = 1048576  # 1MB
        # iterate over files
        for f in files:
            # get size of in-memory object
            current_size = _get_size(zip_obj_in_memory)
            # if size of in-memory object is bigger than 1MB
            # we need to drop it to S3 storage
            if current_size > chunk_max_size:
                # write the data to disk (it doesn't matter whether the storage is S3 or disk)
                resulted_file.write(zip_obj_in_memory.getvalue())
                # remove current in-memory data
                zip_obj_in_memory.seek(0)
                # zip_obj_in_memory is empty after truncate, so we can add new files
                zip_obj_in_memory.truncate()

            # open zip_obj_in_memory in append mode and append the new file
            with zipfile.ZipFile(zip_obj_in_memory, 'a', compression=zipfile.ZIP_DEFLATED) as zf:
                # read file and write it to archive
                with open(f, 'rb') as o:
                    zf.writestr(
                        zinfo_or_arcname=f.replace('input/', 'output/'),
                        data=o.read()
                    )
        # write last chunk of data
        resulted_file.write(zip_obj_in_memory.getvalue())

Now let's try to list the files in the archive:

unzip -l tmp.zip
Archive:  tmp.zip
warning [tmp.zip]:  6987483 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  Length      Date    Time    Name
---------  ---------- -----   ----
   583340  12-15-2021 18:43   output/i_13.docx
   583335  12-15-2021 18:43   output/i_14.docx
---------                     -------
  1166675                     2 files

As we can see, only the last ~1MB chunk is visible.

Let's fix this archive:

zip -FF tmp.zip --out fixed.zip
Fix archive (-FF) - salvage what can
 Found end record (EOCDR) - says expect single disk archive
Scanning for entries...
 copying: output/i_1.docx  (582169 bytes)
 copying: output/i_2.docx  (582152 bytes)
Central Directory found...
EOCDR found ( 1 1164533)...
 copying: output/i_3.docx  (582175 bytes)
Entry after central directory found ( 1 1164555)...
 copying: output/i_4.docx  (582175 bytes)
Central Directory found...
EOCDR found ( 1 2329117)...
 copying: output/i_5.docx  (582176 bytes)
Entry after central directory found ( 1 2329139)...
 copying: output/i_6.docx  (582180 bytes)
Central Directory found...
EOCDR found ( 1 3493707)...
 copying: output/i_7.docx  (582170 bytes)
Entry after central directory found ( 1 3493729)...
 copying: output/i_8.docx  (582174 bytes)
Central Directory found...
...

And after that:

unzip -l fixed.zip
Archive:  fixed.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
   583344  12-15-2021 18:43   output/i_1.docx
   583337  12-15-2021 18:43   output/i_2.docx
   583346  12-15-2021 18:43   output/i_3.docx
   583352  12-15-2021 18:43   output/i_4.docx
   583361  12-15-2021 18:43   output/i_5.docx
   583368  12-15-2021 18:43   output/i_6.docx
   583356  12-15-2021 18:43   output/i_7.docx
   583362  12-15-2021 18:43   output/i_8.docx
   583337  12-15-2021 18:43   output/i_9.docx
   583352  12-15-2021 18:43   output/i_10.docx
   583363  12-15-2021 18:43   output/i_11.docx
   583368  12-15-2021 18:43   output/i_12.docx
   583340  12-15-2021 18:43   output/i_13.docx
   583335  12-15-2021 18:43   output/i_14.docx
---------                     -------
  8166921                     14 files

Extracting the files also works fine, and the file contents are correct.

According to Wikipedia:

The needed meta-information is stored in the Central directory (CD).

So we need to strip the Central directory info on every file append (before storing the chunk to disk or S3) and, at the very end, add correct info about all files manually.

Is this possible? And if so, how can it be done?

At the very least, is there any way to diff tmp.zip and fixed.zip in a human-readable binary mode, so I can check where the CD is stored and what its format is?
Any exact references to the ZIP format that could help with this problem are also welcome.
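
For the binary inspection part, besides a plain hexdump (xxd tmp.zip), one can scan both files for the well-known ZIP record signatures to see where the central directory records ended up. A minimal sketch (the helper name is made up; the signature values come from the ZIP application note):

# Scan an archive for the well-known ZIP record signatures:
# local file header, central directory file header, end of central directory.
SIGNATURES = {
    b'PK\x03\x04': 'local file header',
    b'PK\x01\x02': 'central directory file header',
    b'PK\x05\x06': 'end of central directory record',
}


def scan_signatures(path):
    with open(path, 'rb') as f:
        data = f.read()
    for signature, name in SIGNATURES.items():
        position = data.find(signature)
        while position != -1:
            print(f'{path}: {name} at offset {position}')
            position = data.find(signature, position + 1)


scan_signatures('tmp.zip')
scan_signatures('fixed.zip')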


Solution

  • OK, I finally created this zip Frankenstein:

    from io import BytesIO
    from zipfile import ZipFile, ZIP_DEFLATED
    import sys
    import gc
    
    files = (
        'input/i_1.docx',  # one file size is about ~580KB
        'input/i_2.docx',
        'input/i_3.docx',
        'input/i_4.docx',
        'input/i_5.docx',
        'input/i_6.docx',
        'input/i_7.docx',
        'input/i_8.docx',
        'input/i_9.docx',
        'input/i_10.docx',
        'input/i_11.docx',
        'input/i_12.docx',
        'input/i_13.docx',
        'input/i_14.docx',
        'input/i_21.docx'
    )
    
    
    # this function returns the size of an in-memory object
    # added for debug purposes only
    def _get_size(input_obj):
        memory_size = 0
        ids = set()
        objects = [input_obj]
        while objects:
            new = []
            for obj in objects:
                if id(obj) not in ids:
                    ids.add(id(obj))
                    memory_size += sys.getsizeof(obj)
                    new.append(obj)
            objects = gc.get_referents(*new)
        return memory_size
    
    
    class CustomizedZipFile(ZipFile):
    
        # customized BytesIO that can report a faked offset
        class _CustomizedBytesIO(BytesIO):
    
            def __init__(self, fake_offset: int):
                self.fake_offset = fake_offset
                self.temporary_switch_to_faked_offset = False
                super().__init__()
    
            def tell(self):
                if self.temporary_switch_to_faked_offset:
                    # revert tell method to normal mode to minimize faked behaviour
                    self.temporary_switch_to_faked_offset = False
                    return super().tell() + self.fake_offset
                else:
                    return super().tell()
    
        def __init__(self, *args, **kwargs):
            # create empty file to write if fake offset is set
            if 'fake_offset' in kwargs and kwargs['fake_offset'] is not None and kwargs['fake_offset'] > 0:
                self._fake_offset = kwargs['fake_offset']
                del kwargs['fake_offset']
                if 'file' in kwargs:
                    kwargs['file'] = self._CustomizedBytesIO(self._fake_offset)
                else:
                    args = list(args)
                    args[0] = self._CustomizedBytesIO(self._fake_offset)
            else:
                self._fake_offset = 0
            super().__init__(*args, **kwargs)
    
        # finalize zip (should be run only on last chunk)
        def force_write_end_record(self):
            self._write_end_record(False)
    
        # don't write the end record by default, so we can produce chunks without end records
        # (ZipFile writes the end meta-info on close by default)
        def _write_end_record(self, skip_write_end=True):
            if not skip_write_end:
                if self._fake_offset > 0:
                    self.start_dir = self._fake_offset
                    self.fp.temporary_switch_to_faked_offset = True
                super()._write_end_record()
    
    
    def archive(files):
    
        compression_type = ZIP_DEFLATED
        CHUNK_SIZE = 1048576  # 1MB
    
        with open('tmp.zip', 'wb') as resulted_file:
            offset = 0
            filelist = []
            with BytesIO() as chunk:
                for f in files:
                    with BytesIO() as tmp:
                        with CustomizedZipFile(tmp, 'w', compression=compression_type) as zf:
                            with open(f, 'rb') as b:
                                zf.writestr(
                                    zinfo_or_arcname=f.replace('input/', 'output/'),
                                    data=b.read()
                                )
                            zf.filelist[0].header_offset = offset
                            data = tmp.getvalue()
                            offset = offset + len(data)
                        filelist.append(zf.filelist[0])
                    chunk.write(data)
                    print('size of zipfile:', _get_size(zf))
                    print('size of chunk:', _get_size(chunk))
                    if len(chunk.getvalue()) > CHUNK_SIZE:
                        resulted_file.write(chunk.getvalue())
                        chunk.seek(0)
                        chunk.truncate()
                # write last chunk
                resulted_file.write(chunk.getvalue())
            # the file parameter can be None when we use fake_offset,
            # because an empty _CustomizedBytesIO will be initialized in the constructor
            with CustomizedZipFile(None, 'w', compression=compression_type, fake_offset=offset) as zf:
                zf.filelist = filelist
                zf.force_write_end_record()
                end_data = zf.fp.getvalue()
            resulted_file.write(end_data)
    
    
    archive(files)
    

    Output is:

    size of zipfile: 2182955
    size of chunk: 582336
    size of zipfile: 2182979
    size of chunk: 1164533
    size of zipfile: 2182983
    size of chunk: 582342
    size of zipfile: 2182979
    size of chunk: 1164562
    size of zipfile: 2182983
    size of chunk: 582343
    size of zipfile: 2182979
    size of chunk: 1164568
    size of zipfile: 2182983
    size of chunk: 582337
    size of zipfile: 2182983
    size of chunk: 1164556
    size of zipfile: 2182983
    size of chunk: 582329
    size of zipfile: 2182984
    size of chunk: 1164543
    size of zipfile: 2182984
    size of chunk: 582355
    size of zipfile: 2182984
    size of chunk: 1164586
    size of zipfile: 2182984
    size of chunk: 582338
    size of zipfile: 2182984
    size of chunk: 1164545
    size of zipfile: 2182980
    size of chunk: 582320
    

    So we can see that the chunk is dumped to storage and truncated whenever it reaches the maximum chunk size (1MB in my case).

    The resulting archive was tested with macOS The Unarchiver v4.2.4, the Windows 10 default unarchiver, and 7-Zip.
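
    The result can also be sanity-checked with the standard zipfile module; a minimal sketch:

    # sanity check of the chunked archive with the standard zipfile module
    from zipfile import ZipFile

    with ZipFile('tmp.zip') as zf:
        print(zf.namelist())  # should list all 15 output/*.docx entries
        print(zf.testzip())   # None means all CRCs are OK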

    Note!

    The archive created in chunks is 16 bytes bigger than an archive created by the plain zipfile library. Probably some extra zero bytes are written somewhere; I didn't check why that happens.
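
    One way to track it down would be to build the same archive with the plain ZipFile and compare the two files; a sketch, reusing the files tuple from above:

    # build a reference archive with the plain ZipFile and compare sizes
    import os
    from zipfile import ZipFile, ZIP_DEFLATED

    with ZipFile('reference.zip', 'w', compression=ZIP_DEFLATED) as zf:
        for f in files:
            with open(f, 'rb') as b:
                zf.writestr(f.replace('input/', 'output/'), b.read())

    print(os.path.getsize('tmp.zip') - os.path.getsize('reference.zip'))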

    zipfile is the worst Python library I've ever seen. It looks like it is meant to be used as a non-extensible, binary-like file.