The task is: appending new files to big_archive.zip at S3 storage.
Problem:
When we append a new file to a zip archive, the zip library changes the current archive (updates the meta-information) and only after that adds the file contents (bytes). Because the archive is big, we need to store it in chunks on S3 storage. BUT! Already stored chunks cannot be rewritten, and because of that we can't update the meta-information.
This code illustrates the problem:
from io import BytesIO
import zipfile, sys, gc

files = (
    'input/i_1.docx',  # one file size is about ~500KB
    'input/i_2.docx',
    ...
    'input/i_11.docx',
    'input/i_12.docx',
    'input/i_13.docx',
    'input/i_14.docx'
)
# this function allows getting the size of an in-memory object
# thanks to
# https://towardsdatascience.com/the-strange-size-of-python-objects-in-memory-ce87bdfbb97f
def _get_size(input_obj):
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)
    return memory_size
# open in-memory object
with BytesIO() as zip_obj_in_memory:
    # open the zip archive on disk
    with open('tmp.zip', 'wb') as resulted_file:
        # set chunk size to 1MB
        chunk_max_size = 1048576  # 1MB
        # iterate over files
        for f in files:
            # get the size of the in-memory object
            current_size = _get_size(zip_obj_in_memory)
            # if the size of the in-memory object is bigger than 1MB,
            # we need to dump it to S3 storage
            if current_size > chunk_max_size:
                # write to storage (it doesn't matter whether it is S3 or disk)
                resulted_file.write(zip_obj_in_memory.getvalue())
                # remove the current in-memory data
                zip_obj_in_memory.seek(0)
                # zip_obj_in_memory's size is 0 after truncate, so we are able to add new files
                zip_obj_in_memory.truncate()
            # the main process opens zip_obj_in_memory in append mode and appends the new file
            with zipfile.ZipFile(zip_obj_in_memory, 'a', compression=zipfile.ZIP_DEFLATED) as zf:
                # read the file and write it to the archive
                with open(f, 'rb') as o:
                    zf.writestr(
                        zinfo_or_arcname=f.replace('input/', 'output/'),
                        data=o.read()
                    )
        # write the last chunk of data
        resulted_file.write(zip_obj_in_memory.getvalue())
Now let's try to list the files in the archive:
unzip -l tmp.zip
Archive: tmp.zip
warning [tmp.zip]: 6987483 extra bytes at beginning or within zipfile
(attempting to process anyway)
Length Date Time Name
--------- ---------- ----- ----
583340 12-15-2021 18:43 output/i_13.docx
583335 12-15-2021 18:43 output/i_14.docx
--------- -------
1166675 2 files
As we can see, only the last 1MB chunk is visible.
Let's fix this archive:
zip -FF tmp.zip --out fixed.zip
Fix archive (-FF) - salvage what can
Found end record (EOCDR) - says expect single disk archive
Scanning for entries...
copying: output/i_1.docx (582169 bytes)
copying: output/i_2.docx (582152 bytes)
Central Directory found...
EOCDR found ( 1 1164533)...
copying: output/i_3.docx (582175 bytes)
Entry after central directory found ( 1 1164555)...
copying: output/i_4.docx (582175 bytes)
Central Directory found...
EOCDR found ( 1 2329117)...
copying: output/i_5.docx (582176 bytes)
Entry after central directory found ( 1 2329139)...
copying: output/i_6.docx (582180 bytes)
Central Directory found...
EOCDR found ( 1 3493707)...
copying: output/i_7.docx (582170 bytes)
Entry after central directory found ( 1 3493729)...
copying: output/i_8.docx (582174 bytes)
Central Directory found...
...
And after that:
unzip -l fixed.zip
Archive: fixed.zip
Length Date Time Name
--------- ---------- ----- ----
583344 12-15-2021 18:43 output/i_1.docx
583337 12-15-2021 18:43 output/i_2.docx
583346 12-15-2021 18:43 output/i_3.docx
583352 12-15-2021 18:43 output/i_4.docx
583361 12-15-2021 18:43 output/i_5.docx
583368 12-15-2021 18:43 output/i_6.docx
583356 12-15-2021 18:43 output/i_7.docx
583362 12-15-2021 18:43 output/i_8.docx
583337 12-15-2021 18:43 output/i_9.docx
583352 12-15-2021 18:43 output/i_10.docx
583363 12-15-2021 18:43 output/i_11.docx
583368 12-15-2021 18:43 output/i_12.docx
583340 12-15-2021 18:43 output/i_13.docx
583335 12-15-2021 18:43 output/i_14.docx
--------- -------
8166921 14 files
Extracting the files also works, and the file contents are correct.
According to Wikipedia, the needed meta-information is stored in the Central directory (CD). So we need to strip the Central directory info on every file append (before the chunk is stored to disk or S3) and finally write the correct info about all the files manually at the end.
Is it possible? And if yes, how can it be done?
At the very least, is there any way to diff tmp.zip and fixed.zip in a human-readable binary mode, so I can check where the CD is stored and what its format is?
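Something like the rough sketch below is what I have in mind: scanning both files for the well-known ZIP record signatures (local file header PK\x03\x04, central directory file header PK\x01\x02, end of central directory record PK\x05\x06) and printing their offsets, so the two archives can be compared side by side. This is only a sketch, not a proper parser:

# rough sketch: print the offsets of the main ZIP records in a file,
# so tmp.zip and fixed.zip can be compared side by side
SIGNATURES = {
    b'PK\x03\x04': 'local file header',
    b'PK\x01\x02': 'central directory header',
    b'PK\x05\x06': 'end of central directory record',
}

def scan_records(path):
    with open(path, 'rb') as f:
        data = f.read()
    for signature, name in SIGNATURES.items():
        pos = data.find(signature)
        while pos != -1:
            # note: these byte patterns can also appear by chance inside
            # compressed data, so treat the output as a hint, not as truth
            print(f'{path}: {name} at offset {pos}')
            pos = data.find(signature, pos + 1)

scan_records('tmp.zip')
scan_records('fixed.zip')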
Any exact references to the ZIP specification that could help with this problem are also welcome.
OK, I finally created this zip Frankenstein:
from io import BytesIO
from zipfile import ZipFile, ZIP_DEFLATED
import sys
import gc

files = (
    'input/i_1.docx',  # one file size is about ~580KB
    'input/i_2.docx',
    'input/i_3.docx',
    'input/i_4.docx',
    'input/i_5.docx',
    'input/i_6.docx',
    'input/i_7.docx',
    'input/i_8.docx',
    'input/i_9.docx',
    'input/i_10.docx',
    'input/i_11.docx',
    'input/i_12.docx',
    'input/i_13.docx',
    'input/i_14.docx',
    'input/i_21.docx'
)
# this function allows getting the size of an in-memory object
# added only for debug purposes
def _get_size(input_obj):
    memory_size = 0
    ids = set()
    objects = [input_obj]
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
        objects = gc.get_referents(*new)
    return memory_size
class CustomizedZipFile(ZipFile):

    # customized BytesIO that is able to return a faked offset
    class _CustomizedBytesIO(BytesIO):

        def __init__(self, fake_offset: int):
            self.fake_offset = fake_offset
            self.temporary_switch_to_faked_offset = False
            super().__init__()

        def tell(self):
            if self.temporary_switch_to_faked_offset:
                # revert the tell method to normal mode to minimize the faked behaviour
                self.temporary_switch_to_faked_offset = False
                return super().tell() + self.fake_offset
            else:
                return super().tell()

    def __init__(self, *args, **kwargs):
        # create an empty in-memory file to write into if a fake offset is set
        if 'fake_offset' in kwargs and kwargs['fake_offset'] is not None and kwargs['fake_offset'] > 0:
            self._fake_offset = kwargs['fake_offset']
            del kwargs['fake_offset']
            if 'file' in kwargs:
                kwargs['file'] = self._CustomizedBytesIO(self._fake_offset)
            else:
                args = list(args)
                args[0] = self._CustomizedBytesIO(self._fake_offset)
        else:
            self._fake_offset = 0
        super().__init__(*args, **kwargs)

    # finalize the zip (should be run only on the last chunk)
    def force_write_end_record(self):
        self._write_end_record(False)

    # don't write the end record by default, so we can get non-finalized chunks
    # (ZipFile writes the end meta-info on close by default)
    def _write_end_record(self, skip_write_end=True):
        if not skip_write_end:
            if self._fake_offset > 0:
                self.start_dir = self._fake_offset
                self.fp.temporary_switch_to_faked_offset = True
            super()._write_end_record()
def archive(files):
    compression_type = ZIP_DEFLATED
    CHUNK_SIZE = 1048576  # 1MB
    with open('tmp.zip', 'wb') as resulted_file:
        offset = 0
        filelist = []
        with BytesIO() as chunk:
            for f in files:
                with BytesIO() as tmp:
                    with CustomizedZipFile(tmp, 'w', compression=compression_type) as zf:
                        with open(f, 'rb') as b:
                            zf.writestr(
                                zinfo_or_arcname=f.replace('input/', 'output/'),
                                data=b.read()
                            )
                        zf.filelist[0].header_offset = offset
                    data = tmp.getvalue()
                    offset = offset + len(data)
                    filelist.append(zf.filelist[0])
                    chunk.write(data)
                    print('size of zipfile:', _get_size(zf))
                    print('size of chunk:', _get_size(chunk))
                if len(chunk.getvalue()) > CHUNK_SIZE:
                    resulted_file.write(chunk.getvalue())
                    chunk.seek(0)
                    chunk.truncate()
            # write the last chunk
            resulted_file.write(chunk.getvalue())
            # the file parameter may be skipped if we are using fake_offset,
            # because an empty _CustomizedBytesIO will be initialized in the constructor
            with CustomizedZipFile(None, 'w', compression=compression_type, fake_offset=offset) as zf:
                zf.filelist = filelist
                zf.force_write_end_record()
                end_data = zf.fp.getvalue()
            resulted_file.write(end_data)

archive(files)
Output is:
size of zipfile: 2182955
size of chunk: 582336
size of zipfile: 2182979
size of chunk: 1164533
size of zipfile: 2182983
size of chunk: 582342
size of zipfile: 2182979
size of chunk: 1164562
size of zipfile: 2182983
size of chunk: 582343
size of zipfile: 2182979
size of chunk: 1164568
size of zipfile: 2182983
size of chunk: 582337
size of zipfile: 2182983
size of chunk: 1164556
size of zipfile: 2182983
size of chunk: 582329
size of zipfile: 2182984
size of chunk: 1164543
size of zipfile: 2182984
size of chunk: 582355
size of zipfile: 2182984
size of chunk: 1164586
size of zipfile: 2182984
size of chunk: 582338
size of zipfile: 2182984
size of chunk: 1164545
size of zipfile: 2182980
size of chunk: 582320
So we can see that the chunk is always dumped to storage and truncated when it reaches the maximum chunk size (1MB in my case).
The resulting archive was tested with The Unarchiver v4.2.4 on macOS, the default Windows 10 unarchiver, and 7-Zip.
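It can also be sanity-checked with the zipfile module itself; a minimal sketch, assuming the tmp.zip produced by the script above:

# quick CRC check of the chunked archive with the standard zipfile module
import zipfile

with zipfile.ZipFile('tmp.zip') as zf:
    print(zf.namelist())   # should list every output/i_*.docx entry
    print(zf.testzip())    # None means all CRCs are OK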
The archive created by chunks is 16 bytes bigger than an archive created by the plain zipfile library. Probably some extra zero bytes are written somewhere; I didn't check why it happens.
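If anyone wants to track those 16 bytes down, a byte-by-byte comparison of the two outputs would show where they first diverge; a minimal sketch, where chunked.zip and plain.zip are placeholder names for the two archives:

# find the first offset where two archives differ (illustrative sketch)
def first_diff(path_a, path_b):
    with open(path_a, 'rb') as a, open(path_b, 'rb') as b:
        data_a, data_b = a.read(), b.read()
    for i, (x, y) in enumerate(zip(data_a, data_b)):
        if x != y:
            return i
    # no mismatch found: one file is a prefix of the other
    return min(len(data_a), len(data_b))

print(first_diff('chunked.zip', 'plain.zip'))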
zipfile is the worst Python library I've ever seen. It looks like it is supposed to be used as a non-extendable, binary-like file.