I have a dataset with many different samples (numpy arrays). It is rather impractical to store everything in only one file, so I store many different 'npz' files (numpy arrays compressed in zip).
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
I picture having, say, a single 'zip basis' file and a separate list of compressed files, each much smaller than the same file zipped alone; to decompress any file I would reuse the shared 'zip basis' every time.
Is it technically possible? Is there something that works like this?
tl;dr: It depends on the size of each individual file and the data therein. For example, the characteristics / use-cases / access patterns likely vary wildly between 234567 files of 100 bytes each and 100 files of 234567 bytes each.
Now I feel that if I could somehow exploit the fact that all the files are similar to one another I could achieve a much higher compression factor, meaning a much smaller footprint on my disk.
Possibly. The benefit of shared compression decreases as individual file sizes increase.
Regardless, even a Mono File implementation (say, a standard zip archive) can save a significant amount of effective disk space for very many very small files, as it avoids the overhead file-systems require to manage individual files; if nothing else, many file-systems align each file to full blocks [e.g. 512-4k bytes]. Plus, you get free compression in a ubiquitously supported format.
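As a concrete illustration of that mono-file route, plain numpy already bundles many arrays into one zip-backed, compressed file while keeping each sample addressable by name. A minimal sketch (the file and array names are just for illustration):

    import numpy as np

    # Hypothetical example: many small, similar samples.
    rng = np.random.default_rng(0)
    samples = {f"sample_{i}": rng.random(100) for i in range(1000)}

    # One zip-backed container instead of 1000 tiny files; avoids
    # per-file filesystem overhead (block alignment, metadata).
    np.savez_compressed("dataset.npz", **samples)

    # Samples remain individually addressable by name.
    with np.load("dataset.npz") as archive:
        first = archive["sample_0"]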
Is it possible to store separately a 'zip basis'? I mean something which is calculated for all the files together and embodies their statistical features and is needed for decompression, but is shared between all the files.
This 'zip basis' is sometimes called a Pre-shared Dictionary.
I would have said 'zip basis' file and a separate list of compressed files, which would be much smaller in size than each file zipped alone, and to decompress I would use the share 'zip basis' every time for each file.
Is it technically possible? Is there something that works like this?
Yes, it's possible. SDCH (Shared Dictionary Compression for HTTP) was one such implementation designed for common web files (e.g. HTML/CSS/JavaScript). In certain cases it could achieve significantly higher compression than standard DEFLATE.
The approach can be emulated with many compression algorithms that work on streams, where the compression dictionary is encoded as part of the stream as it is written. (U = uncompressed, C = compressed; a Python sketch using zlib's preset-dictionary support follows the diagrams.)
To compress:
    [U:shared_dict] + [U:data] -> [C:shared_dict] + [C:data]
     ^-- "zip basis"                                 ^-- write only this to file
                                   ^-- artifact of priming
To decompress:
    [C:shared_dict] + [C:data] -> [U:shared_dict] + [U:data]
     ^-- add this back before decompressing!         ^-- use this
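In Python this idea is available directly through zlib's preset dictionary (the zdict argument of compressobj/decompressobj, Python 3.3+): the dictionary primes the sliding window (up to 32 KiB) but is never written into the output, so only the per-file data hits the disk, and the same dictionary bytes must be available again at decompression time. A minimal sketch; the SHARED_DICT contents below are just a stand-in for whatever is genuinely common across your files:

    import zlib

    # The shared "zip basis": bytes resembling the common structure of
    # every file (illustrative placeholder; in practice a representative
    # sample or concatenated common fragments, up to 32 KiB is useful).
    SHARED_DICT = b"timestamp,sensor_id,temperature,humidity,pressure\n" * 10

    def compress_one(data: bytes) -> bytes:
        co = zlib.compressobj(level=9, zdict=SHARED_DICT)
        return co.compress(data) + co.flush()

    def decompress_one(blob: bytes) -> bytes:
        dco = zlib.decompressobj(zdict=SHARED_DICT)
        return dco.decompress(blob) + dco.flush()

    payload = b"timestamp,sensor_id,temperature,humidity,pressure\n1,7,21.5,40,1013\n"
    stored = compress_one(payload)             # only this goes to disk
    assert decompress_one(stored) == payload   # same SHARED_DICT required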
The overall space saved depends on many factors, including how useful the initial priming dictionary is and the details of the specific compressor. LZ77-style implementations (such as DEFLATE, used by zip/zlib) are particularly well-suited to the approach above because their sliding window acts as a lookup dictionary.
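For a ready-made version of this idea, the third-party zstandard package (Python bindings for Zstandard) can train a dictionary from sample data and then reuse it for every file. A rough sketch under those assumptions; the placeholder samples and the 4 KiB dictionary size are illustrative, and training can fail if the sample set is too small or unrepresentative:

    import zstandard as zstd

    # Placeholder samples sharing structure; in practice pass the raw
    # bytes of your real files.
    samples = [
        (b"sensor=temp;unit=celsius;values=" +
         ",".join(str((i * j) % 97) for j in range(60)).encode())
        for i in range(2000)
    ]

    # Train the shared dictionary (the "zip basis") once and store it once.
    shared = zstd.train_dictionary(4 * 1024, samples)

    # Compress / decompress every file against that same dictionary.
    cctx = zstd.ZstdCompressor(dict_data=shared)
    dctx = zstd.ZstdDecompressor(dict_data=shared)

    frame = cctx.compress(samples[0])
    assert dctx.decompress(frame) == samples[0]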
Alternatively, it may be possible to use domain-specific knowledge and/or encoding to achieve better compression with a specialized compression scheme. One example is SQL Server's Page Compression, which exploits data similarities between columns on different rows.
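For the numpy arrays in the question, one domain-specific angle is to exploit the similarity directly: keep one reference array as the shared basis and store only per-sample deltas, which are mostly exact zeros when samples largely overlap and therefore compress far better. A rough sketch, assuming the samples really do share most of their values (names are illustrative):

    import numpy as np

    # Hypothetical: samples that share most of their values with one another.
    rng = np.random.default_rng(0)
    base = rng.random(10_000)
    samples = []
    for _ in range(50):
        s = base.copy()
        idx = rng.integers(0, s.size, 100)   # only ~100 entries differ per sample
        s[idx] += rng.random(idx.size)
        samples.append(s)

    # Shared basis: one reference array, stored once.
    reference = samples[0]

    # Deltas are mostly exact zeros, so the DEFLATE pass inside
    # savez_compressed squeezes them far better than the raw arrays.
    deltas = {f"delta_{i}": s - reference for i, s in enumerate(samples)}
    np.savez_compressed("dataset_deltas.npz", reference=reference, **deltas)

    # Reconstruction: add the shared reference back to the stored delta.
    with np.load("dataset_deltas.npz") as archive:
        restored = archive["reference"] + archive["delta_3"]
        assert np.allclose(restored, samples[3])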