The degree of compression achieved by any compression algorithm obviously depends on the data provided. However, there is also clearly some overhead added purely by virtue of the data being in a compressed format.
I'm working on a process that compresses data of various types. I know much of the data will be very small, though it will also frequently be large enough to benefit from some level of compression. While I can probably just experimentally determine some minimum size below which compression is skipped, and that would work well enough, I am curious whether there is a clear point where compression is definitely not worth it.
Running some tests using zip, I compressed a series of files containing 10, 100, and 1000 bytes respectively of either random data or the alphabet repeated. For example, here is the content of the 100-byte alphabet file:
abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqrstuvwxyz abcdefghijklmnopqr
I was fairly surprised to find that the zipped version of the file was 219 bytes, despite the level of redundancy. For comparison, the 100-byte file of random data became 272 bytes.
However, the 1000 byte alphabet file compressed all the way down to 227 bytes, while the random file increased to 1174.
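For reference, here is roughly how the files can be generated and measured. This is only a sketch using Python's zipfile module with DEFLATE rather than the zip command itself, so the exact byte counts may differ slightly (the archive overhead includes the stored file name and other metadata):

```python
import io
import os
import string
import zipfile

def zipped_size(name: str, data: bytes) -> int:
    """Size in bytes of a single-file zip archive containing `data`."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(name, data)
    return len(buf.getvalue())

pattern = (string.ascii_lowercase + " ").encode()  # "abcdefghijklmnopqrstuvwxyz "

for n in (10, 100, 1000):
    alphabet_data = (pattern * (n // len(pattern) + 1))[:n]
    random_data = os.urandom(n)
    print(f"{n:>5} bytes: alphabet -> {zipped_size('alpha.txt', alphabet_data)}, "
          f"random -> {zipped_size('random.bin', random_data)}")
```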
Is there a clear minimum file size below which even the most redundant files will not benefit from this type of compression?
Something between 250 and 500 bytes would be a decent threshold depending on the level of redundancy and assuming the time spent compressing the data is negligible.
I got to this by realizing that fully redundant data (every byte the same) would likely result in the greatest level of compression.
Re-running the same tests with data read from /dev/zero, I found that the compressed file length was not really that variable:
Uncompressed | Compressed | Percent Size
-------------+------------+-------------
   100 bytes |  178 bytes | 178%
   200 bytes |  178 bytes | 89%
   300 bytes |  179 bytes | 60%
   400 bytes |  180 bytes | 45%
   500 bytes |  180 bytes | 36%
   ...
  1000 bytes |  185 bytes | 19%
This makes a decent case for the answer technically being 178 bytes (I tested a 178-byte input and the compressed file came out to 178 bytes as well).
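As a sanity check on that break-even point, one can search for the smallest all-zeros input whose zipped size is no larger than the input itself. Here is a small sketch using Python's zipfile module; the number it prints will not match the 178 bytes from the zip command exactly, since the amount of per-archive metadata (file name length, extra fields) differs between tools:

```python
import io
import zipfile

def zipped_size(data: bytes) -> int:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("z", data)  # a one-character name keeps the metadata minimal
    return len(buf.getvalue())

# Smallest run of zero bytes for which zipping is no longer a net loss.
n = 1
while zipped_size(b"\x00" * n) > n:
    n += 1
print("break-even at", n, "bytes of zeros")
```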
However, I think the alphabet test is probably a bit closer to a practical best case of redundancy (without knowing much about how DEFLATE looks for redundancy).
Using various files in the same format as in the question, I found the following:
Uncompressed | Compressed | Percent Size
-------------+------------+-------------
   100 bytes |  212 bytes | 212%
   200 bytes |  212 bytes | 106%
   300 bytes |  214 bytes | 71%
   400 bytes |  214 bytes | 54%
   500 bytes |  214 bytes | 43%
   ...
  1000 bytes |  221 bytes | 22%
And unsurprisingly 212 seems to be a fixed point for this type of file.
Lastly, I decided to try a more direct approach with lorem ipsum text and eventually found that 414 bytes was the fixed point there.
Based on all of this, I would assume something between 250 and 500 bytes would be a reasonable lower limit for skipping compression of general text that may or may not have some level of redundancy on average. One may even want to go higher if benchmarking reveals that the time the compression takes isn't worth the minor benefit in space.
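If it helps, here is one way to turn that threshold into a policy. This is only a sketch, and MIN_COMPRESS_SIZE is an assumed constant you would tune against your own data; it also falls back to storing the raw bytes whenever compression does not actually save space, which makes the exact threshold less critical. Note that the tables above measure zip archives, whose overhead includes the file name and central directory; a bare DEFLATE stream via zlib carries only a few bytes of fixed overhead, so the practical cutoff also depends on which container you use.

```python
import zlib

MIN_COMPRESS_SIZE = 256  # assumed threshold, tune with your own data

def pack(data: bytes) -> bytes:
    """Return the payload prefixed with a 1-byte flag: 1 = deflated, 0 = raw."""
    if len(data) >= MIN_COMPRESS_SIZE:
        compressed = zlib.compress(data)
        if len(compressed) < len(data):
            return b"\x01" + compressed
    return b"\x00" + data

def unpack(blob: bytes) -> bytes:
    flag, payload = blob[:1], blob[1:]
    return zlib.decompress(payload) if flag == b"\x01" else payload
```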