file pdf zip compression deflate

Compression Gzip/7Zip


When I calculate the entropy of files compressed with Gzip, PKZIP, 7-Zip and WinRAR, I find that Gzip compresses better than the others: the entropy is higher (indicating less redundancy) and the file size is smaller. Even for small files, the overhead of Gzip is lower than that of the other tools. To be fair, this is not the case for all file formats; for xlsx, for example, 7-Zip and PKZIP give better results than Gzip and WinRAR. Still, I'm quite surprised, because 7-Zip is generally considered the better compression algorithm in the sense that it reduces the file size more, and that does not match my results. Or did I do something completely wrong?

I did not base these results on just a few files: I compressed a large number of files in different formats and calculated the difference in file size with Python.

I also find the following quite interesting. I would expect that PDF files, especially PDF 1.5 or higher, can hardly be compressed by a lossless compression algorithm, as they are already heavily compressed internally. But I don't see much difference between PDF versions below 1.5 and versions 1.5 and above; both are compressed quite heavily by these tools.

By the way, I used the default algorithms and settings of these archivers.

Can someone explain how or why this is the case (maybe I'm doing something wrong), or whether these results do make sense? I can't find anything on the internet that supports them.


Solution

  • "The entropy value is higher (indicating less redundancy) ...". Entropy is always relative to a model of the data. If you are using zeroth-order entropy, it can only indicate that the data has been compressed (or encrypted) and appears random. If the result is close to the number of bits per symbol you are measuring (8 when the symbols are bytes), which I'm sure it is in this case, then it can't be used to compare the effectiveness of compression.

    "... and the file size is smaller." That's the only way to compare the effectiveness of compression.

    The tools you mention, except for gzip, can each employ several different compression methods. Each of them (including gzip) also has compression levels, i.e. how hard it works at it, that can be specified. If you're going to benchmark compression methods, you need to at least say which methods were used and what parameters were given to them.

    Though you don't need to bother: many such benchmarks have already been done for you. Google "compression benchmark".
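To make the two points above concrete, here is a minimal Python sketch, not a benchmark. It computes zeroth-order (byte-frequency) entropy and compares output sizes with the method and level stated explicitly. The function names `zeroth_order_entropy` and `compare` are made up for illustration. Python's `zlib` module stands in for the DEFLATE method that gzip uses, and `lzma` for the LZMA family that 7-Zip uses by default; this is an assumption of the sketch, and the real archivers add their own container overhead and have their own defaults, so the numbers will not match the tools exactly.

```python
import lzma
import math
import os
import zlib
from collections import Counter


def zeroth_order_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, modelling each byte as independent."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def compare(data: bytes) -> dict:
    """Compress with explicitly stated methods and levels.

    Returns {label: (compressed size in bytes, entropy of the output)}.
    """
    candidates = {
        "zlib level 1": zlib.compress(data, 1),
        "zlib level 9": zlib.compress(data, 9),
        "lzma preset 6": lzma.compress(data, preset=6),
    }
    return {
        label: (len(out), zeroth_order_entropy(out))
        for label, out in candidates.items()
    }


if __name__ == "__main__":
    # Hypothetical sample input: highly redundant text.
    sample = b"the quick brown fox jumps over the lazy dog\n" * 2000
    print(f"original: {len(sample)} bytes, "
          f"{zeroth_order_entropy(sample):.3f} bits/byte")
    for label, (size, entropy) in compare(sample).items():
        print(f"{label}: {size} bytes, {entropy:.3f} bits/byte")

    # Random bytes also score close to 8 bits/byte without being compressed.
    noise = os.urandom(1 << 16)
    print(f"random: {zeroth_order_entropy(noise):.3f} bits/byte")
```

Any well-compressed or random-looking output lands near 8 bits per byte under this model, so the entropy figure says little about which compressor did better. The size column, reported together with the method and level that produced it, is the comparison that matters.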