java file compression zip rar

Java - Calculate File Compression


Is there a way to estimate the possible compression ratio of a file just by reading it?
Some files are more compressible than others, and my software has to tell me the percentage of possible compression for my files.

e.g.
Compression Ratio: 50% -> I can save 50% of my file's space if I compress it
Compression Ratio: 99% -> I can save only 1% of my file's space if I compress it


Solution

  • First, you need some information theory. There are two relevant theories in the field:

    1. According to Shannon, one can compute the entropy of a source (i.e. a lower bound on its compressed size) from its symbol probabilities. So the smallest achievable compressed size is defined by a statistical model that produces symbol probabilities at each step. All compression algorithms use that approach, implicitly or explicitly. See the Wikipedia article for more details.
    2. According to Kolmogorov, the smallest compressed size is the size of the smallest possible program that produces the source. In that sense, it is not computable. Some programs partially use that approach to compress data (e.g. you can write a small console application that produces 1 million digits of pi instead of zipping those 1 million digits).
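
    To make the first point concrete: for a source whose symbols occur with probabilities p_i, Shannon's entropy and the size bound it implies for a file of N symbols can be written as below. The names p_i, H and N are notation introduced here, and the bound assumes a model that treats symbols as independent of each other.

```latex
H = -\sum_{i} p_i \log_2 p_i \quad \text{(bits per symbol)}, \qquad
\text{size}_{\min} \approx \frac{N \cdot H}{8} \ \text{bytes}
```

    For example, with three symbols of probability 1/2, 1/4 and 1/4, H = 1.5 bits per symbol, so a file of 1000 such symbols needs at least about 1500 bits (roughly 188 bytes) under this model.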

    So you can't know the exact compressed size without actually compressing the data. But if an approximation is enough, you can rely on Shannon's entropy theory and build a simple statistical model. Here is a very simple approach:

    1. Compute order-1 statistics for each symbol in the source file.
    2. Calculate entropy by using those statistics.
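
    The two steps above can be sketched in Java roughly as follows. This is a minimal sketch under assumptions, not the answer's own code: it uses the simplest possible model, plain per-byte frequencies (often called order-0; a model that conditions on the previous byte would need a 256x256 count table instead), and the class and method names (EntropyEstimator, entropyBitsPerByte, estimatedRatio) are made up for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original answer.
class EntropyEstimator {

    /** Shannon entropy of the byte frequencies, in bits per byte (0.0 to 8.0). */
    static double entropyBitsPerByte(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        // Step 1: count how often each of the 256 byte values occurs.
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xFF]++;
        }
        // Step 2: H = -sum(p * log2(p)) over all byte values that occur.
        double entropy = 0.0;
        for (long c : counts) {
            if (c > 0) {
                double p = (double) c / data.length;
                entropy -= p * (Math.log(p) / Math.log(2));
            }
        }
        return entropy;
    }

    /** Estimated compressed size as a fraction of the original size. */
    static double estimatedRatio(byte[] data) {
        return entropyBitsPerByte(data) / 8.0;
    }

    public static void main(String[] args) throws IOException {
        // Reads the whole file into memory; fine for a sketch, not for huge files.
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.printf("Entropy: %.3f bits/byte%n", entropyBitsPerByte(data));
        System.out.printf("Estimated compression ratio: %.1f%%%n",
                100 * estimatedRatio(data));
    }
}
```

    The ratio follows the convention used in the question: a value near 100% means almost nothing can be saved, while a low value means the file should compress well.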

    Your estimate will often be in the same range as the result of ZIP's default compression algorithm (deflate), although deflate also removes repeated sequences, so it can do noticeably better on repetitive data. Here is a more advanced version of the same idea (be aware that it uses a lot of memory!). It uses entropy to determine block boundaries, segmenting the file into blocks of homogeneous data.
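
    To check an estimate against what deflate really achieves, one option is to compress the data in memory and compare sizes. The sketch below uses the JDK's java.util.zip.Deflater for that; the class and method names (DeflateRatio, deflatedSize, actualRatio) are hypothetical, and it assumes the whole file fits in memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.Deflater;

// Hypothetical helper class, not part of the original answer.
class DeflateRatio {

    /** Number of bytes produced by raw deflate at the default level. */
    static long deflatedSize(byte[] data) {
        // nowrap = true gives raw deflate without the zlib header and checksum,
        // which is the form used inside ZIP entries.
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        try {
            deflater.setInput(data);
            deflater.finish();
            byte[] buffer = new byte[8192];
            long total = 0;
            // The output is discarded; only its length matters here.
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Compressed size as a fraction of the original size. */
    static double actualRatio(byte[] data) {
        if (data.length == 0) {
            return 1.0; // nothing to save on an empty file
        }
        return (double) deflatedSize(data) / data.length;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.printf("Deflate compression ratio: %.1f%%%n",
                100 * actualRatio(data));
    }
}
```

    This costs a full compression pass, so it trades speed for accuracy compared with the entropy estimate.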