Any theoretical limit to compression?

Imagine that you had all the supercomputers in the world at your disposal for the next 10 years. Your task was to compress 10 full-length movies losslessly as much as possible. Another criteria was that a normal computer should be able to decompress it on the fly and should not need to spend much of his HD to install the decompressing software.

My question is, how much more compression could you achieve than the best alternatives today? 1%, 5%, 50%? More specifically: is there a theoretical limit to compression, given a fixed dictionary size (if it is called that for video compression as well)?

Solution

It is important to redefine the limits with the latest developments regarding information theory. For this purpose, I recommend the following article which is very detailed and clear.

A Modern approach to compression limit

In order to define a compression limit it is essential to report the hypotheses for which the limit is valid.

In information theory, 3 fundamental hypotheses are used which are the following:

the information is defined by the entropy function H(X).
the information that identifies the source is known both by the encoder and by the decoder.
the source and its isomorphisms are considered. It means that we can decode a symbol at a time.

First limit, the most famous, defined by Shannon in which all 3 hypotheses are true.

NH(X) With H(X) entropy of the source X.

Second limit. we remove the second hypothesis the decoder does not know the source.

NH(X)+source information

Third limit, let's remove the third hypothesis. In this case, the Set Shaping Theory SST is used, a new method that is revolutionizing information theory. This theory studies the one-to-one functions f that transform a set of strings into a set of equal size made up of strings of greater length. With this method, we get the following limit:

N2H(Y)+ source information≈NH(X) with f(X)=Y and N2>N.

In practice, we obtain a gain in terms of compression equivalent to the information necessary to describe the source is obtained. The information needed to describe the source represents the inefficiency of the entropy coding.

In this case, however, it is not possible to decode a symbol at a time (the code is not instantaneous) but the message must be decoded in full before obtaining the original message.

Important progress has been made in this area. It was possible to apply this theory to a concrete case of data compression "Practical applications of Set Shaping Theory in Huffman coding".

Another interesting aspect is that the authors shared the code and the function that performs the transform described in the set shaping theory. The file is shared on Matlab file exchange: https://www.mathworks.com/matlabcentral/fileexchange/115590-test-sst-huffman-coding?s_tid=FX_rc1_behav