Search code examples
compressionentropyinformation-theory

Is there an algorithm for "perfect" compression?


Let me clarify, I'm not talking about perfect compression in the sense of an algorithm that is able to compress any given source material, I realize that is impossible. What I'm trying to get at is an algorithm that is able to encode any source string of bits to it's absolute maximum compressed state, as determined by it's Shannon entropy.

I believe I have heard some things about Huffman Coding being in some sense optimal, so I believe that this encryption scheme might be based off that, but here is my issue:

Consider the bit-strings: a = "101010101010", b = "110100011010".

Using plain Shannon entropy, these bit strings should have the exact same entropy when we consider the bit strings as simply symbols of 0's and 1's, but this approach is flawed, because we can intuitively see that bitstring a has less entropy than bitstring b because it is simply a pattern of repeated 10's. With this in mind, we could get a better idea of the actual entropy of the source by calculating the Shannon entropy for the composite symbols 00, 10, 01, and 11.

This is just my understanding, and I could be totally off base, but from what I understand, for an ergodic source to be truly random, for an ergodic source with length n. the statistical probability of all n-length groups of symbols must be equally likely.

I suppose to be more specific than the question in the title, I have three main questions:

Does Huffman encoding using single bits as symbols compress a bitstring like a optimally, even with an obvious pattern that occurs when we analyze the string at the level of 2-bit symbols? If not, could one optimally compress a source by cycling through different "levels" (sorry if I'm butchering the terminology here) of Huffman coding until the best compression rate is found? Could going through different "rounds" of Huffman coding further increase the compression rate in some instances? (e.a. first go through Huffman coding with symbols that are 5 bits long, then going through Huffman coding for symbols that are 4 bits long? huff_4bits(huff_5bits(bitstring)) )


Solution

  • As stated by Mark, the general answer is "no", due to Kolmogorov complexity. Let me expand a bit on that.

    Compression is basically two steps : 1) Model 2) Entropy

    The role of the model is to "guess" the next bytes or fields to come. Model can have any form, and there is no limit to its effectiveness. A trivial example is a random number generator function : from an external perspective, it looks like a noise, and therefore cannot be compressed. But if you know the generation function, an infinitely long sequence can be compressed into a small set of code, the generator function.

    That's why there is "no limit", and Kolmogorov complexity just states that : you can never guarantee that there is not a better way to "model" the data.

    The second part is computable : Entropy is where you find the "Shannon Limit". Given a set of symbols (typically, the output symbols from the model), which are part of an alphabet, you can compute the optimal cost, and find a way to reach the proven ultimate compression limit, which is the Shannon limit.

    Huffman is optimal with regards to the Shannon limit if you accept the limitation that each symbol must be encoded using an integer number of bits. This is close but imperfect approximation. Better compression can be achieved by using fractional bits, which is what Arithmetic Coders do offer, or the more recent ANS-based Finite State Entropy coder. Both get much closer to the Shannon limit.

    The Shannon limit only applies if you treat a set of symbols "individually". As soon as you try to "combine them", or find any correlations between the symbols, you are "modeling". And this is the territory of Kolmogorov Complexity, which is not computable.