I was wondering how far we can take lossless data compression; I wasn't able to find an online simulator of lossless algorithms to do some empirical testing. I could just build one on my own, but unfortunately I don't have enough time at the moment; still, I'm curious about an intuition I had, which I'm going to explain.
Let's take just two of the more popular algorithms: Huffman coding and run-length encoding (RLE).
Let's suppose that we have an alphabet of A symbols and an arbitrarily long sequence of symbols from that alphabet. For example:
Alphabet = {A, B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z}
Sequence1 = SFBJNERUSNJGSDKKDEIJGNMSDJDDSUSJNF
Sequence2 = MNMNMNREGUHSDFJUF
Sequence3 = MMMMMNNNNNASDUERJGASIUJMMMNNNUNS
Now, if we just code each symbol with a fixed-length word of n bits, we have the uncompressed sequence, which is, say, N bits long.
If we code a sequence using Huffman, we will use H bits instead of N bits, saving (1-H/N)*100% of the space.
If we code that same sequence using RLE, we will use R bits, saving (1-R/N)*100%.
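To make these quantities concrete, here is a minimal Python sketch (toy implementations of my own, assuming 5-bit fixed-length codes for the 26-letter alphabet, 8-bit run counts for RLE, and ignoring the cost of transmitting the Huffman table):

    import heapq
    from collections import Counter

    def huffman_code_lengths(seq):
        """Build a Huffman tree over symbol frequencies; return {symbol: code length}."""
        freq = Counter(seq)
        if len(freq) == 1:                  # degenerate case: one symbol still needs 1 bit
            return {next(iter(freq)): 1}
        # Heap entries: (weight, tiebreaker, {symbol: depth}); the unique
        # tiebreaker keeps the dicts from ever being compared.
        heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            w1, _, d1 = heapq.heappop(heap)
            w2, _, d2 = heapq.heappop(heap)
            merged = {s: d + 1 for s, d in {**d1, **d2}.items()}
            heapq.heappush(heap, (w1 + w2, tie, merged))
            tie += 1
        return heap[0][2]

    def huffman_bits(seq):
        lengths = huffman_code_lengths(seq)
        return sum(lengths[s] for s in seq)

    def rle_bits(seq, symbol_bits=5, count_bits=8):
        """RLE as (symbol, run length) pairs: 5 bits per symbol, 8 bits per count."""
        runs = 1 + sum(a != b for a, b in zip(seq, seq[1:]))
        return runs * (symbol_bits + count_bits)

    for seq in ["SFBJNERUSNJGSDKKDEIJGNMSDJDDSUSJNF",
                "MNMNMNREGUHSDFJUF",
                "MMMMMNNNNNASDUERJGASIUJMMMNNNUNS"]:
        N = len(seq) * 5                    # fixed 5-bit codes cover 26 letters
        H, R = huffman_bits(seq), rle_bits(seq)
        print(f"N={N:4d}  Huffman={H:4d} ({1 - H/N:6.1%})  RLE={R:4d} ({1 - R/N:6.1%})")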
I wonder whether, by applying RLE + Huffman or Huffman + RLE, we could achieve more space savings than by using only one of them.
It seems a pretty basic idea to me, but googling around I didn't manage to find anything interesting on this topic.
EDIT: Hmm, I probably did not consider that if I use RLE first, I won't be able to use Huffman afterwards, because the correspondence with the fixed-length code of a single symbol would be lost; but it should still be possible to use Huffman first and RLE after that.
By the way, I am interested in the logic behind this, i.e., running more than one lossless compression algorithm in series.
EDIT 2: While I was commenting on Mark Adler's answer, I realized that I may have found the answer to my question. Here it is:
How could Huffman, which is a symbol-to-symbol code, affect the detection of repetitions?
Let's say we have the following string: AABCABAAB.
In plain binary, it would be encoded as 00 00 01 10 00 01 00 00 01 (the spaces are just for readability).
Huffman would code it as 0 0 11 10 0 11 0 0 11. The spaces make it even clearer that the structure of the string is unchanged, so it is possible to detect exactly the same repetitions, as long as we approach the code at this level (symbol-wise).
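Here is a tiny sketch of that observation, using the (assumed) code tables from this example:

    # A prefix code replaces each symbol with a fixed codeword, so a run of k
    # identical symbols becomes k copies of the same codeword.
    fixed   = {"A": "00", "B": "01", "C": "10"}   # 2-bit fixed-length baseline
    huffman = {"A": "0", "B": "11", "C": "10"}    # the assumed Huffman code above

    def encode(s, table):
        return " ".join(table[c] for c in s)

    print(encode("AABCABAAB", fixed))     # 00 00 01 10 00 01 00 00 01
    print(encode("AABCABAAB", huffman))   # 0 0 11 10 0 11 0 0 11
    # The repetition structure (positions of equal codewords) is identical.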
This led me to another point (which I won't discuss here, given that this is already a very long comment): analyzing the code bit-wise.
So I actually think I have come to a conclusion. As we know, any dictionary-based (or substitution-based) encoder maps a variable number of symbols to fixed-length codewords, while Huffman (or any other entropy encoder) maps fixed-length symbols to a variable number of bits, thus approaching the entropy limit. Ergo, it would be pointless (and only a waste of computation) to let Huffman run first, since the other algorithm will most likely introduce more redundancy, which Huffman would have been able to reduce.
Of course this is only a theoretical argument, because in practice we have to take other factors into account (e.g. computation time)... not to mention the fact that different configurations of the string to be coded could lead to different results. But, well... it makes sense to me. :)
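As a rough sanity check of this conclusion, here is a toy experiment on Sequence3, under the same assumptions as the earlier sketch (homemade encoders, 8-bit run counts, code-table transmission cost ignored), comparing the two orderings:

    import heapq
    from collections import Counter

    def huffman_lengths(tokens):
        """Code length per token, from a Huffman tree built on the stream's own stats."""
        freq = Counter(tokens)
        if len(freq) == 1:
            return {next(iter(freq)): 1}
        heap = [(w, i, [s]) for i, (s, w) in enumerate(freq.items())]
        heapq.heapify(heap)
        depth, tie = Counter(), len(heap)
        while len(heap) > 1:
            w1, _, s1 = heapq.heappop(heap)
            w2, _, s2 = heapq.heappop(heap)
            for s in s1 + s2:
                depth[s] += 1
            heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
            tie += 1
        return depth

    def canonical_codes(lengths):
        """Assign canonical prefix codewords from Huffman code lengths."""
        codes, code, prev = {}, 0, 0
        for s, l in sorted(lengths.items(), key=lambda kv: (kv[1], str(kv[0]))):
            code <<= l - prev
            codes[s] = format(code, "0%db" % l)
            code, prev = code + 1, l
        return codes

    def rle_tokens(seq):
        """Collapse runs into (symbol, run length) tokens."""
        out, run = [], 1
        for a, b in zip(seq, seq[1:]):
            if a == b:
                run += 1
            else:
                out.append((a, run))
                run = 1
        out.append((seq[-1], run))
        return out

    seq = "MMMMMNNNNNASDUERJGASIUJMMMNNNUNS"   # Sequence3

    # RLE first, then Huffman over the (symbol, count) tokens:
    tokens = rle_tokens(seq)
    lengths = huffman_lengths(tokens)
    rle_then_huff = sum(lengths[t] for t in tokens)

    # Huffman first, then a toy bit-level RLE over the output
    # (one start bit plus an 8-bit length per run; runs of bits alternate):
    codes = canonical_codes(huffman_lengths(seq))
    bitstring = "".join(codes[c] for c in seq)
    huff_then_rle = 1 + 8 * len(rle_tokens(bitstring))

    print("Huffman alone :", len(bitstring), "bits")
    print("RLE -> Huffman:", rle_then_huff, "bits")
    print("Huffman -> RLE:", huff_then_rle, "bits")

At least on inputs like this, the Huffman output looks nearly random at the bit level, so its runs are short and the RLE pass only adds overhead, which seems consistent with the reasoning above.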
Those sorts of combinations are done routinely. You should read some books on compression.
LZ77 is a more general form of RLE that copies repeats of previous strings. (A match with distance one and length n codes a run of n copies of the last byte.) LZ77 is normally followed by Huffman, arithmetic, or range coding.
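To make the distance-one case concrete, here is a toy decoder sketch (the token format is invented purely for illustration):

    def lz77_decode(tokens):
        """Decode a toy LZ77 stream of ('lit', byte) and ('match', distance, length).
        A match may overlap its own output: distance 1, length n copies the last
        byte n times, i.e. it encodes a run, which is why LZ77 generalizes RLE."""
        out = bytearray()
        for t in tokens:
            if t[0] == "lit":
                out.append(t[1])
            else:
                _, dist, length = t
                for _ in range(length):     # byte by byte, so overlapping copies work
                    out.append(out[-dist])
        return bytes(out)

    # 'A' followed by a distance-1, length-4 match decodes to the run "AAAAA".
    print(lz77_decode([("lit", ord("A")), ("match", 1, 4)]))   # b'AAAAA'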
Huffman should follow LZ77 or RLE, not precede it. The Huffman coding will make it harder, not easier, to detect the repetitions. In order to use RLE first, you simply extend your symbol set beyond the literals. As one example, the Deflate format used in zip, gzip, zlib, etc. has a 286-symbol set to code the 256 literal bytes, 29 length prefix codes, and one end-of-stream code. Each of the 29 length prefix codes implies a suffix of zero to five bits that follows the code and is added to a base value to get the length. The presence of a length code and its suffix implies that it is followed by another Huffman code, one of 30 distance code prefixes, each with a suffix of 0 to 13 bits, to specify the distance back to the match.
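For a quick empirical check, Python's zlib module produces this same Deflate stream (wrapped in a small zlib header and checksum); the inputs are repeated a few times here just to give the LZ77 stage something to match:

    import zlib

    for s in [b"MMMMMNNNNNASDUERJGASIUJMMMNNNUNS" * 10,
              b"SFBJNERUSNJGSDKKDEIJGNMSDJDDSUSJNF" * 10]:
        packed = zlib.compress(s, 9)    # level 9: most thorough match search
        print(len(s), "->", len(packed), "bytes")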
There are many other transforms (which may or may not compress on their own) that are followed by entropy coding. These include the Burrows-Wheeler transform (BWT); the move-to-front transform (MTF), which often follows BWT; the discrete cosine transform for images (which can be done losslessly, but is most often used in lossy compression algorithms); etc. The transforms group commonality in the data in a reversible manner, providing more gain to the entropy coding step.
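As a sketch of how the first two of these fit together, here are naive (quadratic, illustration-only) implementations of BWT and MTF; note how the transformed output clusters identical symbols, turning the MTF output into runs of small numbers that an entropy coder handles well:

    def bwt(s):
        """Naive Burrows-Wheeler transform: sort all rotations, take the last column.
        '\\x00' is a sentinel assumed absent from the input, making the transform invertible."""
        s = s + "\x00"
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def mtf(s, alphabet):
        """Move-to-front: emit each symbol's index in a table, then move it to the front.
        Symbols clustered by BWT come out as runs of small numbers."""
        table = list(alphabet)
        out = []
        for c in s:
            i = table.index(c)
            out.append(i)
            table.insert(0, table.pop(i))
        return out

    t = bwt("MMMMMNNNNNASDUERJGASIUJMMMNNNUNS")
    print(repr(t))
    print(mtf(t, "\x00" + "ABCDEFGHIJKLMNOPQRSTUVWXYZ"))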