c# string encoding compression text-compression

Encode/decode a string over a shared (non-standard) charset into a minimal byte array

I'm looking for a generic algorithm that encodes a string over a defined character set into a byte array, and decodes it back. It must use minimal space.

I started developing my own, which is a kind of base-n to base-2 conversion, but I think something like this must already exist.

I need to encode strings that use a known, restricted charset in as few bits as possible. Maybe I should use bzip2?

Edit: My strings are at most 160 chars long. I can pad them if needed.

Edit2: I must know the worst-case number of bits.

byte[] encode(string charset, string value)

string decode(string charset, byte[] encodedValue)

Usage:

string myString = "HELLO WORLD";
string charset = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "; // Base 27
byte[] encodedString = encode(charset, myString); // Base 27 -> Base 2
Debug.Assert(myString.Equals(decode(charset, encodedString))); // Base 2 -> Base 27

Solution

You can use a simple, fast prefix code that uses either k or k-1 bits per character. Then the worst case is m·k bits for m characters.

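For base n, let k = ceiling(log₂(n)). Index the symbols from 0 to n-1. If the index, x, of the symbol is less than 2^k-n, then emit x as a (k-1)-bit integer. Otherwise, emit 2^k-n+x as a k-bit integer. (This scheme is known as truncated binary encoding.) For the 27-symbol charset in the question, k = 5 and 2^k-n = 5, so the first five symbols take 4 bits each and the other 22 take 5 bits each.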
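The rule above can be sketched in C# roughly as follows. This is a minimal illustration, not a tuned implementation. The class name PrefixCodec and the helpers BitsFor and ReadBit are made up for this example. It also assumes a one-byte length prefix (fine for strings of at most 160 chars), because the zero bits that pad the last byte could otherwise decode as extra symbols; that prefix adds 8 bits to the m·k worst case.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class PrefixCodec
{
    // Smallest k such that 2^k >= n, i.e. ceiling(log2(n)).
    private static int BitsFor(int n)
    {
        int k = 0;
        while ((1 << k) < n) k++;
        return k;
    }

    public static byte[] Encode(string charset, string value)
    {
        int n = charset.Length;
        if (n < 2) throw new ArgumentException("charset needs at least 2 symbols");
        if (value.Length > 255) throw new ArgumentException("value too long for a 1-byte length");
        int k = BitsFor(n);
        int shortCount = (1 << k) - n;   // symbols 0..shortCount-1 get k-1 bits

        // Assumption of this sketch: first byte holds the character count.
        var bytes = new List<byte> { (byte)value.Length };
        int acc = 0, accBits = 0;
        foreach (char c in value)
        {
            int x = charset.IndexOf(c);
            if (x < 0) throw new ArgumentException($"'{c}' is not in the charset");
            int code = x < shortCount ? x : x + shortCount;
            int len = x < shortCount ? k - 1 : k;
            for (int i = len - 1; i >= 0; i--)   // most significant bit first
            {
                acc = (acc << 1) | ((code >> i) & 1);
                if (++accBits == 8) { bytes.Add((byte)acc); acc = 0; accBits = 0; }
            }
        }
        if (accBits > 0) bytes.Add((byte)(acc << (8 - accBits)));   // zero-pad last byte
        return bytes.ToArray();
    }

    public static string Decode(string charset, byte[] data)
    {
        int n = charset.Length;
        if (n < 2) throw new ArgumentException("charset needs at least 2 symbols");
        int k = BitsFor(n);
        int shortCount = (1 << k) - n;
        int count = data[0];
        int pos = 8;                     // bit position, just past the length byte
        var sb = new StringBuilder(count);
        for (int s = 0; s < count; s++)
        {
            int v = 0;
            for (int i = 0; i < k - 1; i++) v = (v << 1) | ReadBit(data, pos++);
            // A (k-1)-bit value >= shortCount is the prefix of a k-bit code.
            if (v >= shortCount) v = ((v << 1) | ReadBit(data, pos++)) - shortCount;
            sb.Append(charset[v]);
        }
        return sb.ToString();
    }

    private static int ReadBit(byte[] data, int pos) =>
        (data[pos >> 3] >> (7 - (pos & 7))) & 1;
}
```

With the 27-symbol charset from the question, "HELLO WORLD" costs 53 bits (nine 5-bit symbols plus E and D at 4 bits each), so this sketch returns 8 bytes including the length byte. If every string is padded to a fixed length, the length byte can be dropped.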

This is much faster than base encoding and decoding, which require multiplication and division respectively. Let's look at an extreme case where the base encoding happens to fit as nicely as possible into 64 bits (other than the trivial cases where the base is a power of two, for example 2, 4, 16, or 256). The best case is when there are 138 symbols, where nine such symbols just fit into 64 bits, and you can use the machine multiplication and division instructions on 64-bit unsigned integers. 138⁹=18151468971815029248, which is 98.4% of 2⁶⁴=18446744073709551616. With the base encoding, there are 64/9 = 7.111 bits per symbol. With the above prefix encoding, there are an average of 7.145 bits per symbol when all symbols are equally likely: 118 symbols take 7 bits and 20 take 8 bits, so (118·7 + 20·8)/138 = 7.145.
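To check these figures for another charset, a small calculator along the following lines may help. It is only a sketch. BitBudget and its method names are invented for this example, and the average assumes all symbols are equally likely.

```csharp
using System;
using System.Numerics;

public static class BitBudget
{
    // Smallest k such that 2^k >= n, i.e. ceiling(log2(n)).
    public static int BitsFor(int n)
    {
        int k = 0;
        while ((1 << k) < n) k++;
        return k;
    }

    // Worst case of the prefix code: every one of the m characters costs k bits.
    public static int WorstCaseBits(int n, int m) => m * BitsFor(n);

    // Average bits per symbol of the prefix code for equally likely symbols.
    public static double AverageBits(int n)
    {
        int k = BitsFor(n);
        int shortCount = (1 << k) - n;   // symbols coded in k-1 bits
        return (shortCount * (k - 1) + (n - shortCount) * (double)k) / n;
    }

    // Bits needed to hold any m-digit base-n number: bit length of n^m - 1.
    // Computed exactly with BigInteger to avoid floating-point rounding.
    public static int BaseEncodingBits(int n, int m)
    {
        BigInteger v = BigInteger.Pow(n, m) - 1;
        int bits = 0;
        while (v > 0) { v >>= 1; bits++; }
        return bits;
    }
}
```

For the question's case (27 symbols, 160 chars), WorstCaseBits(27, 160) gives 800 bits (100 bytes), while BaseEncodingBits(27, 160) gives 761 bits (96 bytes). The base encoding saves a few bytes at the cost of big-integer arithmetic.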

The above prefix encoding is an optimal Huffman code for the case where all characters are of equal probability. If that is not the case and you would like to realize some compression, then you can either look at large samples of your data and generate a fixed Huffman code for the characters, or you can Huffman code each message individually. In the latter case you would have the overhead of transmitting the message-unique Huffman code with each message, which would require a certain compressibility and long messages to realize a gain.
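As a rough sketch of the fixed-Huffman option, the following computes a code length per character from sample counts. HuffmanSketch and CodeLengths are hypothetical names, and the repeated sort is a deliberately simple stand-in for a priority queue. Turning the lengths into actual bit patterns (for example as a canonical Huffman code) is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HuffmanSketch
{
    // Returns the Huffman code length in bits for each symbol, given how
    // often each symbol occurs in a sample. A single-symbol input yields 0.
    public static Dictionary<char, int> CodeLengths(IDictionary<char, long> counts)
    {
        // Each node is a subtree: its total weight and the symbols under it.
        var nodes = counts
            .Select(p => (Weight: p.Value, Symbols: new List<char> { p.Key }))
            .ToList();
        var lengths = counts.Keys.ToDictionary(c => c, c => 0);

        while (nodes.Count > 1)
        {
            // Take the two lightest subtrees.
            nodes = nodes.OrderBy(nd => nd.Weight).ToList();
            var a = nodes[0];
            var b = nodes[1];
            nodes.RemoveRange(0, 2);

            // Merging them pushes every symbol below one level deeper.
            foreach (char c in a.Symbols.Concat(b.Symbols)) lengths[c]++;

            a.Symbols.AddRange(b.Symbols);
            nodes.Add((a.Weight + b.Weight, a.Symbols));
        }
        return lengths;
    }
}
```

Feeding it equal counts for all 27 symbols of the question's charset gives five 4-bit codes and twenty-two 5-bit codes, the same lengths as the k / k-1 prefix code above. Skewed counts give shorter codes to frequent characters, but then the worst case per character can exceed k bits.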