Search code examples
javastringalgorithmencodingcompression

Compressing a string to a fixed length


I need to achieve string compression to a certain length.

The input is a string consisting of 3 to 16 characters, which can include the English alphabet in lowercase, digits (0-9), and underlines (or regex: ^[a-z0-9_]{3,16}$).

The English alphabet in any case, digits, underlines, brackets, and various other UTF-8 characters can be used to compress the original string. It must be possible to the reverse process: from the compressed string to the original string.

In my case, the specific length of the compressed string is 6. However, I do not know if it is possible to compress a 16-character string into a "6-character code". If this is not possible, then I suggest increasing it to 7-8 characters.

Example:
input: bestnickname
output: D_1(oP

I've been trying to find various algorithms to solve this problem, but I haven't found one.


Solution

  • If, instead of UTF-8 (as specified in the question), the OP really wants to encode into UTF-16, which is how characters are encoded in Java (since the type char is referenced in the comments and the question is tagged "Java"), then indeed the specified 3 to 16 symbols where each symbol takes one of 37 values can be encoded into six 16-bit values constrained to valid UTF-16, with no surrogate values required, i.e. the old UCS-2. In fact, only 14 bits need to be encoded into each UTF-16 character, so a subset of characters can be chosen to be more readable.

    To implement the encoder, first choose 16,384 Unicode characters within the valid code points 0x0000 to 0xffff (which excludes 0xd800 to 0xdfff) to permit in the encoded form. Make a map of 0..16383 to your chosen Unicode characters. Then assign an integer to each 3 to 16 37-value sequence. The calculation can be done using the BigInteger class. This result is an 84-bit integer. Break the 84-bit integer into six 14-bit pieces (fits perfectly!), and write the corresponding code point for each of the 14-bit pieces from your map.

    Here is pseudo-code to compute the integer corresponding to a sequence:

        // len is in 3..16, seq[i] is in 0..36
        bigint base = 0;
        bigint offset = 37 * 37;
        for (int i = 3; i < len; i++) {
            offset *= 37;
            base += offset;
        }
        offset = 0;
        for (int i = 0; i < len; i++)
            offset = 37 * offset + seq[i];
        return base + offset;
    

    Where you would need to use the BigInteger methods for the arithmetic operations.