Search code examples
pythonunicodeasciichrord

How to compress text length using Unicode and bases, and be able to revert it back?


There are 44 characters in this text: the quick brown fox jumped over the lazy dog

The same text can be represented with just 11 Unicode characters: 񜥎񐟾𴬔񇒉𚫔𮹂𓻣񥯨񜥎𵁼񽤙
(These characters look the same as "[]" but they are all different characters!)

This is because the ASCII characters can range from 1-27 (base 27 if you only use the 27 characters in this character set abcdefghijklmnopqrstuvwxyz ) and Unicode characters range from 1-1114112, which means you can store multiple numbers in a bigger number if you do indices-related math.

For example, the text this looks like [19, 7, 8, 18] if you convert each character to their index in the above base 27 character set. If you do the calculation below:

19 x 27 ^ 0 +  
7  x 27 ^ 1 +  
8  x 27 ^ 2 +  
18 x 27 ^ 3 = 360334

You will get a unique number 360334, which happens to be within 1-1114112 so you can do chr(360334) to get the Unicode character 񗾎. To go back, you do ord('񗾎') to get 360334, which you can continuously divmod to get back the numbers shown below:

360334 %  27 = 19
360334 // 27 = 13345
13345  %  27 = 7
13345  // 27 = 494
494    %  27 = 8
494    // 27 = 18
18     %  27 = 18
18     // 27 = 0 BREAK

The question is: How to make this as a convert and revert function in Python?

Here is my attempt:

def power_sum(values, base, offset = 0):
    return sum(value * base ** (index + offset) for index, value in enumerate(values))

def convert_text(text, chars):
    base = len(chars)
    chars =  {char : index for index, char in enumerate(chars)}
    temp = []
    result = ''
    for index, char in enumerate(text):
        value = chars[char] # indexerror = missing that char in char set
        if power_sum(temp, base, 1) + value > 0x10FFFF: # U+10FFFF is max unicode code point
            result += chr(power_sum(temp, base))
            temp = [value]
        else:
            temp.append(value)
    result += chr(power_sum(temp, base))
    return result
    
def revert_text(text, chars):
    base = len(chars)
    chars = list(chars)
    result = ''
    for char in text:
        value = ord(char)
        while value:
            result += chars[int(value % base)]
            value //= base
    return result

chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars), end = '\n\n')

texts = [
    'this',
    'the quick brown fox jumped over the lazy dog',
    'china'
]

for text in texts:
    print('Start text ({}): {}'.format(len(text), text))
    
    text = convert_text(text, chars)
    print('Unicode text ({}): {}'.format(len(text), text))
    
    text = revert_text(text, chars)
    print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')

Output:

Base: 27

Start text (4): this
Unicode text (1): 񗾎
Revert text (4): this

Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (11): 񽭂늺񒂴񿙳򁈌񊖞񇻉񿿸񽭂񷲄🖛
Revert text (44): the quick brown fox jumped over the lazy dog

Start text (5): china
Unicode text (2): 𿼎
Revert text (4): chin

It fails with the string china for some reason.


Solution

  • Thanks to chrslg's answer for showing that the code stores any number of trailing index 0 characters as 0 after the indices math. I fixed the code by shifting all the characters in the charset by 1, essentially making the 27 char set = base 28.

    def power_sum(values, base, offset = 0):
        return sum(value * base ** (index + offset) for index, value in enumerate(values))
    
    def convert_text(text, chars):
        base = len(chars) + 1
        chars =  {char : index + 1 for index, char in enumerate(chars)}
        temp = []
        result = ''
        for char in text:
            value = chars[char] # indexerror = missing that char in char set
            if value * base ** len(temp) + power_sum(temp, base, 1) > 0x10FFFF: 
                # U+10FFFF is max unicode code point
                result += chr(power_sum(temp, base))
                temp = [value]
            else:
                temp.append(value)
        result += chr(power_sum(temp, base))
        return result
        
    def revert_text(text, chars):
        base = len(chars) + 1
        chars = [None] + list(chars)
        result = ''
        for char in text:
            value = ord(char)
            while value:
                result += chars[int(value % base)]
                value //= base
        return result
    
    chars = 'abcdefghijklmnopqrstuvwxyz '
    print('Base:', len(chars) + 1, end = '\n\n')
    
    texts = [
        'this',
        'the quick brown fox jumped over the lazy dog',
        'china',
        'aaa',
        'aaaaa',
        'aaaaaaaa',
        'aaaab'
    ]
    
    for text in texts:
        print('Start text ({}): {}'.format(len(text), text))
        
        text = convert_text(text, chars)
        print('Unicode text ({}): {}'.format(len(text), text))
        
        text = revert_text(text, chars)
        print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')
    

    Output:

    Base: 28
    
    Start text (4): this
    Unicode text (1): 񧧄
    Revert text (4): this
    
    Start text (44): the quick brown fox jumped over the lazy dog
    Unicode text (12): 򑮄𑼭񡂟򓢳䬪񉱃򑠜񡥇𜞋򋧻񑖍
    Revert text (44): the quick brown fox jumped over the lazy dog
    
    Start text (5): china
    Unicode text (2): 񌳳
    Revert text (5): china
    
    Start text (3): aaa
    Unicode text (1): ̭
    Revert text (3): aaa
    
    Start text (5): aaaaa
    Unicode text (2): 壭
    Revert text (5): aaaaa
    
    Start text (8): aaaaaaaa
    Unicode text (2): 壭壭
    Revert text (8): aaaaaaaa
    
    Start text (5): aaaab
    Unicode text (2): 壭
    Revert text (5): aaaab