
Combine characters to create an even higher one?


I am trying to combine characters in the range chr(0) to chr(255) to create a character at or above chr(256). Is it possible to achieve this? Here is a basic version of what I am doing:

import random

import pyclip

# one to four random values, each in the range 0-255
rand_char = [random.randint(0, 255) for _ in range(random.randint(1, 4))]

# what to do here?
combined_char = create_higher_char(rand_char)

pyclip.copy(combined_char)

For example, chr(1499) can be obtained from '\xd7\x9b':

>>> chr(1499)
'כ'
>>> chr(1499).encode()
b'\xd7\x9b'

So when rand_char gives me '\xd7\x9b', I should be able to copy chr(1499) instead of '\xd7\x9b', whose individual values are:

>>> [ord(x) for x in '\xd7\x9b']
[215, 155]

I tried building a list of all possible encoded characters, but calling .encode() fails for some characters:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

How is it possible to get the value 1499 from [215, 155]?


Solution

  • The byte values you get from .encode() are not the same numbers you pass to chr() to create a character:

    chr() takes a single number and uses it as the Unicode code point of the character. Code points form a numeric space from 0 to 0x10FFFF (1,114,111), with some ranges unassigned or reserved for future use, and each assigned number corresponds to one character. The surrogate range 0xD800 to 0xDFFF is one such reserved range, which is why '\ud800' cannot be encoded.

    When you call .encode() without passing an argument, Python uses the utf-8 encoding, which is a variable-length encoding. The utf-8 byte sequence for a character does not, in general, match its code point: a set of prefix bytes indicates whether the character is encoded in 2, 3 or 4 bytes. Only characters in the 0-127 range, which correspond to ASCII, are represented as a single byte, and only for these does the utf-8 representation match the code point.
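To make the prefix-byte idea concrete, here is a minimal sketch of decoding a 2-byte utf-8 sequence by hand. The function name decode_two_byte_utf8 is hypothetical and chosen only for illustration; in practice bytes.decode() does this for you and also handles the 1-, 3- and 4-byte cases.

```python
def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Return the code point encoded by a 2-byte utf-8 sequence.

    Hypothetical helper for illustration only; it handles just the
    2-byte form, which looks like 110xxxxx 10yyyyyy in binary.
    """
    if lead & 0b11100000 != 0b11000000:
        raise ValueError("not the lead byte of a 2-byte sequence")
    if cont & 0b11000000 != 0b10000000:
        raise ValueError("not a continuation byte")
    # Strip the prefix bits and join the 5 + 6 payload bits.
    code_point = ((lead & 0b00011111) << 6) | (cont & 0b00111111)
    if code_point < 0x80:
        # Values below 0x80 must use the 1-byte form.
        raise ValueError("overlong encoding")
    return code_point


print(decode_two_byte_utf8(215, 155))  # 1499
```

For [215, 155] the payload bits are 23 and 27, and 23 * 64 + 27 gives 1499.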

    With that in mind, to get from the sequence b'\xd7\x9b' to the code point 1499, you first decode the bytes from their utf-8 representation, and then call the ord() built-in on the resulting character to get its code point. Given a one-character string, ord() is the inverse of chr(). (Given a bytes object of length one, it just returns that byte's value, which is not what you want here.)

    In [18]: b'\xd7\x9b'.decode("utf-8")
    Out[18]: 'כ'
    
    In [19]: ord(b'\xd7\x9b'.decode("utf-8"))
    Out[19]: 1499
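Applied to the question's list of integers, the same two steps might look like the sketch below. The name combine_bytes is hypothetical, and returning None for invalid input is just one possible design choice. Note that most random byte sequences are not valid utf-8, so None will be a frequent result.

```python
from typing import Optional, Sequence


def combine_bytes(values: Sequence[int]) -> Optional[str]:
    """Decode integers in 0-255 as utf-8 and return the single character.

    Hypothetical helper: returns None when the values are not valid
    utf-8 or do not decode to exactly one character.
    """
    try:
        # bytes() raises ValueError for values outside 0-255;
        # decode() raises UnicodeDecodeError for invalid utf-8.
        text = bytes(values).decode("utf-8")
    except (UnicodeDecodeError, ValueError):
        return None
    return text if len(text) == 1 else None


char = combine_bytes([215, 155])
if char is not None:
    print(char, ord(char))
```

The resulting string can then be passed to pyclip.copy() in place of the raw values.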
    

    If you want byte strings that match the character code points, one alternative is the utf-32 encoding: it always uses 4 bytes per character, and each group of 4 bytes, when converted to an int, matches the code point of that character. The "le" (little-endian) variant is used here because plain "utf-32" prepends a byte-order mark:

    In [26]: a = 'כ'
    
    In [27]: a.encode("utf-32le")
    Out[27]: b'\xdb\x05\x00\x00'
    
    In [28]: int.from_bytes(a.encode("utf-32le"), "little")
    Out[28]: 1499
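For strings longer than one character, the same idea can be extended by reading the utf-32 bytes four at a time. This is only a sketch; code_points_via_utf32 is a hypothetical name, and [ord(c) for c in text] gives the same result more directly.

```python
import struct


def code_points_via_utf32(text: str) -> list:
    """Return the code points of text by reading its utf-32 bytes.

    Hypothetical helper for illustration only.
    """
    raw = text.encode("utf-32-le")
    # "<I" is a little-endian unsigned 32-bit integer: one per character.
    return [cp for (cp,) in struct.iter_unpack("<I", raw)]


print(code_points_via_utf32("כA"))  # [1499, 65]
```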