Tags: python, unicode, unicode-escapes

How to dynamically generate Unicode characters from code points in Python?


I am trying to generate Unicode characters using their escape sequences dynamically in Python. Specifically, I want to generate strings like '\u1950', '\u1951', etc. However, I am facing issues due to escape characters or Unicode errors.

What I Tried: Attempt 1 (Using a raw string with concatenation):

data = []
da = r'\u'  # Raw string prefix

for x in range(1950, 2025):
    dat = da + str(x)
    data.append(dat)

print(data)

Output:

['\\u1950', '\\u1951', '\\u1952', ...]  # Backslashes are doubled

This is not what I expected. Instead of '\u1950', I am getting ['\\u1950', '\\u1951', ...], where each entry is treated as a normal string, not an actual Unicode escape sequence.

Attempt 2 (Direct String Concatenation – Causes SyntaxError):

box = []
for data in range(1950, 2025):
    string = '\u' + str(data)  # This raises an error
    box.append(string)

Error:

SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape

Attempt 3 (Using print()):

[print(r'\u' + str(x)) for x in range(1950, 2025)]

Output:

\u1950
\u1951
\u1952

But when I try to store these values in a list and print the list, they appear with extra backslashes:

data = []
da = r'\u'
for x in range(1950, 2025):
    data.append(da + str(x))
print(data)

Output:

['\\u1950', '\\u1951', '\\u1952', ...]

Solution

  • You have: ['\\u1950', '\\u1951', '\\u1952', ...]

    When a list containing strings like these is printed, Python shows each element using its repr() rather than its str(). That is expected and correct: the repr is meant to be text that you can type back into Python source code to recreate the same object. Each string still contains a single backslash; the doubled backslash appears only in the displayed form.

    If you want output that can be copied and pasted as Python source to produce a list of the Unicode characters themselves, then you need a representation with single backslashes, as you are asking. To get that, build the printout of the list and its elements yourself, because Python's automatic conversion of a list to text always uses the repr of each element.

    def custom_repr(sequence):
        return "[{}]".format(", ".join(f"'{item}'" for item in sequence))
    
    # and then, using your last code snippet:
    data = []
    da = r'\u'
    for x in range(1950, 2025):
        data.append(da + str(x))
    print(custom_repr(data))
    

    What "custom_repr" does here is take the "str" version of each item in the data sequence and enclose it in single quotes, adding the ", " separators and the surrounding [] pair. The result is plain text that can be printed or stored in a file; when it is used as Python code, each '\uXXXX' element is parsed as a single Unicode character, which is probably your goal.
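
As a quick check of that claim, the sketch below feeds the text produced by custom_repr to ast.literal_eval, which parses it the way Python source would be parsed. The shortened range and the names escaped, source_text and parsed are illustrative only, not part of the original code.

```python
import ast

def custom_repr(sequence):
    # Same helper as above: str() of each item, wrapped in single quotes.
    return "[{}]".format(", ".join(f"'{item}'" for item in sequence))

# Shortened range, for illustration only.
escaped = [r"\u" + str(x) for x in range(1950, 1953)]
source_text = custom_repr(escaped)
print(source_text)  # ['\u1950', '\u1951', '\u1952']

# Parsing the text as a Python literal turns each escape into one character.
parsed = ast.literal_eval(source_text)
print([len(s) for s in parsed])  # [1, 1, 1]
```

Each element of escaped is six characters long (a backslash, "u" and four digits), while each element of parsed is a single character.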


    Beyond that, the "unicode_escape" codec is also worth mentioning: it can convert the escaped strings in a plain list, like the one in your first output, into the actual Unicode characters. Depending on how you intend to use the output, this can be more convenient, since there is no need to customize the printout; you just decode the escape sequences that are already in the strings:

    my_list = ['\\u2020', '\\u2021', '\\u2022']
    my_chars = [char.encode("latin1").decode("unicode_escape") for char in my_list]
    print(my_chars)
    
    # outputs: ['†', '‡', '•']
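
If the end goal is the characters themselves rather than their escaped spelling, one possible alternative is to skip the escape text and build each character from its code point with the built-in chr(). The sketch below assumes, as the '\uXXXX' notation implies, that digits such as 1950 are meant as hexadecimal; the shortened range and the names code_points, chars, escaped and decoded are illustrative only.

```python
# Digits after \u are hexadecimal, so "1950" here means code point 0x1950.
# Assumption: the decimal loop values are meant to be read as hex digits,
# mirroring what r'\u' + str(x) spells out.
code_points = [int(str(x), 16) for x in range(1950, 1953)]
chars = [chr(cp) for cp in code_points]

# Cross-check against the unicode_escape route for the same escapes.
escaped = [r"\u" + str(x) for x in range(1950, 1953)]
decoded = [s.encode("latin1").decode("unicode_escape") for s in escaped]

print(chars == decoded)           # True
print([len(s) for s in escaped])  # [6, 6, 6]
print([len(c) for c in chars])    # [1, 1, 1]
```

Note that a decimal range never produces hex digits A to F, so code points such as U+195A are skipped. If every code point in a block is wanted, looping over a hexadecimal range such as range(0x1950, 0x1960) and calling chr(x) directly may be simpler.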