
Iterating over Unicode Characters


I wanted to loop over Unicode characters in Python like this:

hex_list = "0123456789abcdef"
for _1 in hex_list:
    for _2 in hex_list:
        for _3 in hex_list:
            for _4 in hex_list:
                my_char = r"\u" + _1 + _2 + _3 + _4
                print(my_char)

As expected this printed out:

\u0000
\u0001
...
\uffff

Then I tried to change the code above to print the corresponding characters instead of the escape sequences:

hex_list = "0123456789abcdef"
for _1 in hex_list:
    for _2 in hex_list:
        for _3 in hex_list:
            for _4 in hex_list:
                my_char = r"\u" + _1 + _2 + _3 + _4
                eval("print(my_char)")

But this outputs the same as the code before.

hex_list = "0123456789abcdef"
for _1 in hex_list:
    for _2 in hex_list:
        for _3 in hex_list:
            for _4 in hex_list:
                eval("print(" + r"\u" + f"{_1}{_2}{_3}{_4})")

And something like this raises the following error message:

eval("print(" + r"\u" + f"{_1}{_2}{_3}{_4})")
  File "<string>", line 1
    print(\u0000)
                ^
SyntaxError: unexpected character after line continuation character

What would make this code work as intended?


Solution

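  • Python strings are Unicode already. Unicode isn't some kind of escape sequence; it's a standard that assigns every character a numeric code point, and encodings such as UTF-8 map those code points to bytes. The \uXXXX escape is only a notation that the parser resolves inside string literals in source code, which is why assembling one from pieces at runtime yields a backslash followed by ordinary text.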

    Given that fact, you can use chr to convert a Unicode code point to a string containing that character, e.g. print(chr(1081)). As the function's docs say:

    Return the string representing a character whose Unicode code point is the integer i. For example, chr(97) returns the string 'a', while chr(8364) returns the string '€'. This is the inverse of ord().

    The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16). ValueError will be raised if i is outside that range.
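
Applied to the question's approach, the four hex digits can be parsed with int(..., 16) and passed to chr, so no eval is needed. A minimal sketch; the helper name char_from_hex is made up for illustration and is not part of the standard library:

```python
def char_from_hex(hex_digits: str) -> str:
    """Return the character whose code point is written as a hex string.

    char_from_hex is a hypothetical helper used only for this example.
    """
    return chr(int(hex_digits, 16))


print(char_from_hex("0061"))                # a
print(char_from_hex("20ac") == "\u20ac")    # True: same character as the literal escape
print(f"U+{ord('a'):04X}")                  # U+0061 (ord is the inverse of chr)
```

Inside the nested loops from the question, this would presumably be called as char_from_hex(_1 + _2 + _3 + _4).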

    A simple loop can generate all valid characters. Actually printing them is another matter:

    for i in range(0, 1114112):
        print(chr(i))

    Running this on my machine eventually fails with

    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed

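    That value couldn't be converted into a form that can be printed on my terminal, which uses UTF-8. Code points U+D800 through U+DFFF are surrogates, reserved for UTF-16 pairs, and a lone surrogate has no valid UTF-8 encoding.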
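
One way around the error is to skip the surrogate code points U+D800 through U+DFFF while looping. The following is a minimal sketch, assuming a terminal that accepts UTF-8; the generator name iter_scalar_chars is hypothetical:

```python
import sys

SURROGATES = range(0xD800, 0xE000)  # U+D800..U+DFFF, reserved for UTF-16 pairs


def iter_scalar_chars(start: int = 0, stop: int = sys.maxunicode + 1):
    """Yield every character in [start, stop) except lone surrogates.

    iter_scalar_chars is a hypothetical helper name used for illustration.
    """
    for code_point in range(start, stop):
        if code_point in SURROGATES:
            continue  # chr() accepts these, but they cannot be encoded as UTF-8
        yield chr(code_point)


# Print a small, safe slice rather than every character.
for ch in iter_scalar_chars(0x41, 0x47):
    print(f"U+{ord(ch):04X} {ch}")
```

Control characters and unassigned code points are still yielded, so some lines may show nothing or a placeholder glyph, depending on the terminal and font.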