I am trying to generate Unicode characters using their escape sequences dynamically in Python. Specifically, I want to generate strings like '\u1950', '\u1951', etc. However, I am facing issues due to escape characters or Unicode errors.
What I Tried:
Attempt 1 (Using a raw string with concatenation):
data = []
da = r'\u' # Raw string prefix
for x in range(1950, 2025):
    dat = da + str(x)
    data.append(dat)
print(data)
Output:
['\\u1950', '\\u1951', '\\u1952', ...] # Backslashes are doubled
This is not what I expected. Instead of '\u1950', I am getting ['\\u1950', '\\u1951', ...], where each entry is treated as a normal string, not an actual Unicode escape sequence.
Attempt 2 (Direct String Concatenation – Causes SyntaxError):
box = []
for data in range(1950, 2025):
    string = '\u' + str(data) # This raises an error
    box.append(string)
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape
Using print()
[print(r'\u' + str(x)) for x in range(1950, 2025)]
output:
\u1950
\u1951
\u1952
But when I try to store these values in a list and print the list, they appear with extra backslashes:
data = []
da = r'\u'
for x in range(1950, 2025):
    data.append(da + str(x))
print(data)
output:
['\\u1950', '\\u1951', '\\u1952', ...]
You have:
['\\u1950', '\\u1951', '\\u1952', ...]
When you print a list containing strings like these, Python shows the repr() of each string rather than its str() form. That is fine, and how it should be: the repr is meant to be something you can read, type back into Python source code, and get the same object recreated. Each string still contains a single backslash; the doubled backslash exists only in the displayed form.
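A quick way to see this is to compare the length, str and repr of one element. This is only a small sketch; the variable name s is made up for the illustration:

```python
s = r'\u' + str(1950)   # same construction as in the question

print(len(s))    # 6 characters: one backslash, 'u', and four digits
print(s)         # str() form, what print() shows for a single string
print(repr(s))   # repr() form, with the backslash escaped
print([s])       # a list display reuses repr() for each element
```

The last two lines show the doubled backslash, while the length confirms that only one backslash is actually stored.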
If you want output that can be copied and pasted as Python source to give a list of the Unicode characters themselves, then you need a representation with single backslashes, as you are asking. The way to get it is to build a custom printout of the list and its elements, because Python's automatic conversion of a list to its representation always uses the repr of each element.
def custom_repr(sequence):
    # Use the str() form of each item (single backslash) instead of repr()
    return "[{}]".format(", ".join(f"'{item}'" for item in sequence))

# and then, using your last code snippet:
data = []
da = r'\u'
for x in range(1950, 2025):
    data.append(da + str(x))
print(custom_repr(data))
What "custom_repr" does here is use the "str" version of each item in the data sequence and enclose it in single quotes, adding the ", " separators and the surrounding [] pair. The result is plain text that can be printed or stored in a file; when it is later used as Python code, each '\uXXXX' element will be parsed as a single Unicode character, which is probably your goal.
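To check that text produced this way really parses back into the characters, one option is to feed it to ast.literal_eval, which evaluates a Python literal held in a string. The sketch below repeats custom_repr so that it runs on its own, and uses a shortened three-element list rather than the full range:

```python
import ast

def custom_repr(sequence):
    # same helper as above: str() of each item, wrapped in single quotes
    return "[{}]".format(", ".join(f"'{item}'" for item in sequence))

data = [r'\u' + str(x) for x in range(1950, 1953)]
text = custom_repr(data)
print(text)                       # ['\u1950', '\u1951', '\u1952']

chars = ast.literal_eval(text)    # parse the text as Python source
print([len(c) for c in chars])    # [1, 1, 1]: each item is now one character
print(chars == ['\u1950', '\u1951', '\u1952'])  # True
```

Note that this simple quoting would break for items that themselves contain a single quote; that case does not arise with the strings in the question.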
Other than that, it is also worth mentioning the "unicode_escape" codec, which can convert the plain representation of a list, like your first output, into the properly parsed Unicode characters.
Depending on how you intend to use this output, this can be more convenient: there is no need to customize the output, you just parse the backslash escapes that the default list repr prints as double backslashes:
my_list = ['\\u2020', '\\u2021', '\\u2022']
my_chars = [char.encode("latin1").decode("unicode_escape") for char in my_list]
print(my_chars)
# outputs: ['†', '‡', '•']
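If the end goal is the characters themselves rather than text to paste back into source, the escape text can be skipped entirely. As a sketch that goes a little beyond the snippet above: chr() builds a character from its code point, and since the four digits after \u are hexadecimal, int(..., 16) reads them that way. Treating the numbers from the question's range as hex digits is an assumption about what was intended there:

```python
# Read the decimal digits from the question's range as hex digits,
# matching how '\u1950' would be interpreted in source code.
chars = [chr(int(str(x), 16)) for x in range(1950, 2025)]

# Same result via the codec route shown above, for comparison.
decoded = [(r'\u' + str(x)).encode("latin1").decode("unicode_escape")
           for x in range(1950, 2025)]

print(chars[0] == '\u1950')   # True
print(chars == decoded)       # True
```

Keep in mind that counting in decimal skips the code points whose hex form contains a letter (for example 195A to 195F); looping over range(0x1950, 0x2025) with chr(x) would cover every code point in between.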