I am trying to replace the space between two tokens written in the Arabic alphabet with a ZWNJ but what the function returns is not decoded properly on the screen:
>>> nm.normalize("رشته ها")
'رشته\u200cها'
\u200c should be rendered as a half-space placed between 'رشته' and 'ها' here, but it gets messed up like that. I am using Python 3.3.3.
The function returned a string object with the \u200c character as part of it, but the interactive prompt shows you the representation (the repr()) of that string rather than the string itself. The \uxxxx syntax is used to make the representation useful as a debugging value; you can copy that representation and paste it back into Python to get the exact same value.
In other words, the function worked exactly as advertised; the space was indeed replaced by a U+200C ZERO WIDTH NON-JOINER codepoint.
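As a minimal sketch of that round trip: the string literal below simply stands in for whatever nm.normalize() returned, and the variable names are only for illustration. It compares the representation the prompt echoes with the underlying value.

```python
import ast

# Stand-in for the value returned by nm.normalize("رشته ها")
s = "رشته\u200cها"

# repr() is what the interactive prompt echoes; it escapes
# non-printable characters such as U+200C.
shown = repr(s)
print(shown)

# Parsing the representation as a literal gives back an equal string,
# so the escape is only a display form, not extra characters in the value.
round_tripped = ast.literal_eval(shown)
print(round_tripped == s)
```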
If you wanted to write the string to your terminal or console, use print():
print(nm.normalize("رشته ها"))
Demo:
>>> result = 'رشته\u200cها'
>>> len(result)
7
>>> result[4]
'\u200c'
>>> print(result)
رشتهها
You can see that character 5 (index 4) is a single character here, not 6 separate characters.
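If you want to confirm which codepoint sits at that index, the standard unicodedata module can name it. This is only a sketch; result is the same stand-in string as in the demo, and zwnj_name is a name picked for illustration.

```python
import unicodedata

result = "رشته\u200cها"

# List every codepoint with its index, hex value and official Unicode name.
for index, ch in enumerate(result):
    print(index, "U+%04X" % ord(ch), unicodedata.name(ch))

# The character at index 4 is the joiner itself.
zwnj_name = unicodedata.name(result[4])
print(zwnj_name)
```

Whether the half-space then looks right on screen depends on the terminal and font handling Arabic shaping; the string itself is already correct.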