python unicode normalization python-3.3 unicode-normalization

ZWNJ not shown properly in python 3.3

I am trying to replace the space between two tokens written in the Arabic alphabet with a ZWNJ, but what the function returns is not displayed properly on the screen:

>>> nm.normalize("رشته ها")
'رشته\u200cها'

\u200c should be rendered as a half-space placed between 'رشته' and 'ها' here, but it shows up as the escape sequence instead. I am using Python 3.3.3.

Solution

The function returned a string object with the \u200c character as part of it, but the interactive interpreter shows you the representation (the repr()) of that string. The \uxxxx syntax is used to make the representation useful as a debugging value: you can copy that representation, paste it back into Python, and get the exact same value.
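To make the difference concrete, here is a minimal sketch that compares the representation with the string itself. It uses a hard-coded literal in place of the nm.normalize() result, since that function's implementation is not shown in the question:

```python
import ast

# Stand-in for the value returned by nm.normalize("رشته ها")
text = 'رشته\u200cها'

as_repr = repr(text)   # what the interactive prompt echoes
as_text = str(text)    # what print() writes

# The repr contains a literal backslash escape; the string contains
# the actual U+200C code point.
print('\\u200c' in as_repr)   # True
print('\u200c' in as_text)    # True

# Evaluating the repr as a Python literal gives back the same string.
round_tripped = ast.literal_eval(as_repr)
print(round_tripped == text)  # True
```

The escape appears only in the repr because U+200C is a non-printable format character, so Python escapes it there while leaving the Arabic letters as they are.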

In other words, the function worked exactly as advertised; the space was indeed replaced by a U+200C ZERO WIDTH NON-JOINER codepoint.

If you want to write the string itself to your terminal or console, use print():

print(nm.normalize("رشته ها"))

Demo:

>>> result = 'رشته\u200cها'
>>> len(result)
7
>>> result[4]
'\u200c'
>>> print(result)
رشته‌ها

You can see that the fifth character (index 4) is a single U+200C character here, not the six separate characters \, u, 2, 0, 0 and c.
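If you want to double-check which code points a string really contains, the standard unicodedata module can name each one. This is only a sketch using the same literal as the demo above; the variable names are illustrative:

```python
import unicodedata

result = 'رشته\u200cها'

# List every code point with its index, hex value and official name.
for index, char in enumerate(result):
    print(index, 'U+{:04X}'.format(ord(char)), unicodedata.name(char))

# Index 4 is the joiner; 'Cf' is the Unicode "format" category, which
# is why repr() escapes it instead of showing it directly.
zwnj_name = unicodedata.name(result[4])
zwnj_category = unicodedata.category(result[4])
print(zwnj_name)       # ZERO WIDTH NON-JOINER
print(zwnj_category)   # Cf
```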