Tags: python, unicode, normalization, python-3.3, unicode-normalization

ZWNJ not shown properly in Python 3.3


I am trying to replace the space between two tokens written in the Arabic alphabet with a ZWNJ (zero-width non-joiner), but the value the function returns is not displayed properly on screen:

>>> nm.normalize("رشته ها")
'رشته\u200cها'

\u200c should be rendered as a half-space between 'رشته' and 'ها' here, but it shows up as the literal escape sequence instead. I am using Python 3.3.3.


Solution

  • The function returned a string object that contains the \u200c character, but the interactive interpreter shows you the representation (the repr()) of that string. The \uxxxx escape syntax makes the representation useful as a debugging value: you can copy it, paste it back into Python, and get the exact same value.

    In other words, the function worked exactly as advertised; the space was indeed replaced by a U+200C ZERO WIDTH NON-JOINER codepoint.

    If you want to write the string itself to your terminal or console, use print():

    print(nm.normalize("رشته ها"))
    

    Demo:

    >>> result = 'رشته\u200cها'
    >>> len(result)
    7
    >>> result[4]
    '\u200c'
    >>> print(result)
    رشته‌ها
    

    You can see that the fifth character (index 4) is a single character here, not six separate characters (\, u, 2, 0, 0, c).
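If you want to double-check which code points a string really contains without relying on how your terminal renders them, something like the following sketch may help. The helper name describe() is made up for this illustration, and the string literal stands in for whatever nm.normalize() returns in the question; only the standard-library unicodedata module is used.

```python
import unicodedata


def describe(text):
    """Return one (index, 'U+XXXX', name) tuple per code point in text.

    Hypothetical helper for illustration; not part of any library.
    """
    return [
        (i, "U+{:04X}".format(ord(ch)), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
    ]


# Stand-in for the value returned by nm.normalize() in the question.
result = "رشته\u200cها"

# Print only ASCII descriptions, so the output does not depend on
# whether the terminal can display Arabic script or invisible characters.
for index, codepoint, name in describe(result):
    print(index, codepoint, name)

# repr() escapes the invisible ZWNJ as backslash-u-200c,
# while the string itself still holds the real character.
print("\\u200c" in repr(result))
print("\u200c" in result)
```

The row for index 4 should name the ZERO WIDTH NON-JOINER, which confirms that the escape in the echoed value is only a display convention of repr() and not six literal characters in the string.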