Remove Right-to-left character \u200f in Python (Hebrew)


I have a dataframe, and I want the unique strings of a specific column. The strings are in Hebrew.

Because I'm using a pandas DataFrame, I wrote: all_names = history.name.unique() (history is the DataFrame with a name column).

I get strange duplicates that differ only by the \u200f character, for example ערן and a second copy prefixed with \u200f:

all_names
array(['\u200fערן', 'ערן',  ...., None], dtype=object)

How can I remove these characters? (From the original data frame)


Solution

  • You can clean up the name strings by removing every character that is neither a word character nor whitespace (Unicode-wise), applying a re.sub-based function to all the values in the name column.

    For example (assuming Python 3, which handles Unicode properly):

    >>> import re
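    >>> # raw string avoids invalid-escape warnings; assign back to update the data frame
    >>> history['name'] = history.name.apply(lambda s: s and re.sub(r'[^\w\s]', '', s))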
    

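    Here \w matches all Unicode word characters (letters, digits, and the underscore) and \s matches all Unicode whitespace characters. Note that this filter also strips punctuation such as hyphens and apostrophes, which may matter if any names contain them.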

    By the way, the \u200f (aka the RIGHT-TO-LEFT MARK) that's bothering you is in the Unicode codepoint category "Other, Format":

    >>> import unicodedata
    >>> unicodedata.name('\u200f')
    'RIGHT-TO-LEFT MARK'
    >>> unicodedata.category('\u200f')
    'Cf'
    

    so you can be sure the filter above removes it: a Cf character is neither a word character nor whitespace.
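
If removing all punctuation is too aggressive, a narrower alternative is to drop only the characters in the "Cf" category. The following is a minimal sketch of that idea, not part of the original answer: the helper name strip_format_chars and the sample list are hypothetical, and it uses only the standard library so it can be tried without a data frame.

```python
import unicodedata


def strip_format_chars(s):
    """Remove Unicode 'Cf' (Other, Format) characters such as U+200F.

    Non-string values (None, NaN) are returned unchanged, so the function
    can be applied to a column that contains missing values.
    """
    if not isinstance(s, str):
        return s
    return ''.join(ch for ch in s if unicodedata.category(ch) != 'Cf')


# Hypothetical sample mirroring the duplicates described in the question.
names = ['\u200fערן', 'ערן', None]
cleaned = [strip_format_chars(n) for n in names]

# After cleaning, both spellings compare equal, so only one survives.
unique_names = set(cleaned)
```

With the data frame from the question, the same function could be applied as history['name'] = history['name'].apply(strip_format_chars) before calling unique(). Unlike the \w/\s filter, this leaves hyphens, apostrophes, and other visible punctuation in place.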