python python-3.x pandas unique right-to-left

Remove Right-to-left character \u200f in Python (Hebrew)

I have a dataframe, and I want the unique strings of a specific column. The strings are in Hebrew.

Since I'm using a pandas DataFrame, I wrote all_names = history.name.unique(), where history is the DataFrame and name is one of its columns.

I get strange duplicates that differ only by the \u200f character, for example ערן and a second copy of it prefixed with \u200f:

all_names
array(['\u200fערן', 'ערן',  ...., None], dtype=object)

How can I remove these characters from the original DataFrame?

Solution

You can clean up the name strings by stripping every character that is neither a word character nor whitespace (in the Unicode sense), applying a re.sub-based function to each value in the name column.

For example (assuming Python 3, where str patterns are Unicode-aware by default):

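>>> import re
>>> # `s and ...` leaves None (and empty strings) untouched; assign the result back to the column
>>> history['name'] = history.name.apply(lambda s: s and re.sub(r'[^\w\s]', '', s))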

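\w matches all Unicode word characters (letters, digits, and the underscore), and \s matches all Unicode whitespace characters. Everything else is removed, which includes punctuation such as hyphens and apostrophes.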
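To see the filter in isolation, here is a minimal sketch that applies the same pattern to a plain Python list instead of a DataFrame column. The helper name clean_name and the sample list are made up for illustration; only the regular expression comes from the answer.

```python
import re

# Matches any character that is neither a word character nor whitespace.
_NOT_WORD_OR_SPACE = re.compile(r'[^\w\s]')

def clean_name(value):
    """Strip non-word, non-whitespace characters; pass None through unchanged."""
    if value is None:
        return None
    return _NOT_WORD_OR_SPACE.sub('', value)

# Hypothetical sample mirroring the values shown in the question.
names = ['\u200fערן', 'ערן', None]

cleaned = [clean_name(n) for n in names]

# dict.fromkeys keeps the first occurrence of each value, in order.
unique_names = list(dict.fromkeys(cleaned))

print(unique_names)
print(clean_name('a-b'))  # the hyphen is neither \w nor \s, so it is dropped
```

After cleaning, the two spellings of the name compare equal, so only one of them survives de-duplication.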

By the way, the \u200f character (the RIGHT-TO-LEFT MARK) that's bothering you belongs to the Unicode general category "Other, Format" (Cf):

>>> import unicodedata
>>> unicodedata.name('\u200f')
'RIGHT-TO-LEFT MARK'
>>> unicodedata.category('\u200f')
'Cf'

Characters in the Cf category are neither word characters nor whitespace, so you can be sure the filter above removes it.
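If you would rather drop only invisible formatting characters and keep punctuation intact, the category lookup itself can act as the filter. The following is a sketch of that alternative, not part of the regex approach above; strip_format_chars is a hypothetical helper name. Keep in mind that Cf also covers characters such as the zero-width joiner, which may be meaningful in some scripts.

```python
import unicodedata

def strip_format_chars(value):
    """Remove characters in Unicode category Cf; pass None through unchanged."""
    if value is None:
        return None
    return ''.join(ch for ch in value if unicodedata.category(ch) != 'Cf')

# Hypothetical sample mirroring the values shown in the question.
names = ['\u200fערן', 'ערן', None]
stripped = [strip_format_chars(n) for n in names]

print(stripped[0] == stripped[1])  # True: the mark is gone, the names now match
print(strip_format_chars('a-b'))   # a-b (punctuation is preserved)
```

Assuming the history DataFrame from the question, this could be applied with history['name'] = history['name'].apply(strip_format_chars). Missing values stored as None pass through, while NaN floats would need separate handling.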