I have a dataframe, and I want the unique strings of a specific column. The strings are in Hebrew.
Because I'm using a pandas DataFrame, I wrote:
all_names = history.name.unique()
(history is the DataFrame, and name is one of its columns.)
I get strange duplicates that differ only by a \u200f character. For example, there is ערן and another copy of it prefixed with \u200f:
all_names
array(['\u200fערן', 'ערן', ...., None], dtype=object)
How can I remove these characters? (From the original data frame)
You can clean up your name strings by removing every character that is neither a word character nor whitespace (in the Unicode sense). To do that, apply a re.sub-based function to all the values in the name column.
For example (assuming Python 3, which handles Unicode properly):
>>> import re
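>>> # raw string keeps \w and \s intact; "s and ..." passes None through unchanged
>>> history['name'] = history.name.apply(lambda s: s and re.sub(r'[^\w\s]', '', s))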
Here \w matches all Unicode word characters (letters, digits, and the underscore) and \s matches all Unicode whitespace characters.
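To see what this filter does without needing a DataFrame, here is a minimal sketch on plain strings. The helper name clean_name and the hyphenated sample name are made up for illustration; the other values come from the question. Note that the filter also strips ordinary punctuation such as hyphens, which may or may not be what you want for names.

```python
import re

def clean_name(s):
    # Hypothetical helper: same logic as the lambda above.
    # Falsy values (None, '') are returned unchanged.
    if not s:
        return s
    # Drop everything that is neither a word character nor whitespace.
    return re.sub(r'[^\w\s]', '', s)

names = ['\u200fערן', 'ערן', None]
cleaned = [clean_name(n) for n in names]

# The two spellings collapse into one after cleaning.
unique_names = set(cleaned)

# Side effect: punctuation inside a name is removed too.
hyphenated = clean_name('Ben-David')
```

After cleaning, both variants of the name compare equal, so unique() would report it only once.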
By the way, the \u200f (aka the RIGHT-TO-LEFT MARK) that's bothering you is in the Unicode general category "Other, Format" (Cf):
>>> import unicodedata
>>> unicodedata.name('\u200f')
'RIGHT-TO-LEFT MARK'
>>> unicodedata.category('\u200f')
'Cf'
A Cf character is neither a word character nor whitespace, so you can be sure the filter above removes it.
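If you would rather keep punctuation and remove only invisible formatting characters, one possible alternative (not part of the approach above, just a sketch) is to filter on the Cf category directly. The function name strip_format_chars is an assumption made for this example.

```python
import unicodedata

def strip_format_chars(s):
    # Hypothetical helper: leave missing values (None) untouched.
    if s is None:
        return s
    # Keep every character except those in category Cf ("Other, Format"),
    # which includes RIGHT-TO-LEFT MARK (\u200f).
    return ''.join(ch for ch in s if unicodedata.category(ch) != 'Cf')

with_mark = strip_format_chars('\u200fערן')
with_hyphen = strip_format_chars('Ben-David')
```

With the DataFrame from the question, this could be applied as history['name'] = history['name'].apply(strip_format_chars).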