This question has been asked many times, but none of the solutions worked for me.
The data frame was pulled from a third-party Excel file with 'UTF-8' encoding:
pd.read_excel(file, encoding = 'UTF-8', sheet_name = worksheet)
But I still have characters like " ’ " instead of " ' " in some lines.
At the top of the script I have the following:
# -*- encoding: utf-8 -*-
The following line does not raise an error, but it does not change anything in the data:
df['text'] = df['text'].str.replace("’","'")
I also tried a dictionary-based approach (which does essentially the same thing), like
repl_dict = {"’": "'"}
for k, v in repl_dict.items():
    df.loc[df.text.str.contains(k), 'text'] = df.text.str.replace(pat=k, repl=v)
and tried many other approaches including regex, but nothing worked.
When I tried:
def replace_apostrophy(text):
    return text.replace("’","'")
df['text'] = df['text'].apply(lambda x: replace_apostrophy(x))
I received the following error - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
When I tried:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text))
I got the following error - TypeError: normalize() argument 2 must be unicode, not float
The text also contains emojis, which I need to count afterwards.
Can someone give me some good advice?
Thank you very much!
I found a solution myself. It may look clumsy, but it works perfectly in my case:
df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))
I had to replace the NaN values before running that code.
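For anyone following along, here is a minimal sketch of the NaN handling combined with the normalize/encode step. It assumes Python 3 (where `str` is already Unicode) and assumes that empty Excel cells arrive as float NaN and can safely become empty strings. The helper name `to_ascii_escaped` is made up for illustration and is not from the original code.

```python
import unicodedata


def to_ascii_escaped(value):
    """NFKD-normalize one cell and escape every non-ASCII character.

    Assumption: anything that is not a string (for example float('nan')
    from an empty cell) is treated as missing and becomes ''.
    """
    if not isinstance(value, str):
        return ""
    normalized = unicodedata.normalize("NFKD", value)
    # backslashreplace writes each non-ASCII character as a \uXXXX-style escape,
    # so the result contains ASCII characters only.
    return normalized.encode("ascii", "backslashreplace").decode("ascii")
```

With pandas this could be applied as df["text"] = df["text"].apply(to_ascii_escaped). Note that emojis are also turned into escape sequences by this step, so counting them later means counting the escapes.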
The normalize/encode step leaves only ASCII characters, which can then be replaced easily:
def replace_apostrophy(text):
    return text.replace("a\u0302\u20acTM","'")
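A possible alternative, sketched here under assumptions rather than tested on the original file: the sequence "’" is what the UTF-8 bytes of the curly apostrophe (U+2019) look like when they are misread as Windows-1252. Assuming Python 3 and assuming that is what happened to the data, the three garbled characters can be replaced directly on the Unicode text, which skips the ASCII round trip and leaves emojis untouched. `fix_apostrophe` and `MOJIBAKE_APOSTROPHE` are hypothetical names.

```python
# The three characters that display as the garbled apostrophe, written as escapes
# so the example does not depend on the encoding of this source file.
MOJIBAKE_APOSTROPHE = "\u00e2\u20ac\u2122"

# Sanity check of the assumption: U+2019 encoded as UTF-8 and misread as cp1252
# yields exactly those three characters.
assert "\u2019".encode("utf-8").decode("cp1252") == MOJIBAKE_APOSTROPHE


def fix_apostrophe(text):
    """Replace the garbled sequence with a plain apostrophe.

    Non-string values (for example NaN from empty cells) are returned unchanged.
    """
    if not isinstance(text, str):
        return text
    return text.replace(MOJIBAKE_APOSTROPHE, "'")
```

With pandas this could be applied as df["text"] = df["text"].apply(fix_apostrophe). It only handles this one sequence; other garbled characters would need their own entries.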
Hope this helps someone.