pandas python-2.7 character-encoding non-ascii-characters

I need to replace non-ASCII characters in a pandas data frame column in Python 2.7

This question has been asked many times, but none of the solutions worked for me.

The data frame was read from a third-party Excel file with 'UTF-8' encoding:

pd.read_excel(file, encoding = 'UTF-8', sheet_name = worksheet)

But I still have characters like " â€™ " instead of " ' " in some lines.

At the top of the script I have the following:

# -*- encoding: utf-8 -*-

The following line does not raise an error, but it does not change anything in the data:

df['text'] = df['text'].str.replace("â€™","'")

I also tried a dictionary-based approach (the same idea at its core), like

    repl_dict = {"â€™": "'"}
    for k,v in repl_dict.items():
        df.loc[df.text.str.contains(k), 'text'] = df.text.str.replace(pat=k, repl=v)

and tried many other approaches including regex, but nothing worked.

When I tried:

def replace_apostrophy(text):
    return text.replace("â€™","'")
df['text'] = df['text'].apply(lambda x: replace_apostrophy(x))

I received the following error - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
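
A likely cause of this error, assuming the column holds unicode objects: in Python 2.7 the literal "â€™" is a byte string (its first byte is 0xc3 in a UTF-8 source file), and calling .replace() on a unicode value makes Python decode that byte string with the ascii codec first. A minimal sketch of the same replacement written with unicode literals is below. The name replace_garbled_apostrophe and the constant are hypothetical, and the code points are spelled as escapes so the source stays pure ASCII:

```python
# The three characters of the garbled apostrophe (a-circumflex, euro sign,
# trade mark sign), written as escapes so the pattern is a unicode object on
# Python 2.7 and a str on Python 3.
GARBLED_APOSTROPHE = u"\u00e2\u20ac\u2122"


def replace_garbled_apostrophe(text):
    """Hypothetical variant of replace_apostrophy that uses unicode literals."""
    return text.replace(GARBLED_APOSTROPHE, u"'")


sample = u"it" + GARBLED_APOSTROPHE + u"s"
print(replace_garbled_apostrophe(sample))  # it's
```

This sketch expects text input only, so missing values (NaN) would still need to be handled before applying it to a column.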

When I tried:

df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text))

I got the following error - TypeError: normalize() argument 2 must be unicode, not float

The text also contains emojis, which I need to count afterwards.

Can someone give me some good advice?

Thank you very much!

Solution

I found a solution myself. It may look clumsy, but it works perfectly in my case:

    import unicodedata
    df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))

I had to replace NaN values before running that code, because unicodedata.normalize() raises a TypeError when it is given a float such as NaN.
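
The normalise-and-escape step can also be sketched as a helper that skips non-text values, so NaN does not have to be replaced first. This is a minimal sketch under assumptions: to_ascii_escaped is a hypothetical name, and the final .decode("ascii") is an addition that keeps the result as text rather than bytes.

```python
import unicodedata

# `unicode` on Python 2.7, `str` on Python 3.
text_type = type(u"")


def to_ascii_escaped(value):
    """Hypothetical helper: NFKD-normalise text and escape non-ASCII characters.

    Non-text values (NaN is a float, None) are returned unchanged, so the
    helper can be applied to a column that still has missing values.
    """
    if not isinstance(value, text_type):
        return value
    decomposed = unicodedata.normalize("NFKD", value)
    # backslashreplace turns each remaining non-ASCII character into a
    # literal backslash escape instead of raising an error.
    return decomposed.encode("ascii", "backslashreplace").decode("ascii")


# The garbled apostrophe from the question, spelled with escapes.
garbled = u"\u00e2\u20ac\u2122"
print(to_ascii_escaped(u"it" + garbled + u"s"))
print(to_ascii_escaped(float("nan")))
```

It could be applied with df["text"].apply(to_ascii_escaped), assuming the column contains unicode strings and NaN only.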

That operation leaves only ASCII characters, which can then be replaced easily:

    def replace_apostrophy(text):
        return text.replace("a\u0302\u20acTM","'")
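
A possible alternative, sketched under assumptions rather than tested on this data: the sequence â€™ is what the UTF-8 bytes of the right single quotation mark (U+2019) look like when they are decoded as Windows-1252, so the damage can sometimes be undone by reversing that step. This is a different technique from the escape-and-replace approach above, and fix_mojibake is a hypothetical name. It leaves emojis as real characters instead of backslash escapes.

```python
# `unicode` on Python 2.7, `str` on Python 3.
text_type = type(u"")


def fix_mojibake(value):
    """Hypothetical helper: undo UTF-8 text that was decoded as Windows-1252.

    The input is returned unchanged when it is not text, or when the round
    trip fails, which happens for text that was never mis-decoded.
    """
    if not isinstance(value, text_type):
        return value
    try:
        return value.encode("cp1252").decode("utf-8")
    except UnicodeError:
        # Covers both UnicodeEncodeError and UnicodeDecodeError.
        return value


right_quote = u"\u2019"
garbled = right_quote.encode("utf-8").decode("cp1252")
print(garbled == u"\u00e2\u20ac\u2122")      # True
print(fix_mojibake(garbled) == right_quote)  # True
```

The repair is all-or-nothing per string: if a value mixes garbled and correct non-ASCII characters, or contains one of the few bytes that Windows-1252 leaves undefined, the helper returns it untouched.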

Hope this helps someone.