pandas, python-2.7, character-encoding, non-ascii-characters

I need to replace non-ASCII characters in a pandas data frame column in Python 2.7


This question has been asked many times, but none of the solutions worked for me.

The data frame was loaded from a third-party Excel file with 'UTF-8' encoding:

    pd.read_excel(file, encoding='UTF-8', sheet_name=worksheet)

But I still have characters like " ’ " instead of " ' " in some lines.
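
For context, a sequence like " ’ " is what typically appears when the UTF-8 bytes of a typographic apostrophe (U+2019) are decoded as Windows-1252 somewhere upstream. That this is what happened to the file is an assumption, and garble is a hypothetical helper used only to reproduce the effect:

```python
# -*- coding: utf-8 -*-
def garble(text):
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")

# U+2019 is the typographic apostrophe; escapes keep this snippet pure ASCII.
garbled_apostrophe = garble(u"\u2019")
# garbled_apostrophe now holds the three characters U+00E2, U+20AC, U+2122
```

If that is the cause, the garbled text is made of ordinary (non-ASCII) unicode characters, so any replacement has to target those three characters rather than the apostrophe itself.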

At the top of the script I have the following:

    # -*- encoding: utf-8 -*-

The following line does not throw errors, but it does not change anything in the data:

    df['text'] = df['text'].str.replace("’","'")

I also tried a dictionary-based approach (which relies on the same replace call), like:

    repl_dict = {"’": "'"}
    for k, v in repl_dict.items():
        df.loc[df.text.str.contains(k), 'text'] = \
            df.text.str.replace(pat=k, repl=v)

and tried many other approaches including regex, but nothing worked.

When I tried:

    def replace_apostrophy(text):
        return text.replace("’","'")
    df['text'] = df['text'].apply(lambda x: replace_apostrophy(x))

I received the following error - UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
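
A likely reading of this error, sketched under the assumption that the column holds unicode objects: in Python 2 a plain quoted literal with non-ASCII characters is a byte string, so unicode.replace first tries to decode it as ASCII and fails on byte 0xc3, the first UTF-8 byte of "â". Writing both arguments as unicode literals avoids that implicit decode. The helper ascii_decode_fails is hypothetical and only illustrates the failure:

```python
# The garbled apostrophe, spelled with escapes: U+00E2, U+20AC, U+2122.
GARBLED = u"\u00e2\u20ac\u2122"

# What a non-unicode literal would contain in a UTF-8 encoded source file.
pattern_bytes = GARBLED.encode("utf-8")
first_byte = pattern_bytes[:1]  # the byte named in the error message

def ascii_decode_fails(raw):
    """Return True if decoding raw bytes as ASCII raises UnicodeDecodeError."""
    try:
        raw.decode("ascii")
    except UnicodeDecodeError:
        return True
    return False

def replace_apostrophe(text):
    """Replace the garbled sequence using unicode literals on both sides."""
    return text.replace(GARBLED, u"'")
```

The same unicode pattern could be passed to df['text'].str.replace, though whether that alone fixes the silent no-op above has not been verified against the original data.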

When I tried:

    df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text))

I got the following error - TypeError: normalize() argument 2 must be unicode, not float
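
The float in this message is most likely NaN, which pandas uses for empty cells (an assumption about this particular data). A minimal sketch that normalizes only text values and passes everything else through unchanged; normalize_if_text and TEXT_TYPE are illustrative names:

```python
import unicodedata

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3

def normalize_if_text(value, form="NFKD"):
    """Apply Unicode normalization to text; return other values untouched."""
    if isinstance(value, TEXT_TYPE):
        return unicodedata.normalize(form, value)
    return value
```

It could be applied with df["text"].apply(normalize_if_text), which leaves NaN cells as they are instead of raising.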

The text also contains emojis, which I need to count afterwards.
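
For the emoji count, one rough option is to count characters that fall in a few well-known code point ranges. This is only a sketch: the ranges are an approximation rather than the full Unicode emoji definition, count_emoji is a hypothetical helper, and it assumes each emoji is a single code point (Python 3, or a wide Python 2 build), so multi-character emoji sequences are counted more than once:

```python
# Approximate ranges; they also include some symbols that are not emoji.
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport, newer symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
]

def count_emoji(text):
    """Count characters whose code point lies in one of EMOJI_RANGES."""
    return sum(
        1
        for ch in text
        if any(low <= ord(ch) <= high for low, high in EMOJI_RANGES)
    )
```

The count has to be taken before any step that converts the text to ASCII, since that step rewrites the emoji as escape sequences.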

Can someone give me some good advice?

Thank you very much!


Solution

  • I found a solution myself. It might look clumsy, but it works perfectly in my case:

        import unicodedata
        df["text"] = df["text"].apply(lambda text: unicodedata.normalize('NFKD', text).encode('ascii','backslashreplace'))
    

    I had to replace NaN values before running that code, because NaN is a float and normalize() only accepts unicode.

    That operation leaves only ASCII symbols, which can easily be replaced:

        # Python 2: "\u" is not an escape in a byte string, so this matches the literal backslash text
        def replace_apostrophy(text):
            return text.replace("a\u0302\u20acTM","'")
    

    Hope this helps someone.
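
The two steps of this answer can be sketched as one function that runs on both Python 2 and Python 3. This is an adaptation, not the exact code above: the extra .decode("ascii") is added so the result is text rather than bytes on Python 3, the pattern is written with doubled backslashes so it means the same literal text on both versions, and the names to_ascii_text and clean_text are hypothetical:

```python
import unicodedata

# Literal backslash text produced for the garbled apostrophe
# after NFKD decomposition and 'backslashreplace' encoding.
ESCAPED_APOSTROPHE = u"a\\u0302\\u20acTM"

def to_ascii_text(text):
    """Decompose with NFKD, then turn every non-ASCII character into an escape."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "backslashreplace").decode("ascii")

def clean_text(text):
    """Convert to ASCII-only text and restore the plain apostrophe."""
    return to_ascii_text(text).replace(ESCAPED_APOSTROPHE, u"'")
```

Assuming missing values are filled first, this could be applied as df["text"].fillna(u"").apply(clean_text).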