python pandas unicode

Python: remove unicode from dataframe


I will use a simple dataframe as an example:

import pandas as pd

data = {'A': ['公司\u3000', 'aaaa\uf505'], 'B': ['1', '2']}
df = pd.DataFrame(data)
print(df)


       A   B
0  公司   1
1  aaaa   2

I want to remove Unicode characters such as '\u3000' and '\uf505' from column A. However, the column may contain other such characters that I don't know about in advance, so removing them one by one is not ideal. Some rows contain Mandarin characters, and I have to keep those.

Filtering on the backslash character '\' did not work either:

df=df[~df['A'].str.contains(r'\\')]
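
A likely reason this filter changes nothing: in Python source, '\u3000' and '\uf505' are escape sequences that each produce a single character, so the stored strings contain no backslash to match. A minimal check, using one of the sample values from above:

```python
# The escape sequence \u3000 is one character (IDEOGRAPHIC SPACE), not six.
s = '公司\u3000'

print(len(s))      # 3
print('\\' in s)   # False: the string holds no literal backslash
print(ascii(s))    # ASCII-only view that makes the hidden character visible
```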

My expected result:

      A   B
0  公司  1
1  aaaa   2

Solution

  • Code

    df['A'] = df['A'].str.replace(r'[^\x00-\x7F\u4E00-\u9FFF]+', '', regex=True)
    

    df

        A       B
    0   公司  1
    1   aaaa    2
    

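    This pattern removes every character outside the ASCII range (\x00-\x7F) and the CJK Unified Ideographs block (\u4E00-\u9FFF), so Chinese characters (漢字) are preserved. Note that any other non-ASCII text, such as accented Latin letters, kana, or full-width punctuation, is removed as well.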
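
    If other scripts also need to survive, one alternative (not the approach used in the answer above) is to filter by Unicode general category instead of by code-point range. The sketch below uses the standard `unicodedata` module; the helper name `strip_noise` and the choice of categories in `DROP_CATEGORIES` are assumptions for illustration and should be adjusted to the data at hand.

```python
import unicodedata

import pandas as pd

# Assumed set of "noise" categories: control, format, private use,
# unassigned, and space/line/paragraph separators.
DROP_CATEGORIES = {"Cc", "Cf", "Co", "Cn", "Zs", "Zl", "Zp"}


def strip_noise(text):
    """Drop characters in DROP_CATEGORIES, but keep the plain ASCII space."""
    return "".join(
        ch for ch in text
        if ch == " " or unicodedata.category(ch) not in DROP_CATEGORIES
    )


df = pd.DataFrame({'A': ['公司\u3000', 'aaaa\uf505'], 'B': ['1', '2']})
df['A'] = df['A'].map(strip_noise)
print(df['A'].tolist())
```

    Here '\u3000' falls in category Zs (space separator) and '\uf505' in Co (private use), so both are dropped while the letters are kept. Because Cc is in the set, tabs and newlines are removed too.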