I will use a simple dataframe as an example:
import pandas as pd
data = {'A': ['公司\u3000', 'aaaa\uf505'], 'B': ['1', '2']}
df = pd.DataFrame(data)
print(df)
A B
0 公司 1
1 aaaa 2
I want to remove Unicode characters such as '\u3000' and '\uf505' from column A. However, the column may contain other such characters that I don't know about in advance, so removing them one by one is not ideal. Some rows contain Mandarin characters, and I have to keep those.
It also did not work when I tried to filter on the backslash '\':
df=df[~df['A'].str.contains(r'\\')]
My expected result:
A B
0 公司 1
1 aaaa 2
Code
df['A'] = df['A'].str.replace(r'[^\x00-\x7F\u4E00-\u9FFF]+', '', regex=True)
df
A B
0 公司 1
1 aaaa 2
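This code removes every character outside the ASCII range (\x00-\x7F) and the CJK Unified Ideographs block (\u4E00-\u9FFF), so characters such as '\u3000' and '\uf505' are dropped while Chinese characters (漢字) are preserved. Note that it also removes any other non-ASCII text, such as accented Latin letters.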
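As a rough sketch of what is going on, the snippet below checks the same character class on plain strings and shows why filtering on a backslash cannot work: '\u3000' in the source code is an escape sequence for a single character, so the string never contains a literal backslash. The helper names strip_by_range and strip_by_category are made up for illustration. The second helper is a different technique from the answer's regex, not a replacement for it: it assumes you only want to drop private-use, unassigned, and format characters plus non-ASCII whitespace, and keep all other text.

```python
import re
import unicodedata

# Same character class as the regex above: keep ASCII and U+4E00-U+9FFF.
KEEP_ASCII_AND_CJK = re.compile(r'[^\x00-\x7F\u4E00-\u9FFF]+')

def strip_by_range(text):
    """Hypothetical helper: remove everything outside ASCII and U+4E00-U+9FFF."""
    return KEEP_ASCII_AND_CJK.sub('', text)

# Assumption: these general categories are the "unwanted" ones.
# Co = private use (e.g. U+F505), Cn = unassigned, Cf = format characters.
DROP_CATEGORIES = {'Co', 'Cn', 'Cf'}

def strip_by_category(text):
    """Hypothetical alternative: drop by Unicode category, keep other scripts."""
    kept = []
    for ch in text:
        if unicodedata.category(ch) in DROP_CATEGORIES:
            continue
        # Drop non-ASCII whitespace such as U+3000 (ideographic space),
        # but keep ordinary ASCII spaces, tabs, and newlines.
        if ch.isspace() and ord(ch) > 0x7F:
            continue
        kept.append(ch)
    return ''.join(kept)

sample = '公司\u3000'
print(len(sample))        # three characters; the escape is one character
print('\\' in sample)     # no literal backslash in the string
print(strip_by_range('aaaa\uf505'))
print(strip_by_category('café\u3000'))
```

Either helper could be applied to the column with df['A'] = df['A'].map(strip_by_range). The range-based version is stricter; the category-based version keeps accented letters and other scripts, which may or may not be what you want.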