I am trying to remove some erroneous data from two columns in a DataFrame. The columns are subject to corruption where symbols occur within the columns values. I want to check all values in two columns and replace identified values with '' when a symbol is present.
For example:
import pandas as pd
bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']
d = {'p1' : [1,2,3,4,5,6],
'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)
p1 p2 p3
0 1 abc* abc
1 2 abc@ abc.
2 3 zxya zxya
3 4 &sdf &sdf
4 5 p xx p xx
5 6 abcd abcd
I have been trying unsuccessfully to uses list comprehensions to iterate over the bad_chars variable and replace the value in columns p2 and p3 with empty '' resulting in something like this:
p1 p2 p3
0 1 abc
1 2
2 3 zxya zxya
3 4
4 5
5 6 abcd abcd
Once I have achieved this I would like to remove any rows containing an empty cell in either column p2, p3 or both.
p1 p2 p3
0 3 zxya zxya
1 6 abcd abcd
Here you go:
import pandas as pd
bad_chars = ['\,', '\@', '\/', '\!', '\&', '\*', '\.', '\_', '\ ']
d = {'p1' : [1,2,3,4,5,6],
'p2' : ['abc*', 'abc@', 'zx_ya', '&sdf', 'p xx', 'abcd'],
'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
df = pd.DataFrame(d)
df.loc[df['p2'].str.contains('|'.join(bad_chars)), 'p2'] = None
df.loc[df['p3'].str.contains('|'.join(bad_chars)), 'p3'] = None
df = df.dropna(subset=['p2', 'p3'])
df
note that I have changed bad_chars (added \ to them)