Search code examples
pythonpandasdataframelist-comprehension

Replace values of Pandas DataFrame columns based upon substrings of symbols in a list


I am trying to remove some erroneous data from two columns in a DataFrame. The columns are subject to corruption where symbols occur within the columns values. I want to check all values in two columns and replace identified values with '' when a symbol is present.

For example:

import pandas as pd

bad_chars = [')', ',', '@', '/', '!', '&', '*', '.', '_', ' ']


d = {'p1' : [1,2,3,4,5,6],
    'p2' : ['abc*', 'abc@', 'zxya', '&sdf', 'p xx', 'abcd'],
    'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}

df = pd.DataFrame(d) 

    p1  p2      p3
0   1   abc*    abc
1   2   abc@    abc.
2   3   zxya    zxya
3   4   &sdf    &sdf
4   5   p xx    p xx
5   6   abcd    abcd

I have been trying unsuccessfully to uses list comprehensions to iterate over the bad_chars variable and replace the value in columns p2 and p3 with empty '' resulting in something like this:

    p1  p2      p3
0   1           abc
1   2           
2   3   zxya    zxya
3   4       
4   5       
5   6   abcd    abcd

Once I have achieved this I would like to remove any rows containing an empty cell in either column p2, p3 or both.

    p1  p2      p3
0   3   zxya    zxya
1   6   abcd    abcd

Solution

  • Here you go:

    import pandas as pd
    
    bad_chars = ['\,', '\@', '\/', '\!', '\&', '\*', '\.', '\_', '\ ']
    
    
    d = {'p1' : [1,2,3,4,5,6],
        'p2' : ['abc*', 'abc@', 'zx_ya', '&sdf', 'p xx', 'abcd'],
        'p3' : ['abc', 'abc.', 'zxya', '&sdf', 'p xx', 'abcd']}
    
    df = pd.DataFrame(d)
    df.loc[df['p2'].str.contains('|'.join(bad_chars)), 'p2'] = None
    df.loc[df['p3'].str.contains('|'.join(bad_chars)), 'p3'] = None
    df = df.dropna(subset=['p2', 'p3'])
    df
    

    note that I have changed bad_chars (added \ to them)