Search code examples
pythonstringpandasreplacecontains

Pandas : Replace string column values (equal, contains, case)


I have datafarme as below.

ID   COUNTRY   GENDER    AGE  V1   V2   V3   V4   V5
1    1    1    53   APPLE     apple     bosck     APPLE123  xApple111t
2    2    2    51   BEKO beko SIMSUNG   SamsungO123    ttBeko111t
3    3    1    24   SAMSUNG   bosch     SEMSUNG   BOSC1123  uuSAMSUNG111t

I want to replace to np.nan if there are same value in list or contain specific value. I tried below but occurred error.

remove_list = ['APPLE', 'BEKO']

remove_contain_list = ['SUNG', 'bosc']

df.iloc[:,4:].str.replace(remove_list, np.nan, case=False) # exact match & case sensitive
df.iloc[:,4:].str.contains(remove_contain_list, np.nan, case=False) # contain & case sensitive

How can I solve these problems?


Solution

  • You can create MultiIndex Series by DataFrame.stack, get masks for exact and partial matches by Series.isin with lowercase values and Series.str.contains, replace by Series.mask (default value for replace is NaN, so no necessary specify) and last Series.unstack and assign back:

    remove_list = ['APPLE', 'BEKO']
    remove_contain_list = ['SUNG', 'bosc']
    
    s = df.iloc[:,4:].stack(dropna=False)
    m1 = s.str.lower().isin([x.lower() for x in remove_list])
    m2 = s.str.contains('|'.join(remove_contain_list), case=False)
    s = s.mask(m1 | m2)
    
    df.iloc[:,4:] = s.unstack()
    print (df)
       ID  COUNTRY  GENDER  AGE   V1   V2   V3        V4          V5
    0   1        1       1   53  NaN  NaN  NaN  APPLE123  xApple111t
    1   2        2       2   51  NaN  NaN  NaN       NaN  ttBeko111t
    2   3        3       1   24  NaN  NaN  NaN       NaN         NaN
    

    EDIT: You can replace mask to background color if match in Styler.apply:

    def color(x): 
        c1 = 'background-color: yellow'
        c = ''
    
        remove_list = ['APPLE', 'BEKO']
        remove_contain_list = ['SUNG', 'bosc']
    
        s = x.iloc[:,4:].stack(dropna=False)
        m1 = s.str.lower().isin([i.lower() for i in remove_list])
        m2 = s.str.contains('|'.join(remove_contain_list), case=False)
        m = m1| m2
    
        df1 = pd.DataFrame(c, index=x.index, columns=x.columns)
        mask = m.unstack(fill_value=False).reindex(x.columns, fill_value=False, axis=1)   
        df1 = df1.mask(mask, c1)
        return df1
    
    df.style.apply(color,axis=None)