Search code examples
pythonpandasstringstring-comparison

Remove word within a string based on another columns value


I have two columns that is a combination of comma separated words and single words in a string format. col1 will always only contain one word. In this example I will use the word Dog as the word to have in col1, but this will differ in the real data, so please do not make a solution that uses regex on Dog specifically.

df = pd.DataFrame({"col1": ["Dog", "Dog", "Dog", "Dog"],
                     "col2": ["Cat, Mouse", "Dog", "Cat", "Dog, Mouse"]})

I want to check if the word in col1 appears in the string in col2, and if it does, I want to remove that word from col2. But keep in mind that I want to keep the rest of the string if there are more words left. So it will go from this:

    col1    col2    
0   Dog     Cat, Mouse
1   Dog     Dog
2   Dog     Cat
3   Dog     Dog, Mouse

To this:

    col1    col2
0   Dog     Cat, Mouse
1   Dog 
2   Dog     Cat
3   Dog     Mouse

Solution

  • (^,|,$) to handle starting & trailing comma
    (,\s|,) will remove comma those getting retained after replace operation.
    {1,} to skip non-repeated comma

    df['col2'] = df['col2'].str. \
        replace("|".join(df['col1'].unique()), "").str.strip() \
        .str.replace("(?:^,|,$)", "") \
        .str.replace("(?:,\s|,){1,}", ",")
    
      col1          col2
    0  Dog     Cat,Mouse
    1  Dog              
    2  Dog           Cat
    3  Dog   Mouse,Mouse