python string replace substring data-cleaning

Remove listed words from strings, but keep strings made up entirely of those words

I have a dataframe column containing strings and a list of words that I want to remove from those strings. However, I would like to keep unchanged any string in the df that is made up entirely of words from the list.

Here is an example:

strings_variable
Avalon Toyota loan
Blazer Chevrolet
Suzuki Vitara sales
Vauxhall Astra
Buick Special car
Ford Aerostar
car refund
car loan

data = {'strings_variable': ['Avalon Toyota loan', 'Blazer Chevrolet', 'Suzuki Vitara sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)

words_to_remove = ('car','sales','loan','refund')

The final output should look like this:

strings_variable
Avalon Toyota
Blazer Chevrolet
Suzuki Vitara
Vauxhall Astra
Buick Special
Ford Aerostar
car refund
car loan

data = {'strings_variable': ['Avalon Toyota', 'Blazer Chevrolet', 'Suzuki Vitara', 'Vauxhall Astra', 'Buick Special', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)

Note that the words I want to remove appear in addition to the car names. However, I would like to keep, unchanged, the rows whose strings consist only of words in words_to_remove.

Here is my code (Python) so far:

def remove_words(df):
   df = [word for words in df if word not in words_to_remove]
   return df

strings_variable = strings_variable.apply(remove_words)

I hope it makes sense - thank you in advance!

Solution

I'm assuming you're using pandas, given your use of df and the .apply() method. You first need to create the DataFrame itself. Then you can write a function that cleans a single string and either apply it to the Series with .apply() (if you're only changing that one column) or to every cell of the DataFrame with .applymap() (probably not what you're looking for).

import pandas as pd

df = pd.DataFrame({
    'strings_variable': [
        'Avalon Toyota loan',
        'Blazer Chevrolet',
        'Suzuki Vitara sales', 
        'Vauxhall Astra', 
        'Buick Special car', 
        'Ford Aerostar', 
        'car refund', 
        'car loan'
    ]
})

words_to_remove = ('car', 'sales', 'loan', 'refund')

def remove_words(text: str) -> str:
    """Remove stop words, unless the string is made up entirely of them"""
    
    new_text = ' '.join([
        word
        for word in text.split()
        if word not in words_to_remove
    ])
    
    if not new_text:
        new_text = text
        
    return new_text

df['strings_variable'] = df['strings_variable'].apply(remove_words)
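# or, to run the function on every cell of the DataFrame
# (note: applymap is deprecated in favor of DataFrame.map as of pandas 2.1)
df = df.applymap(remove_words) # probably not this one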
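
As a possible variation, here is a minimal sketch of the same idea with the word list passed in as an argument instead of read from a global, and converted to a set for faster membership tests. The parameter name stop_words and the plain list used in place of the DataFrame column are my own choices for illustration, not part of your code.

```python
from typing import Iterable


def remove_words(text: str, stop_words: Iterable[str]) -> str:
    """Drop stop words from text, unless that would leave nothing."""
    stop_set = set(stop_words)  # set membership tests are O(1) on average
    kept = [word for word in text.split() if word not in stop_set]
    # If every word was a stop word, fall back to the original string.
    return ' '.join(kept) if kept else text


words_to_remove = ('car', 'sales', 'loan', 'refund')

# Plain list standing in for df['strings_variable'].
strings = [
    'Avalon Toyota loan',
    'Blazer Chevrolet',
    'Suzuki Vitara sales',
    'Vauxhall Astra',
    'Buick Special car',
    'Ford Aerostar',
    'car refund',
    'car loan',
]

cleaned = [remove_words(s, words_to_remove) for s in strings]
print(cleaned)
```

With pandas, the extra argument can be forwarded through .apply(), e.g. df['strings_variable'].apply(remove_words, stop_words=words_to_remove). Keep in mind that the match is case-sensitive and splits on whitespace only, so 'Car' or 'loan,' would not be removed.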