python, string, replace, substring, data-cleaning

Remove listed words from strings, but keep strings made up entirely of words from the list


I have a DataFrame containing strings and a list of words that I want to remove from those strings. However, I also want to keep, unchanged, any string in the df that is made up entirely of words from the list.

Here is an example:

strings_variable
Avalon Toyota loan
Blazer Chevrolet
Suzuki Vitara sales
Vauxhall Astra
Buick Special car
Ford Aerostar
car refund
car loan

import pandas as pd

data = {'strings_variable': ['Avalon Toyota loan', 'Blazer Chevrolet', 'Suzuki Vitara sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)
words_to_remove = ('car', 'sales', 'loan', 'refund')

The final output should look like this:

strings_variable
Avalon Toyota
Blazer Chevrolet
Suzuki Vitara
Vauxhall Astra
Buick Special
Ford Aerostar
car refund
car loan

data = {'strings_variable': ['Avalon Toyota', 'Blazer Chevrolet', 'Suzuki Vitara', 'Vauxhall Astra', 'Buick Special', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)

Note that the words I want to remove appear in addition to the car names. However, I would like to keep a row unchanged when its string is made up only of words in words_to_remove.

Here is my code (Python) so far:

def remove_words(df):
   df = [word for words in df if word not in words_to_remove]
   return df

strings_variable = strings_variable.apply(remove_words)

I hope it makes sense - thank you in advance!


Solution

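  • I'm assuming you're using pandas, because of your use of df and the .apply() method. However, you need to create the DataFrame itself first. Note also that the function in your attempt never splits the string into words, and its comprehension loops over words while testing word. Instead, write a function that handles a single string, then apply it to the Series (if you're only changing that column) or to every cell of the DataFrame with applymap (probably not what you're looking for).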

    import pandas as pd
    
    df = pd.DataFrame({
        'strings_variable': [
            'Avalon Toyota loan',
            'Blazer Chevrolet',
            'Suzuki Vitara sales', 
            'Vauxhall Astra', 
            'Buick Special car', 
            'Ford Aerostar', 
            'car refund', 
            'car loan'
        ]
    })
    
    words_to_remove = ('car', 'sales', 'loan', 'refund')
    
    def remove_words(text: str) -> str:
        """Remove stop words, unless the string is made up entirely of them."""
        
        new_text = ' '.join([
            word
            for word in text.split()
            if word not in words_to_remove
        ])
        
        if not new_text:
            new_text = text
            
        return new_text
    
    df['strings_variable'] = df['strings_variable'].apply(remove_words)
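    # or, to apply it to every cell of the DataFrame
    # (applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
    df = df.applymap(remove_words)  # probably not this one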
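
As a quick sanity check, here is a minimal sketch of the same logic without pandas, run against the sample strings from the question. The names WORDS_TO_REMOVE, strings and cleaned are illustrative only. Using a frozenset instead of the tuple is my own assumption (faster membership tests), not something the question requires.

```python
# Hypothetical names for illustration: WORDS_TO_REMOVE, strings, cleaned.
# Assumption: a frozenset is used instead of the original tuple.
WORDS_TO_REMOVE = frozenset({'car', 'sales', 'loan', 'refund'})


def remove_words(text: str, stop_words: frozenset = WORDS_TO_REMOVE) -> str:
    """Drop stop words, unless that would leave nothing behind."""
    kept = [word for word in text.split() if word not in stop_words]
    # An empty list means every word was a stop word: keep the original text.
    return ' '.join(kept) if kept else text


# Sample strings copied from the question.
strings = [
    'Avalon Toyota loan',
    'Blazer Chevrolet',
    'Suzuki Vitara sales',
    'Vauxhall Astra',
    'Buick Special car',
    'Ford Aerostar',
    'car refund',
    'car loan',
]

cleaned = [remove_words(s) for s in strings]

for before, after in zip(strings, cleaned):
    print(f'{before!r} -> {after!r}')
```

The function takes one string at a time, so it should also work when passed to df['strings_variable'].apply(remove_words). Matching is case-sensitive and on whole whitespace-separated words, so 'Car' or 'car,' would not be removed.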