I have a dataframe containing strings and a list of words that I want to remove from those strings. However, I also want to keep, unchanged, any string in the df that is made up entirely of words from the list.
Here is an example:
strings_variable |
---|
Avalon Toyota loan |
Blazer Chevrolet |
Suzuki Vitara sales |
Vauxhall Astra |
Buick Special car |
Ford Aerostar |
car refund |
car loan |
data = {'strings_variable': ['Avalon Toyota loan', 'Blazer Chevrolet', 'Suzuki Vitara sales', 'Vauxhall Astra', 'Buick Special car', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)
words_to_remove = ('car','sales','loan','refund')
The final output should look like this:
strings_variable |
---|
Avalon Toyota |
Blazer Chevrolet |
Suzuki Vitara |
Vauxhall Astra |
Buick Special |
Ford Aerostar |
car refund |
car loan |
data = {'strings_variable': ['Avalon Toyota', 'Blazer Chevrolet', 'Suzuki Vitara', 'Vauxhall Astra', 'Buick Special', 'Ford Aerostar', 'car refund', 'car loan']}
df = pd.DataFrame(data)
Note that the words I want to remove appear alongside the car names. However, I would like to keep the rows where the string consists only of words in words_to_remove.
Here is my code (Python) so far:
def remove_words(df):
    df = [word for words in df if word not in words_to_remove]
    return df
strings_variable = strings_variable.apply(remove_words)
I hope it makes sense - thank you in advance!
I'm assuming you're using pandas, because of your use of df and the .apply() method. However, you need to create the DataFrame itself first. Then you can write a function that cleans a single string and either apply it to the Series (if you're only changing one column) or apply it element-wise to the whole DataFrame (probably not what you're looking for).
import pandas as pd
df = pd.DataFrame({
    'strings_variable': [
        'Avalon Toyota loan',
        'Blazer Chevrolet',
        'Suzuki Vitara sales',
        'Vauxhall Astra',
        'Buick Special car',
        'Ford Aerostar',
        'car refund',
        'car loan'
    ]
})
words_to_remove = ('car', 'sales', 'loan', 'refund')
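def remove_words(text: str) -> str:
    """Remove stop words, unless the string is made up entirely of them."""
    # Keep only the words that are not in words_to_remove
    new_text = ' '.join([
        word
        for word in text.split()
        if word not in words_to_remove
    ])
    # If nothing is left, every word was a stop word: keep the original string
    if not new_text:
        new_text = text
    return new_text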
df['strings_variable'] = df['strings_variable'].apply(remove_words)
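# or, to run it on every cell of the DataFrame (probably not this one):
df = df.applymap(remove_words)  # applymap is deprecated since pandas 2.1 in favor of DataFrame.map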
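If you want to check the logic without a DataFrame, here is a minimal sketch of the same idea in plain Python. The name strip_words and its stop_words parameter are my own choices for illustration (not from your code), and I'm assuming that exact, case-sensitive, whitespace-separated matching is what you want.

```python
from typing import Iterable


def strip_words(text: str, stop_words: Iterable[str]) -> str:
    """Drop stop words from text, unless that would leave nothing."""
    stop = set(stop_words)  # set gives fast membership checks
    kept = [word for word in text.split() if word not in stop]
    # An empty list means the string was made up entirely of stop words,
    # so the original string is returned unchanged.
    return ' '.join(kept) if kept else text


strings = [
    'Avalon Toyota loan',
    'Blazer Chevrolet',
    'Suzuki Vitara sales',
    'Vauxhall Astra',
    'Buick Special car',
    'Ford Aerostar',
    'car refund',
    'car loan',
]
words_to_remove = ('car', 'sales', 'loan', 'refund')

cleaned = [strip_words(s, words_to_remove) for s in strings]
print(cleaned)
```

Passing the word list as an argument avoids relying on a global. With pandas, the same function should work as df['strings_variable'].apply(strip_words, args=(words_to_remove,)).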