I'm trying to use spaCy to remove stop words from a pandas DataFrame created from a CSV file. I need to handle tokens that mix letters and digits.
My issue:
If a number splits a word so that one of the resulting pieces is a stop word, spaCy deletes that piece of the word.
Ex. With stop word at the end
Input: 'co555in'
Breaks up the word, separating it into 'co' + 555 + 'in'
Removes 'in' because it is a stop word.
Output: 'co555'
Ex. Without stop word at the end
Input: 'co555inn'
Breaks up the word, separating it into 'co' + 555 + 'inn'
Will not remove 'inn' because it is not a stop word.
Output: 'co555inn'
Current implementation:
df[col] = df[col].apply(lambda text:
    "".join(token.lemma_ for token in nlp(text)
            if not token.is_stop))
What I'd like is for tokens that mix letters and digits to be treated as a single word, so that spaCy does not strip part of the word just because the digits split it into pieces and one of those pieces happens to be a stop word.
UPDATE: According to the devs this is a feature and not a bug. So a workaround, like the answer below, is necessary to account for these edge cases.
Edit #2: Simplified the code. It now uses Python's re module to set aside any words that contain digits, then splits the rest of the text into words and filters those against the stop-word list. Also added safeguards to ensure that punctuation does not cause an error.
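To make the regex step concrete, here is a small sketch of what the pattern `\w*\d+\w*` picks out and what is left behind. The sample sentence is made up for illustration; only the standard library is used.

```python
import re

# Pattern used in the answer: a run of word characters containing at least one digit.
MIXED = re.compile(r'\w*\d+\w*')

sample = "meet me in co555in at 7pm"  # hypothetical sample text

# Tokens that contain a digit are collected as whole words...
mixed_tokens = MIXED.findall(sample)
# ...and removed from the text, leaving only the plain words (and extra spaces).
remainder = MIXED.sub('', sample)

print(mixed_tokens)     # ['co555in', '7pm']
print(repr(remainder))  # 'meet me in  at '
```

Note that `\w` also matches underscores, so something like `foo_1` would be collected as one token as well.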
Here is my remove_stopwords method, with some additional code included that I used for testing.
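import spacy
import pandas as pd
import re

nlp = spacy.load('en_core_web_sm')


def remove_stopwords(text):
    """
    Removes stop words from a text source
    """
    # Set aside every word that contains a digit so it is never stop-word filtered.
    number_words = re.findall(r'\w*\d+\w*', text)
    # Strip those words from the text, then split what is left on non-word characters.
    remove_numbers = re.sub(r'\w*\d+\w*', '', text)
    split_text = re.split(r'\W+', remove_numbers)
    # Skip the empty strings that re.split can produce at the edges of the text.
    remove_stop_words = [word for word in split_text
                         if word and not nlp.vocab[word].is_stop]
    # Note: the digit-containing words end up ahead of all the other words.
    final_words = number_words + remove_stop_words
    return " ".join(final_words)


df = pd.read_csv('input_file.csv', sep='\t')  # replace with your CSV file; adjust sep if it is not tab-separated
df['text'] = df['text'].apply(remove_stopwords)
df.to_csv('output_file.csv', index=False)  # replace with your desired output file name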
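One thing to be aware of: the function above puts all digit-containing words first, so the original word order is lost. If order matters, a single pass over the words is one possible alternative. This is only a sketch under a few assumptions: `STOP_WORDS` here is a tiny stand-in set so the example runs without a model (with spaCy installed, `spacy.lang.en.stop_words.STOP_WORDS` could be passed instead), and `remove_stopwords_keep_order` is a hypothetical name, not part of spaCy.

```python
import re

# Stand-in stop-word list (assumption) so the sketch runs without spaCy.
STOP_WORDS = {"in", "the", "a", "is", "at", "of"}

HAS_DIGIT = re.compile(r'\d')


def remove_stopwords_keep_order(text, stop_words=STOP_WORDS):
    """Drop stop words but keep any word containing a digit, preserving order."""
    kept = []
    # \w+ yields whole words, so 'co555in' is never split around its digits.
    for word in re.findall(r'\w+', text):
        if HAS_DIGIT.search(word) or word.lower() not in stop_words:
            kept.append(word)
    return " ".join(kept)


print(remove_stopwords_keep_order("The part co555in is in the box"))
# part co555in box
```

Like the original, this discards punctuation, and contractions such as "don't" are split at the apostrophe, so it may need adjusting for text where that matters.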