I have a list of words that I want to remove from a given text. With my limited Python knowledge, I tried replacing each word in the list with an empty string in a loop. It works, but the problem is that it replaces every matching substring, even when the match is only part of a larger word. Please see the code and output below:
word_list = {'the', 'mind', 'pen'}
def remove_w(text):
    for word in word_list:
        text = text.replace(word, '')
    return text
remove_w('A pencil is over a thermometer with mind itself.')
The output is:
'A cil is over a rmometer with itself.'
It removed parts of some words. However, I wanted the following output:
A pencil is over a thermometer with itself.
How can I remove the words in such a list from a text only when each occurs as a whole word, not as part of another word? (Since I will use this on large articles, please suggest a fast approach.) Thank you.
You can use a regular expression with word boundaries.
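import re
# Compile once so the same pattern is reused for every article.
# \b matches only at a word boundary, so 'pen' no longer matches inside 'pencil'.
pattern = re.compile('|'.join(rf'\b{re.escape(w)}\b' for w in word_list))
def remove_w(text):
    return pattern.sub('', text)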
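Note that the substitution above removes only the word itself, so the spaces on either side of it stay behind and leave a double space. Here is a minimal sketch that also consumes the whitespace following a removed word, assuming the same word_list as in the question. The name remove_w_clean is only for illustration.

```python
import re

word_list = {'the', 'mind', 'pen'}

# One shared pair of \b anchors around a single non-capturing alternation;
# the trailing \s* also swallows whitespace that follows a removed word.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, sorted(word_list))) + r')\b\s*'
)

def remove_w_clean(text):
    return pattern.sub('', text)

print(remove_w_clean('A pencil is over a thermometer with mind itself.'))
# A pencil is over a thermometer with itself.
```

If a removed word is immediately followed by punctuation, the space before it is not consumed, so some extra cleanup may still be needed depending on the input.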
Alternatively, use str.split to separate the text into words delimited by whitespace, remove the words that exactly match one of those in the set, then join the rest back together.
def remove_w(text):
    return ' '.join(w for w in text.split() if w not in word_list)
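One caveat with the split approach: a token such as 'mind.' or 'pen,' keeps its punctuation, so it does not exactly match an entry in the set and is left in place. Joining with a single space also collapses any runs of whitespace or newlines in the original text. The sketch below compares each token with surrounding punctuation stripped, assuming that is the desired behaviour. The name remove_w_split is hypothetical, and the comparison stays case-sensitive as in the code above.

```python
import string

word_list = {'the', 'mind', 'pen'}

def remove_w_split(text):
    kept = []
    for token in text.split():
        # Compare a copy without leading/trailing punctuation,
        # but keep the original token when it is not removed.
        core = token.strip(string.punctuation)
        if core not in word_list:
            kept.append(token)
    return ' '.join(kept)

print(remove_w_split('A pencil is over a thermometer with mind itself.'))
# A pencil is over a thermometer with itself.
```

With this variant, punctuation attached to a removed word is dropped along with it, which may or may not be what you want for full articles.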