Tags: python, text, replace

How to remove a list of words from a text only when they appear as whole words, not as parts of other words


I have a list of words that I want to remove from a given text. With my limited Python knowledge, I tried replacing each word in the list with an empty string in a loop. It mostly worked, but it also replaced every matching substring, including chunks inside longer words. Please see the code and output below:

word_list = {'the', 'mind', 'pen'}
def remove_w(text):
  for word in word_list:
    text = text.replace(word, '')
  return text
remove_w('A pencil is over a thermometer with mind itself.')

The output is:

'A cil is over a rmometer with itself.'

It removed parts of other words. What I actually wanted is the following output:

A pencil is over a thermometer with itself.

How can I remove such a list of words from a text only when each one appears as a whole word, not as part of another word? (Since I will use this on large articles, please suggest a fast approach.) Thank you.


Solution

  • You can use a regular expression with word boundaries (\b), so that each word matches only where it stands alone and not inside a longer word.

    import re

    # Build one alternation of the escaped words, each wrapped in \b word boundaries.
    pattern = re.compile('|'.join(rf'\b{re.escape(w)}\b' for w in word_list))
    def remove_w(text):
        return pattern.sub('', text)
    

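    Alternatively, use str.split to separate the text into whitespace-delimited words, drop the ones that exactly match an entry in the set, then join the rest back together. Note that a word with punctuation attached (such as 'mind.') will not match, and runs of whitespace collapse to single spaces.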

    def remove_w(text):
        return ' '.join(w for w in text.split() if w not in word_list)
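
As a minimal sketch of the regular-expression approach, the version below is not part of the original answer. It assumes the same word_list as in the question, groups all words into one alternation between a single pair of \b boundaries, and also consumes the whitespace that follows a removed word so no double space is left behind. The name remove_words and the trailing \s* are assumptions made for illustration.

```python
import re

# Same set as in the question; remove_words is a hypothetical name.
word_list = {'the', 'mind', 'pen'}

# One non-capturing group of escaped words between a single pair of \b
# boundaries. The trailing \s* (an addition, not in the original answer)
# also removes whitespace after the word.
# Pass flags=re.IGNORECASE to re.compile if 'The' should match as well.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(w) for w in word_list) + r')\b\s*'
)

def remove_words(text):
    # Substitute every whole-word match with nothing, then trim the ends.
    return pattern.sub('', text).strip()

print(remove_words('A pencil is over a thermometer with mind itself.'))
# prints: A pencil is over a thermometer with itself.
```

The pattern is compiled once and reused, so each article is scanned in a single pass no matter how many words are in the set. A removed word that sits directly before punctuation still leaves the preceding space in place.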