How to remove certain words from text while keeping punctuation marks

I have the following code that removes Bangla words from a given text. It removes the listed words successfully, but it fails when a word is directly followed by a punctuation mark. For example, from the input text "বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।" it removes "তথা" and "না" (both are in word_list), but it does not remove the occurrences of না that carry punctuation ("না," and "না।"). I want to remove those words as well while keeping the punctuation. Please see the current and expected output below. Thanks a lot. Punctuation list = [,।?]

word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    return ' '.join(w for w in text.split() if w not in word_list)
remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')

Current output: 'বিশ্বের দূষিত বায়ুর না, শহরের না।'

Expected output: 'বিশ্বের দূষিত বায়ুর, শহরের।'

Solution

The following code does what you desire:

import re

word_list = {'নিজের', 'তথা', 'না'}
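def remove_w(text):
    # Split on spaces and punctuation; the capturing group keeps each delimiter as its own item.
    result = ''.join(w for w in re.split(r'([ ,।\?])', text) if w not in word_list)
    # sanitize the result:
    result = re.sub(r' +([,।\?])', r'\1', result)  # drop spaces left before punctuation
    result = re.sub(r' +', r' ', result)           # merge runs of spaces into one
    return result.strip()                          # drop a space left at either end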

remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')
# 'বিশ্বের দূষিত বায়ুর, শহরের।'

The parentheses within r'([ ,।\?])' serve to keep the delimiters in the result:

re.split(pattern, string) [simplified: default arguments omitted]
If capturing parentheses are used in pattern, then the text of all groups in the pattern [is] also returned as part of the resulting list.
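
To make the effect of the capturing group concrete, here is a small sketch that splits the same string with and without the parentheses. The sample string and the variable names are chosen only for illustration.

```python
import re

text = 'দূষিত না, শহরের'

# Without a capturing group the delimiters are thrown away.
without_group = re.split(r'[ ,।\?]', text)
print(without_group)  # ['দূষিত', 'না', '', 'শহরের']

# With a capturing group every delimiter comes back as its own list item,
# so joining the pieces reproduces the original text exactly.
with_group = re.split(r'([ ,।\?])', text)
print(with_group)     # ['দূষিত', ' ', 'না', ',', '', ' ', 'শহরের']
print(''.join(with_group) == text)  # True
```

The empty string in both lists comes from the comma and the space being adjacent: there is nothing between the two delimiters. It is harmless here because joining an empty string adds nothing.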

Note that the result needs manual sanitization, because removing a word leaves the delimiters around it behind:

spaces before punctuation will be removed
multiple successive spaces are merged into a single space
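
The cleanup is easier to follow when each step is looked at separately. The sketch below uses a shortened, made-up variant of the example sentence that also starts with a listed word, and the intermediate variable names are my own. It additionally trims the ends with str.strip(), since the two substitutions alone leave a leading space when the first word is removed.

```python
import re

word_list = {'তথা', 'না'}
text = 'না বিশ্বের তথা দূষিত না, শহরের না।'

# Drop the listed words; the delimiters they sat between stay behind.
raw = ''.join(w for w in re.split(r'([ ,।\?])', text) if w not in word_list)
print(repr(raw))      # ' বিশ্বের  দূষিত , শহরের ।'

# Step 1: remove spaces that now sit directly before a punctuation mark.
step1 = re.sub(r' +([,।\?])', r'\1', raw)
print(repr(step1))    # ' বিশ্বের  দূষিত, শহরের।'

# Step 2: merge runs of spaces into a single space.
step2 = re.sub(r' +', ' ', step1)
print(repr(step2))    # ' বিশ্বের দূষিত, শহরের।'

# Extra step: trim the space left over from the removed first word.
trimmed = step2.strip()
print(repr(trimmed))  # 'বিশ্বের দূষিত, শহরের।'
```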

I would also like to draw other readers' attention to the fact that । is a Bengali punctuation mark (দাঁড়ি, in English commonly referred to as daṇḍa), not an ASCII vertical bar |.
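
If in doubt about which of the two characters ended up in a pattern, the standard unicodedata module can tell them apart. This is only a quick check, and it assumes the mark is U+0964, the code point normally used for the Bengali দাঁড়ি. Unicode names it DEVANAGARI DANDA because the same code point is shared across several Indic scripts.

```python
import unicodedata

danda = '\u0964'  # the mark written as ।
bar = '|'         # ASCII vertical bar

print(f'U+{ord(danda):04X}', unicodedata.name(danda))  # U+0964 DEVANAGARI DANDA
print(f'U+{ord(bar):04X}', unicodedata.name(bar))      # U+007C VERTICAL LINE
print(danda == bar)                                    # False
```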