
How to remove certain words from text while keeping punctuation marks


I have the following code that removes listed Bangla words from a given text. It works for bare words, but it fails when a word is directly followed by a punctuation mark. For example, from the input text "বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।" it removes "তথা" and "না" (both are in word_list), but it cannot remove না when punctuation is attached ("না," and "না।"). I want to remove those words as well while keeping the punctuation. Please see the current and expected output below. Thanks a lot. Punctuation list: [,।?]

word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    return ' '.join(w for w in text.split() if w not in word_list)
remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')

Current Output::: 'বিশ্বের দূষিত বায়ুর না, শহরের না।'

Expected Output::: 'বিশ্বের দূষিত বায়ুর, শহরের।'


Solution

  • The following code does what you desire:

    import re
    
    word_list = {'নিজের', 'তথা', 'না'}
    def remove_w(text):
        result = ''.join(w for w in re.split(r'([ ,।\?])', text) if w not in word_list)
        # sanitize the result:
        result = re.sub(r' +([,।\?])', r'\1', result)
        result = re.sub(r' +', r' ', result)
        return result
    
    remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')
    # 'বিশ্বের দূষিত বায়ুর, শহরের।'
    

    The capturing parentheses in r'([ ,।\?])' make re.split keep the delimiters (space and punctuation) as separate items in the returned list, as the Python documentation describes:

    re.split(pattern, string) [simplified: default arguments omitted]
    If capturing parentheses are used in pattern, then the text of all groups in the pattern [is] also returned as part of the resulting list.
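
To make the effect of the capturing group concrete, here is a small sketch that splits a fragment of the question's text with and without parentheses. The variable names are only illustrative and are not part of the answer's code.

```python
import re

text = 'না, শহরের না।'

# Without a capturing group, the delimiters are discarded.
without_group = re.split(r'[ ,।?]', text)

# With a capturing group, each delimiter is kept as its own list item.
with_group = re.split(r'([ ,।?])', text)

print(without_group)
print(with_group)

# Because nothing is lost, joining the pieces gives back the input.
print(''.join(with_group) == text)
```

In the second list the comma, the space and the daṇḍa appear as separate items, which is why the words can be filtered out while the punctuation survives the join.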

    Note that removing a word leaves the spaces around it behind, so the result needs manual sanitization:

    • spaces before punctuation will be removed
    • multiple successive spaces are merged into a single space
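
The two clean-up steps can be traced on their own, as in the sketch below. The string `raw` is my reconstruction of what the joined result looks like before sanitization, and the final `strip()` is an extra assumption that is not in the answer's code: it would only matter if the text started or ended with a removed word.

```python
import re

# Intermediate string after the listed words were dropped:
# stray spaces remain where the words used to be.
raw = 'বিশ্বের  দূষিত  বায়ুর , শহরের ।'

# Step 1: remove spaces that directly precede a punctuation mark.
step1 = re.sub(r' +([,।?])', r'\1', raw)

# Step 2: collapse runs of spaces into a single space.
step2 = re.sub(r' +', ' ', step1)

# Optional extra step (assumption, not in the original answer):
# trim a leading or trailing space left by a removed first or last word.
step3 = step2.strip()

print(repr(step1))
print(repr(step2))
print(repr(step3))
```

The order matters: running step 1 first means the space in front of a comma or daṇḍa is deleted entirely rather than merely shortened to one space.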

    I would also like to draw other readers' attention to the fact that । is a Bengali punctuation mark (দাঁড়ি, commonly referred to in English as daṇḍa), not an ASCII vertical bar |.
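
If it is unclear which of the two characters ended up in a pattern, the standard `unicodedata` module can tell them apart. This is a minimal check, assuming the mark used in the text is U+0964; Unicode names that code point after the Devanagari script, although it is shared by Bengali.

```python
import unicodedata

danda = '\u0964'  # sentence-final mark used in Bangla text (assumed code point)
bar = '|'         # ASCII vertical bar, U+007C

# Print the code point and the official Unicode name of each character.
for ch in (danda, bar):
    print(f'U+{ord(ch):04X}', unicodedata.name(ch))
```

The two characters look alike in many fonts, but a character class written with the vertical bar will not match the daṇḍa.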