I have the following code, which removes Bangla words from a given text. It removes the listed words successfully, but it fails when a word is directly followed by punctuation. For example, from the input text "বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।" it removes "তথা" and "না" (both are in word_list), but it cannot remove না when punctuation is attached ("না," and "না।"). I want to remove those words as well while keeping the punctuation. Please see the current and expected output below. Thanks a lot. Punctuation list: [,।?]
word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    return ' '.join(w for w in text.split() if w not in word_list)
remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')
Current output: 'বিশ্বের দূষিত বায়ুর না, শহরের না।'
Expected output: 'বিশ্বের দূষিত বায়ুর, শহরের।'
The following code does what you desire:
import re
word_list = {'নিজের', 'তথা', 'না'}
def remove_w(text):
    result = ''.join(w for w in re.split(r'([ ,।\?])', text) if w not in word_list)
    # sanitize the result:
    result = re.sub(r' +([,।\?])', r'\1', result)
    result = re.sub(r' +', r' ', result)
    return result
remove_w('বিশ্বের তথা দূষিত না বায়ুর না, শহরের না।')
# 'বিশ্বের দূষিত বায়ুর, শহরের।'
The parentheses in r'([ ,।\?])' serve to keep the delimiters in the result, as the documentation of re.split explains:
re.split(pattern, string)
[simplified: default arguments omitted]
If capturing parentheses are used in pattern, then the text of all groups in the pattern [is] also returned as part of the resulting list.
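To see what the capturing group changes, here is a minimal sketch that splits a short phrase from the example with and without the parentheses. The variable names are only illustrative.

```python
import re

text = 'না, শহরের না।'

# Without a capturing group the delimiters are discarded.
without_group = re.split(r'[ ,।\?]', text)
# ['না', '', 'শহরের', 'না', '']

# With a capturing group every delimiter is kept as its own list item.
with_group = re.split(r'([ ,।\?])', text)
# ['না', ',', '', ' ', 'শহরের', ' ', 'না', '।', '']

# Nothing is lost, so joining the pieces restores the original text.
restored = ''.join(with_group)
```

The empty strings appear wherever two delimiters are adjacent or the text ends with a delimiter. They are harmless here, because joining with '' ignores them.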
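Note that the result needs manual sanitization: removing a word leaves its surrounding spaces behind, so the two re.sub calls drop any spaces before a punctuation mark and then collapse runs of spaces into a single space.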
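As a rough illustration of why the cleanup is needed, the sketch below runs the same three steps one at a time on a shortened input and keeps each intermediate string. The names raw, step1 and step2 are made up for this example.

```python
import re

word_list = {'নিজের', 'তথা', 'না'}
text = 'বিশ্বের না শহরের না।'

# Step 0: drop the listed words; their neighbouring spaces stay behind.
raw = ''.join(w for w in re.split(r'([ ,।\?])', text) if w not in word_list)
# raw now has two spaces between the words and one space before '।'

# Step 1: remove spaces that ended up directly before a punctuation mark.
step1 = re.sub(r' +([,।\?])', r'\1', raw)

# Step 2: collapse the remaining runs of spaces into one space.
step2 = re.sub(r' +', ' ', step1)

for label, value in (('raw', raw), ('step1', step1), ('step2', step2)):
    print(label, repr(value))
```

One caveat, as far as I can tell: if a listed word is the very first or last token of the text, a leading or trailing space survives these two substitutions, so a final .strip() may be worth adding.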
I would also like to draw other readers' attention to the fact that । is a Bengali punctuation mark (দাঁড়ি, commonly referred to in English as daṇḍa), not an ASCII vertical bar |.
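If you are unsure which of the two characters ended up in your pattern, a quick check of the code points settles it. This is only a small sketch using the standard unicodedata module, and the variable names are illustrative.

```python
import re
import unicodedata

danda = '\u0964'  # the character । used in the pattern above
bar = '|'         # ASCII vertical bar, easy to type by mistake

for ch in (danda, bar):
    print(hex(ord(ch)), unicodedata.name(ch))

# A character class written with the ASCII bar does not split on the danda.
sample = 'শহরের' + danda
split_with_bar = re.split(r'([|])', sample)
split_with_danda = re.split(r'([' + danda + r'])', sample)
```

Here split_with_bar comes back as a single unsplit item, while split_with_danda separates the word from the punctuation mark.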