I'm trying to tokenize the text below, using the stopwords ('is', 'the', 'was') as delimiters.
The expected output is this:
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'best person I know']
This is the code I'm using to try to produce the above output:
import nltk
stopwords = ['is', 'the', 'was']
sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")
sents_rm_stopwords = []
for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))
My code's output is this:
['Walter feeling anxious .',
'He diagnosed today .',
'He probably best person I know .']
How can I get the expected output?
So the problem involves both stopwords and sentence delimiters. Assuming that we can treat the symbol . as the end of a sentence, you can split on several delimiters at once by using re.split().
import re
s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
result = re.split(r" was | is | the |\. |\.", s)
result
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know',
'']
Because the pattern matches both a single . and . followed by a whitespace, the split result will contain an additional '' at the end. Assuming that this sentence structure is consistent, you can slice the result to get your expected output.
result[:-1]
>>
['Walter',
'feeling anxious',
'He',
'diagnosed today',
'He probably',
'the best person I know']
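
If you want to avoid hard-coding the pattern, here is a minimal sketch of the same idea (assuming, as above, that stopwords should only match as whole words surrounded by spaces). It builds the pattern from the stopword list with re.escape() and filters out empty strings instead of slicing:

import re

s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
stopwords = ['is', 'the', 'was']

# Build alternatives like " is ", " the ", " was " plus the sentence-ending ". " and "."
parts = [rf" {re.escape(w)} " for w in stopwords] + [r"\. ", r"\."]
pattern = "|".join(parts)

# Split on all delimiters at once and drop the empty string left by the final period
result = [chunk for chunk in re.split(pattern, s) if chunk]
print(result)

This gives the same list as result[:-1] above, and the filtering step also works if the text does not end with a period, where slicing off the last element would drop real content.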