
How to treat certain words as delimiters in nltk Python?


I'm trying to tokenize the text below using the stopwords ('is', 'the', 'was') as delimiters.

The expected output is this:

['Walter', 
 'feeling anxious', 
 'He', 
 'diagnosed today', 
 'He probably', 
 'best person I know']

This is the code I wrote to try to produce the output above:

import nltk 
stopwords = ['is', 'the', 'was']

sents = nltk.sent_tokenize("Walter was feeling anxious. He was diagnosed today. He probably is the best person I know.")

sents_rm_stopwords = [] 

for sent in sents:
    sents_rm_stopwords.append(' '.join(w for w in nltk.word_tokenize(sent) if w not in stopwords))

My code produces this instead:

['Walter feeling anxious .',
 'He diagnosed today .', 
 'He probably best person I know .']

How can I produce the expected output?


Solution

  • The problem involves two kinds of delimiters: the stopwords and the sentence boundaries. Assuming that every sentence ends with a period (.), you can treat the period as one more delimiter and split on all of them at once with re.split().

    import re
    s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
    result = re.split(r" was | is | the |\. |\.", s)  # raw string so \. is not treated as an invalid string escape
    
    result
    >>
    ['Walter',
     'feeling anxious',
     'He',
     'diagnosed today',
     'He probably',
     'the best person I know',
     '']
    

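    Because the text ends with a ., which is itself a delimiter, re.split() returns an additional '' at the end. Assuming that this sentence structure is consistent, you can slice off the last element. Note that 'the' survives in the last item: the match for " is " consumes the space that " the " would need in front of it, so two adjacent stopwords are not both removed.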

    result[:-1]
    >>
    ['Walter',
     'feeling anxious',
     'He',
     'diagnosed today',
     'He probably',
     'the best person I know']
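
If adjacent stopwords such as "is the" should both be dropped, one possible variant is to match the stopwords on word boundaries rather than on surrounding spaces, then strip each piece and discard the empty ones. This is only a sketch under the same assumption as above (a period ends each sentence); the helper name split_on_stopwords is made up for illustration and is not part of re or nltk.

```python
import re

def split_on_stopwords(text, stopwords):
    """Split text on whole-word stopwords and on periods, dropping empty pieces.

    Hypothetical helper for illustration; not an nltk or re function.
    """
    # Escape each stopword and require word boundaries,
    # so that 'is' does not match inside a word such as 'this'.
    alternation = "|".join(re.escape(word) for word in stopwords)
    pattern = rf"\b(?:{alternation})\b|\."
    # Strip whitespace around each piece and discard pieces that end up empty
    # (for example the gap between 'is' and 'the', or the tail after the last period).
    pieces = (piece.strip() for piece in re.split(pattern, text))
    return [piece for piece in pieces if piece]

s = "Walter was feeling anxious. He was diagnosed today. He probably is the best person I know."
result = split_on_stopwords(s, ["is", "the", "was"])
print(result)
# ['Walter', 'feeling anxious', 'He', 'diagnosed today', 'He probably', 'best person I know']
```

Filtering out empty strings also removes the need for result[:-1], so the code does not depend on the text ending with a period. Matching is case-sensitive here, so a capitalized "The" would be kept; passing flags=re.IGNORECASE to re.split() is one way to change that.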