python tokenize

Python: keep apostrophe with verbs


I would like to tokenize a list of sentences, but keep negated verbs as single tokens.

from nltk.tokenize import word_tokenize

t = """As aren't good. Bs are good"""
print(word_tokenize(t))
['As', 'are', "n't", 'good', '.', 'Bs', 'are', 'good']

I would like "aren't" to stay a single token, distinct from "are". With word_tokenize it is split into "are" and "n't". The same happens with other negated forms (couldn't, didn't, etc.).

How can I do it? Thanks in advance


Solution

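  • If you want to extract individual words from a whitespace-separated sentence, use Python's split() method. It leaves apostrophes alone, so "aren't" stays in one piece. Note that punctuation also stays attached to the neighbouring word ('good.' in the output below).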

    t = "As aren't good. Bs are good"
    print(t.split())
    ['As', "aren't", 'good.', 'Bs', 'are', 'good']
    

    You can also pass a delimiter to split(). For example, to split the string on full stops instead of whitespace:

    print(t.split("."))
    ["As aren't good", ' Bs are good']
    

    See the Python documentation for str.split() for details.
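If the punctuation should also come out as separate tokens, as in the word_tokenize output from the question, one possible approach is a regular expression instead of split(). The following is only a minimal sketch: the function name tokenize_keep_contractions and the pattern are assumptions made for illustration, not part of NLTK, and the pattern only handles the straight apostrophe ('), not typographic quotes.

```python
import re

# Hypothetical helper (not from NLTK): a word may contain internal
# apostrophes ("aren't", "couldn't"); any other non-space character
# that is not a word character becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize_keep_contractions(text):
    """Split text into words and punctuation, keeping contractions whole."""
    return TOKEN_PATTERN.findall(text)


t = "As aren't good. Bs are good"
print(tokenize_keep_contractions(t))
# ['As', "aren't", 'good', '.', 'Bs', 'are', 'good']
```

Compared with split(), this separates the full stop from 'good' while still keeping "aren't" distinct from "are".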