python nlp nltk tokenize

Keep trailing punctuation in Python nltk.word_tokenize


There's a ton of material available about removing punctuation, but I can't seem to find anything about keeping it.

If I do:

from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P', '.']

the last "." is pushed into its own token. However, if instead there is another word at the end, the last "." is preserved:

from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P. Another Co"
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', 'Another', 'Co']

I'd like this to always behave as in the second case. For now, I'm hackishly doing:

from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str + " |||")

since I feel pretty confident in throwing away "|||" at any given time, but I don't know what other punctuation I might want to preserve that could get dropped. Is there a better way to accomplish this?


Solution

  • Could you use re?

    import re
    
    test_str = "Some Co Inc. Other Co L.P."
    
    # Python 3: print is a function; a raw string keeps \s a regex escape
    print(re.split(r'\s', test_str))
    

    This splits the input string on whitespace, so punctuation stays attached to the word it follows.
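One thing to watch with splitting on a single whitespace character is that consecutive spaces produce empty strings in the result. A minimal sketch of a variant that avoids this, assuming plain whitespace tokenization is acceptable for your data, is to match runs of non-whitespace instead. The helper name `whitespace_tokens` is made up for illustration and is not part of NLTK or `re`:

```python
import re


def whitespace_tokens(text):
    """Return every run of non-whitespace characters in text, in order."""
    # \S+ matches one or more non-whitespace characters, so repeated
    # spaces, tabs, or newlines never yield empty tokens.
    return re.findall(r"\S+", text)


# Note the double space after "Inc." in this input.
test_str = "Some Co Inc.  Other Co L.P."

print(re.split(r"\s", test_str))    # contains an empty string for the extra space
print(whitespace_tokens(test_str))  # no empty strings, "L.P." kept whole
```

Keep in mind that this keeps all punctuation attached, so "Co," would stay a single token with its comma. If you need NLTK's finer splitting elsewhere in the string, this is probably not a drop-in replacement for `word_tokenize`.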