There's a ton available about removing punctuation, but I can't seem to find anything about keeping it.
If I do:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P', '.']
The last "." is pushed into its own token. However, if there is another word at the end, the last "." stays attached to "L.P.":
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P. Another Co"
word_tokenize(test_str)
Out[1]: ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.', 'Another', 'Co']
I'd like this to always behave as in the second case. For now, I'm hackishly doing:
from nltk import word_tokenize
test_str = "Some Co Inc. Other Co L.P."
word_tokenize(test_str + " |||")
since I feel pretty confident in throwing away "|||" at any given time, but I don't know what other punctuation I might want to preserve that could get dropped. Is there a better way to accomplish this?
Could you use re?
import re
test_str = "Some Co Inc. Other Co L.P."
# Split on runs of whitespace; the raw string keeps \s from being treated as a string escape
print(re.split(r'\s+', test_str))
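This splits the input string on whitespace only, so punctuation stays attached to its word: the result is ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.']. Note that, unlike word_tokenize, it will not separate punctuation such as commas or parentheses from the words they touch.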
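If the input might contain leading, trailing, or repeated whitespace, a slightly more defensive variant is sketched below. The helper name split_keep_punct is made up for illustration; it collects runs of non-whitespace characters with re.findall instead of splitting, so it should not produce empty strings.

```python
import re

def split_keep_punct(text):
    # Hypothetical helper: return every run of non-whitespace characters,
    # so punctuation stays attached to the word it touches.
    return re.findall(r"\S+", text)

test_str = "  Some Co Inc.  Other Co L.P. "
print(split_keep_punct(test_str))
# ['Some', 'Co', 'Inc.', 'Other', 'Co', 'L.P.']
```

For plain whitespace tokenization like this, test_str.split() with no arguments should give the same list without needing re at all.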