I have a text file. I need to get a list of sentences.
How can this be implemented? There are a lot of subtleties, such as a dot being used in abbreviations.
My old regular expression works badly:
re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)
The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates this does it:
import nltk.data
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print '\n-----\n'.join(tokenizer.tokenize(data))
(I haven't tried it!)