
Sentence segmentation using nltk in big text files


I need to use nltk.sent_tokenize() to extract sentences from big text files. File sizes vary from 1 MB to 400 MB, so it's not possible to load a file entirely because of memory limits, and I don't think nltk.sent_tokenize() can be used while reading a file line by line.

What do you suggest for this task?


Solution

  • Did you try just using the corpus reader? The nltk corpus readers are designed to deliver text incrementally, reading large blocks from disk behind the scenes rather than loading entire files into memory. So just point a PlaintextCorpusReader at your corpus directory, and it ought to deliver the whole corpus sentence by sentence without any shenanigans. For example:

    import nltk

    # Point the reader at the corpus directory; files are read lazily as needed.
    reader = nltk.corpus.reader.PlaintextCorpusReader("path/to/corpus", r".*\.txt")
    for sent in reader.sents():  # each sentence is a list of word tokens
        if "shenanigans" in sent:
            print(" ".join(sent))