Tags: python, nltk, tokenize

How to parse a file sentence by sentence in Python


I need to read a large number of large text files.

For each file, I need to open it and read in text sentence by sentence.

Most of the approaches I found read the file line by line.

How can I do it with Python?


Solution

  • If you want sentence tokenization, nltk is probably the quickest way to get it. Its Punkt sentence tokenizer (http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt) will get you pretty far.

    For example, this code from the docs:

    >>> import nltk.data
    >>> text = '''
    ... Punkt knows that the periods in Mr. Smith and Johann S. Bach
    ... do not mark sentence boundaries.  And sometimes sentences
    ... can start with non-capitalized words.  i is a good variable
    ... name.
    ... '''
    >>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    >>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
    
    
    Punkt knows that the periods in Mr. Smith and Johann S. Bach
    do not mark sentence boundaries.
    -----
    And sometimes sentences
    can start with non-capitalized words.
    -----
    i is a good variable
    name.
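
The docs example tokenizes a string that is already in memory, while the question is about large files. One possible way to bridge the two is sketched below. It assumes that no sentence spans a blank line, so the file can be consumed one paragraph at a time instead of being read whole. The names `iter_paragraphs`, `iter_sentences` and `naive_split` are hypothetical helpers made up for this sketch, not part of nltk, and `naive_split` is only a crude regex stand-in so the snippet runs without nltk installed.

```python
import io
import re
from typing import Callable, Iterable, Iterator, List


def naive_split(text: str) -> List[str]:
    """Crude stand-in splitter (hypothetical helper, not nltk).

    Splits after '.', '!' or '?' followed by whitespace. Unlike Punkt,
    it will wrongly split on abbreviations such as 'Mr.'.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def iter_paragraphs(lines: Iterable[str]) -> Iterator[str]:
    """Group lines into paragraphs separated by blank lines."""
    buffer: List[str] = []
    for line in lines:
        if line.strip():
            buffer.append(line.strip())
        elif buffer:
            # A blank line closes the current paragraph.
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # Flush the last paragraph if the input does not end with a blank line.
        yield " ".join(buffer)


def iter_sentences(
    lines: Iterable[str],
    tokenize: Callable[[str], List[str]] = naive_split,
) -> Iterator[str]:
    """Yield sentences one at a time, tokenizing paragraph by paragraph.

    Assumption: sentences never cross a blank line.
    """
    for paragraph in iter_paragraphs(lines):
        yield from tokenize(paragraph)


# Small in-memory demo; an open file object works the same way,
# because iterating over it also yields one line at a time.
demo = list(iter_sentences(io.StringIO(
    "First sentence. Second\nsentence here!\n\nThird one?\n"
)))
```

To use Punkt instead of the stand-in, pass the tokenizer from the docs example as the `tokenize` argument, for instance `iter_sentences(f, tokenize=sent_detector.tokenize)` inside a `with open(path, encoding='utf-8') as f:` block. Note that lines within a paragraph are joined with spaces here, so the original line breaks inside a sentence are not preserved.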