Tags: python, nlp, newline, nltk, line-breaks

Preserve empty lines with NLTK's Punkt Tokenizer


I'm using NLTK's Punkt sentence tokenizer to split a file into a list of sentences, and would like to preserve the empty lines within the file:

from nltk import data
tokenizer = data.load('tokenizers/punkt/english.pickle')
s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
sentences = tokenizer.tokenize(s)
print(sentences)

I would like this to print:

['That was a very loud beep.\n\n', "I don't even know\n if this is working.", 'Mark?\n\n', 'Mark are you there?\n\n\n']

But the content that's actually printed shows that the trailing empty lines have been removed from the first and third sentences:

['That was a very loud beep.', "I don't even know\n if this is working.", 'Mark?', 'Mark are you there?\n\n\n']

Other tokenizers in NLTK have a blanklines='keep' parameter, but I don't see any such option in the case of the Punkt tokenizer. It's very possible I'm missing something simple. Is there a way to retain these trailing empty lines using the Punkt sentence tokenizer? I'd be grateful for any insights others can offer!


Solution

  • The problem

    Sadly, you can't make the tokenizer keep the blank lines, not with the way it is written.

    Starting from the tokenizer's source and following the function calls through span_tokenize() and _slices_from_text(), you can see there is a condition

    if match.group('next_tok'):

    that is designed to ensure the tokenizer skips whitespace until the next possible sentence-starting token occurs. Looking for the regex this refers to, we end up at _period_context_fmt, where we see that the next_tok named group is preceded by \s+, which consumes the whitespace without capturing it, so blank lines are lost.

    The solution

    Break it down, change the part that you don't like, reassemble your custom solution.

    Now this regex is in the PunktLanguageVars class, itself used to initialize the PunktSentenceTokenizer class. We just have to derive a custom class from PunktLanguageVars and fix the regex the way we want it to be.

    The fix we want is to include trailing newlines at the end of a sentence, so I suggest replacing the _period_context_fmt, going from this:

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            \s+(?P<next_tok>\S+)     # or whitespace and some other token
        ))"""
    

    to this:

    _period_context_fmt = r"""
        \S*                          # some word material
        %(SentEndChars)s             # a potential sentence ending
        \s*                       #  <-- THIS is what I changed
        (?=(?P<after_tok>
            %(NonWord)s              # either other punctuation
            |
            (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
        ))"""
    

    Now a tokenizer using this regex instead of the old one will include 0 or more \s characters after the end of a sentence.

    The whole script

    import nltk.tokenize.punkt as pkt
    
    class CustomLanguageVars(pkt.PunktLanguageVars):
    
        _period_context_fmt = r"""
            \S*                          # some word material
            %(SentEndChars)s             # a potential sentence ending
            \s*                       #  <-- THIS is what I changed
            (?=(?P<after_tok>
                %(NonWord)s              # either other punctuation
                |
                (?P<next_tok>\S+)     #  <-- Normally you would have \s+ here
            ))"""
    
    custom_tknzr = pkt.PunktSentenceTokenizer(lang_vars=CustomLanguageVars())
    
    s = "That was a very loud beep.\n\n I don't even know\n if this is working. Mark?\n\n Mark are you there?\n\n\n"
    
    print(custom_tknzr.tokenize(s))
    

    This outputs:

    ['That was a very loud beep.\n\n ', "I don't even know\n if this is working. ", 'Mark?\n\n ', 'Mark are you there?\n\n\n']
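    An alternative that avoids subclassing: keep the stock tokenizer and use its span_tokenize() method (part of the public Punkt API) to locate each sentence, then slice the original string so that the whitespace following a sentence stays attached to it. A minimal sketch of the slicing helper; the tokenizer argument is anything exposing span_tokenize() returning (start, end) pairs:

    ```python
    def tokenize_with_whitespace(text, tokenizer):
        """Split text into sentences, attaching the whitespace that
        follows each sentence to that sentence, so trailing blank
        lines survive."""
        spans = list(tokenizer.span_tokenize(text))
        if not spans:
            return []
        # Extend each sentence up to the start of the next one, and
        # the last sentence up to the end of the string.
        starts = [start for start, _ in spans]
        ends = starts[1:] + [len(text)]
        return [text[a:b] for a, b in zip(starts, ends)]
    ```

    With the Punkt tokenizer loaded as in the question, tokenize_with_whitespace(s, tokenizer) yields the sentences with their trailing newlines attached. Note that any whitespace before the first sentence is dropped, since slicing begins at the first sentence's start.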