python-3.x nltk tokenize

Retrieving spans from the NLTK sentence tokenizer


I want to retrieve the spans from the basic NLTK sentence tokenizer. I know it's doable with the pst tokenizer, but the basic tokenizer does a better job. Is it possible to run the span_tokenize method on sent_tokenize?

from nltk import sent_tokenize
sentences = sent_tokenize(text)

Solution

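  • Assuming you want the spans of the words within each sentence (offsets are relative to the start of each sentence, not to the whole text):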

    from nltk.tokenize import WhitespaceTokenizer as wt
    from nltk import sent_tokenize
    sentences = sent_tokenize("This is a sentence. This is another sentence. The sky is blue.")
    print(list(wt().span_tokenize_sents(sentences)))
    

    output:

    [[(0, 4), (5, 7), (8, 9), (10, 19)], [(0, 4), (5, 7), (8, 15), (16, 25)], [(0, 3), (4, 7), (8, 10), (11, 16)]]
    

    See https://www.nltk.org/api/nltk.tokenize.html. Search for span_tokenize_sents.
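
If what you need is the span of each sentence in the original text rather than the word spans, one option is to recover the offsets from the sent_tokenize output by searching the text. This is a minimal sketch, not an NLTK feature: it assumes every returned sentence is an exact substring of the input, the helper name sentence_spans is made up for this example, and the sentence list is hard-coded so the snippet runs without the Punkt data.

```python
def sentence_spans(text, sentences):
    """Return (start, end) character offsets of each sentence within text."""
    spans = []
    cursor = 0  # search only past the previous sentence so repeated sentences map correctly
    for sentence in sentences:
        start = text.find(sentence, cursor)
        if start == -1:
            # The assumption that sentences are exact substrings does not hold
            raise ValueError(f"sentence not found in text: {sentence!r}")
        end = start + len(sentence)
        spans.append((start, end))
        cursor = end
    return spans


text = "This is a sentence. This is another sentence. The sky is blue."
# In practice: sentences = sent_tokenize(text)  (needs NLTK and its Punkt data)
sentences = ["This is a sentence.", "This is another sentence.", "The sky is blue."]
spans = sentence_spans(text, sentences)
print(spans)  # [(0, 19), (20, 45), (46, 62)]
```

Each pair can be checked with text[start:end], which should give back the sentence itself. Adding a sentence's start offset to the word spans above would turn them into positions in the full text.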