Tags: python, nlp, nltk, spacy, sentence

Separate texts into sentences: NLTK vs spaCy


I want to separate texts into sentences.

Looking on Stack Overflow, I found:

WITH NLTK

from nltk.tokenize import sent_tokenize

# note: sent_tokenize needs NLTK's Punkt data, e.g. via nltk.download('punkt')
text = """Hello Mr. Smith, how are you doing today? The weather is great, and the city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text = sent_tokenize(text)
print(tokenized_text)

WITH SPACY

from spacy.lang.en import English # updated

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer')) # updated
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

The question is: what happens in the background that makes spaCy do it differently, with a so-called create_pipe? Sentences are important for training your own word embeddings for NLP. There should be a reason why spaCy does not include a sentence tokenizer directly out of the box.

Thanks.

NOTE: Be aware that a simple .split('.') does not work; there are several decimal numbers in the text and other kinds of tokens containing '.'
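
For example, with a sample string of my own, a naive split breaks on abbreviations and decimals:

text = "Hello Mr. Smith, the price is 3.50. Thanks."
print(text.split('.'))
# ['Hello Mr', ' Smith, the price is 3', '50', ' Thanks', '']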


Solution

  • By default, spaCy uses its dependency parser to do sentence segmentation, which requires loading a statistical model. The sentencizer is a rule-based sentence segmenter that you can use to define your own sentence segmentation rules without loading a model (a sketch of that route follows the parser-based example below).

    If you don't mind leaving the parser activated, you can use the following code:

    import spacy
    nlp = spacy.load('en_core_web_sm') # or whatever model you have installed
    raw_text = 'Hello, world. Here are two sentences.'
    doc = nlp(raw_text)
    sentences = [sent.text.strip() for sent in doc.sents]
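
    If you prefer not to load a statistical model at all, the rule-based sentencizer route could look roughly like the sketch below. Note that the API differs by version: in spaCy 3.x the pipe is added by name with nlp.add_pipe('sentencizer') (rather than through create_pipe), and sentence text is read from sent.text instead of sent.string:

    from spacy.lang.en import English

    # blank English pipeline: tokenizer only, no model download required
    nlp = English()
    # spaCy 3.x; in spaCy 2.x this was nlp.add_pipe(nlp.create_pipe('sentencizer'))
    nlp.add_pipe('sentencizer')

    raw_text = 'Hello, world. Here are two sentences.'
    doc = nlp(raw_text)
    sentences = [sent.text.strip() for sent in doc.sents]
    print(sentences)  # ['Hello, world.', 'Here are two sentences.']

    The sentencizer's boundary characters can also be customized via its punct_chars setting if the default punctuation rules don't fit your text.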