I want to separate texts into sentences.
Looking on Stack Overflow I found:
WITH NLTK
from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is awesome. The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)
WITH SPACY
from spacy.lang.en import English # updated
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
nlp.add_pipe(nlp.create_pipe('sentencizer')) # updated
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
The question is what happens in the background that makes spaCy do it differently, with a so-called create_pipe. Sentences are important for training your own word embeddings for NLP. There should be a reason why spaCy does not include a sentence tokenizer directly out of the box.
Thanks.
NOTE: Be aware that a simple .split('.') does not work; there are several decimal numbers in the text and other kinds of tokens containing '.'
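For example, with a small made-up string:

text = "Mr. Smith paid 3.50 dollars. Then he left."
print(text.split('.'))
# ['Mr', ' Smith paid 3', '50 dollars', ' Then he left', '']
# both the abbreviation "Mr." and the decimal 3.50 get split incorrectly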
By default, spaCy uses its dependency parser to do sentence segmentation, which requires loading a statistical model.
The sentencizer is a rule-based sentence segmenter that you can use to define your own sentence segmentation rules without loading a model.
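If you only need sentence boundaries and no model, a minimal sketch looks like this (spaCy 3.x syntax; the create_pipe call in your snippet is the older 2.x way of adding the same component):

from spacy.lang.en import English

nlp = English()               # blank English pipeline, no statistical model
nlp.add_pipe("sentencizer")   # spaCy 3.x; in 2.x: nlp.add_pipe(nlp.create_pipe('sentencizer'))

doc = nlp('Hello, world. Here are two sentences.')
print([sent.text.strip() for sent in doc.sents])
# ['Hello, world.', 'Here are two sentences.']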
If you don't mind leaving the parser activated, you can use the following code:
import spacy
nlp = spacy.load('en_core_web_sm') # or whatever model you have installed
raw_text = 'Hello, world. Here are two sentences.'
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
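Printing sentences here should also give ['Hello, world.', 'Here are two sentences.']; the difference is that the boundaries come from the dependency parse, which tends to handle abbreviations and decimal numbers better than the purely rule-based sentencizer.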