Tags: nlp, nltk, tokenize, text-processing, spacy

training sentence tokenizer in spaCy


I'm trying to tokenize sentences using spaCy.

The text includes lots of abbreviations and comments which end with a period. Also, the text was obtained with OCR, so there are sometimes line breaks in the middle of sentences. spaCy doesn't seem to perform well in these situations.

I have extracted some examples of how I want these sentences to be split. Is there any way to train spaCy's sentence tokenizer?


Solution

  • spaCy is a little unusual in that its default sentence segmentation comes from the dependency parser, so you can't train a sentence boundary detector directly as such. However, you can add your own custom component to the pipeline, or pre-insert boundaries that the parser will then respect (see the sketch below). See the documentation with examples: Spacy Sentence Segmentation

    For the cases you're describing, it would also be useful to be able to specify that a particular position is NOT a sentence boundary, but as far as I can tell that's not currently possible.
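
    As a rough illustration of the custom-component approach, here is a minimal sketch using the spaCy v3 registration API (in v2 you would pass the function object itself to add_pipe). The model name en_core_web_sm, the component name custom_boundaries, and the blank-line heuristic are all placeholder assumptions; you would swap in whatever rules fit your abbreviations and OCR artifacts.

        import spacy
        from spacy.language import Language

        @Language.component("custom_boundaries")
        def custom_boundaries(doc):
            # Placeholder heuristic: treat a blank line (a whitespace token
            # containing two or more newlines) as a hard sentence break.
            # Replace this test with rules suited to your own text.
            for token in doc[:-1]:
                if token.is_space and token.text.count("\n") >= 2:
                    doc[token.i + 1].is_sent_start = True
            return doc

        nlp = spacy.load("en_core_web_sm")  # model name is an assumption
        # Insert before the parser so the pre-set boundaries are respected.
        nlp.add_pipe("custom_boundaries", before="parser")

        doc = nlp("Measured approx. 5 cm.\n\nSee fig. 3 for details.")
        print([sent.text for sent in doc.sents])

    Because the component runs before the parser, the parser treats the pre-set is_sent_start values as fixed and only decides the remaining boundaries itself; forcing the opposite (is_sent_start = False) is exactly the part noted above as not currently possible.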