I'm using SpaCy to divide a text into sentences, match a regex pattern on each sentence, and use some logic based on the results of the match. I started with a naive approach such as:
import re
import spacy

nlp = spacy.load("en_core_web_trf")
regex = re.compile(r'\b(foo|bar)\b')

for text in texts_list:
    doc = nlp(text)
    for sent in doc.sents:
        if re.search(regex, str(sent)):
            [...]
        else:
            [...]
and it was very slow. Then I used a pipe:
for doc in nlp.pipe(texts_list, disable=['tagger', 'ner', 'attribute_ruler', 'lemmatizer'], n_process=4):
    for sent in doc.sents:
        if re.search(regex, str(sent)):
            [...]
        else:
            [...]
but it's still slow. Am I missing something?
A transformer model is overkill for splitting sentences and will be very slow. Instead, a good option is the fast senter from an sm model:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
nlp.enable_pipe("senter")
for doc in nlp.pipe(texts, n_process=4):
    ...
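To tie this back to your use case, here is a sketch of the regex loop on top of that pipeline, reusing the nlp and texts names from the snippet above (foo/bar is just your placeholder pattern):

import re

regex = re.compile(r'\b(foo|bar)\b')

for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        if regex.search(sent.text):
            ...  # sentence contains foo/bar
        else:
            ...  # no match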
The senter should work pretty well if your sentences end with punctuation. If you have a lot of run-on sentences without final punctuation, then the parser might do a better job. To run only the parser, keep the tok2vec and parser components from the original pipeline and don't enable the senter. The parser will be ~5-10x slower than the senter.
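A minimal sketch of that parser-only setup, assuming the same en_core_web_sm model and the texts list from above:

import spacy

# Keep tok2vec + parser; everything else is disabled. The parser sets
# sentence boundaries itself, so "senter" is left disabled (its default).
nlp = spacy.load(
    "en_core_web_sm",
    disable=["tagger", "attribute_ruler", "lemmatizer", "ner"],
)

for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...  # same regex logic as before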
If you need this to be even faster, you can use the rule-based sentencizer (start from a blank en model), which is typically a bit worse than the senter because it only splits on the provided punctuation symbols.
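If you go that route, a sketch with a blank pipeline (again assuming the texts list from above):

import spacy

# Blank English pipeline: tokenizer + rule-based sentencizer only.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...  # same regex logic as before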