I'm using SpaCy to divide a text into sentences, match a regex pattern on each sentence, and use some logic based on the results of the match. I started with a naive approach such as:
import re
import spacy

nlp = spacy.load("en_core_web_trf")
regex = re.compile(r'\b(foo|bar)\b')

for text in texts_list:
    doc = nlp(text)
    for sent in doc.sents:
        if re.search(regex, str(sent)):
            [...]
        else:
            [...]
and it was very slow. Then I used a pipe:
for doc in nlp.pipe(texts_list, disable=['tagger', 'ner', 'attribute_ruler', 'lemmatizer'], n_process=4):
    for sent in doc.sents:
        if re.search(regex, str(sent)):
            [...]
        else:
            [...]
but it's still slow. Am I missing something?
A transformer model is overkill for splitting sentences and will be very slow. Instead, a good option is the fast senter from an sm model:
import spacy
nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
nlp.enable_pipe("senter")
for doc in nlp.pipe(texts, n_process=4):
    ...
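To tie this back to your use case, here is a sketch of the regex loop on top of that pipeline, reusing the nlp and texts names from the snippet above (foo/bar is just your placeholder pattern):

import re

regex = re.compile(r'\b(foo|bar)\b')

for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        if regex.search(sent.text):
            ...  # sentence contains foo/bar
        else:
            ...  # no match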
The senter should work pretty well if your sentences end with punctuation. If you have a lot of run-on sentences without final punctuation, then the parser might do a better job. To run only the parser, keep the tok2vec and parser components from the original pipeline and don't enable the senter. The parser will be ~5-10x slower than the senter.
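A minimal sketch of that parser-only setup, assuming the same en_core_web_sm model and the texts list from above:

import spacy

# Keep tok2vec + parser; everything else is disabled. The parser sets
# sentence boundaries itself, so "senter" is left disabled (its default).
nlp = spacy.load(
    "en_core_web_sm",
    disable=["tagger", "attribute_ruler", "lemmatizer", "ner"],
)

for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...  # same regex logic as before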
If you need this to be even faster, you can use the rule-based sentencizer (start from a blank en model), which is typically a bit worse than the senter because it only splits on the provided punctuation symbols.
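If you go that route, a sketch with a blank pipeline (again assuming the texts list from above):

import spacy

# Blank English pipeline: tokenizer + rule-based sentencizer only.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

for doc in nlp.pipe(texts, n_process=4):
    for sent in doc.sents:
        ...  # same regex logic as before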