Tags: python, nlp, spacy, transformer-model, sentence

Fast filtering of sentences in spaCy


I'm using spaCy to split a text into sentences, match a regex pattern against each sentence, and apply some logic based on the result of the match. I started with a naive approach such as:

import re
import spacy

nlp = spacy.load("en_core_web_trf")
regex = re.compile(r'\b(foo|bar)\b')

for text in texts_list:
    doc = nlp(text)
    for sent in doc.sents:
        if regex.search(sent.text):
            [...]
        else:
            [...]

and it was very slow. Then I switched to nlp.pipe:

for doc in nlp.pipe(texts_list, disable=['tagger', 'ner', 'attribute_ruler', 'lemmatizer'], n_process=4):
    for sent in doc.sents:
        if regex.search(sent.text):
            [...]
        else:
            [...]

but it's still slow. Am I missing something?


Solution

  • A transformer model is overkill for splitting sentences and will be very slow. Instead, a good option is the fast senter from an sm model:

    import spacy
    nlp = spacy.load("en_core_web_sm", disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer", "ner"])
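    # "senter" ships with the sm pipeline but is disabled by default, so enable it explicitly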
    nlp.enable_pipe("senter")
    for doc in nlp.pipe(texts, n_process=4):
        ...
    

    The senter should work pretty well if your sentences end with punctuation. If you have a lot of run-on sentences without final punctuation, then the parser might do a better job. To run only the parser, keep the tok2vec and parser components from the original pipeline and don't enable the senter. The parser will be ~5-10x slower than the senter.
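
    A minimal sketch of that parser-only setup, assuming the same texts list as above; everything except tok2vec and parser is disabled, and senter stays disabled as it already is by default:

    import spacy

    # keep only tok2vec + parser; the parser sets sentence boundaries itself
    nlp = spacy.load(
        "en_core_web_sm",
        disable=["tagger", "attribute_ruler", "lemmatizer", "ner"],
    )
    for doc in nlp.pipe(texts, n_process=4):
        for sent in doc.sents:
            ...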

    If you need this to be even faster, you can use the rule-based sentencizer (start from a blank en model), which is typically a bit worse than the senter because it only splits on the provided punctuation symbols.
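
    A minimal sketch of that sentencizer setup, again assuming a texts list; the punct_chars config is optional and shown only to illustrate that the split characters can be customized:

    import spacy

    # blank English pipeline with only the rule-based sentencizer
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer", config={"punct_chars": [".", "!", "?"]})
    for doc in nlp.pipe(texts, n_process=4):
        for sent in doc.sents:
            ...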