
Correct multithreaded lemmatization using spaCy


I'm trying to multithread the lemmatization of my corpus using spaCy. Following the documentation, this is currently my approach:

import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'tagger'])

def lemmatize():
    for doc in nlp.pipe(corpus, batch_size=2, n_threads=10):
        yield ' '.join([token.lemma_ for token in doc])

new_corpus = list(lemmatize())

However, this takes the same amount of time with 10 threads as with 1 (running on 100,000 documents), suggesting that it is not actually multithreaded.

Is my implementation wrong?


Solution

  • The n_threads argument has been deprecated in newer versions of spaCy and doesn't do anything. See the note here: https://spacy.io/api/language#pipe

    Here's their example code for doing this with multi-processing instead:

    https://spacy.io/usage/examples#multi-processing
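
    In spaCy v2.2+ the built-in replacement is the n_process argument to nlp.pipe, which distributes work across OS processes instead of threads. A minimal sketch of the corrected code (the corpus here is a hypothetical two-sentence stand-in for the asker's 100,000 documents, and n_process=4 / batch_size=64 are arbitrary illustrative values, not tuned recommendations):

    ```python
    import spacy

    # Hypothetical stand-in corpus for illustration.
    corpus = [
        "The striped bats are hanging on their feet.",
        "The quick brown foxes were jumping over the lazy dogs.",
    ]

    # Note: in spaCy v3 the English lemmatizer depends on the tagger,
    # so unlike the original snippet, only the parser and NER are disabled.
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

    def lemmatize(texts):
        # n_process spawns worker processes; batch_size controls how many
        # docs each worker receives per batch (2 is far too small at scale).
        for doc in nlp.pipe(texts, batch_size=64, n_process=4):
            yield " ".join(token.lemma_ for token in doc)

    if __name__ == "__main__":  # guard required for multiprocessing on Windows/macOS
        new_corpus = list(lemmatize(corpus))
    ```

    Because processes (unlike threads) each pay a startup and serialization cost, multiprocessing only helps on reasonably large corpora; for small inputs a single process is often faster.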