I'm trying to multithread the lemmatization of my corpus using spaCy. Following the documentation, this is currently my approach:
```python
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner', 'tagger'])

def lemmatize():
    for doc in nlp.pipe(corpus, batch_size=2, n_threads=10):
        yield ' '.join([token.lemma_ for token in doc])

new_corpus = list(lemmatize())
```
However, this takes the same amount of time whether I use 10 threads or 1 (on 100,000 documents), which suggests it is not actually multithreaded.
Is my implementation wrong?
The `n_threads` argument has been deprecated in newer versions of spaCy and no longer does anything. See the note here: https://spacy.io/api/language#pipe
Here's their example code for doing this with multi-processing instead: