Tags: python, performance, perl, nlp, spacy

Lemmatize multiple MB of raw text with spaCy and Inline::Python in Perl. Why is this slow?


I'm working on an NLP project and need to lemmatize a huge number of tokens from raw input text files ranging from 10 MB to 300 MB, and I decided to use Inline::Python with spaCy for this task. The problem is that it's very slow. After lemmatizing, I build bags of words and feed them to a cosine-similarity module to classify texts from past years. Is there a way to process this faster (multi-processing, multi-threading), or is it the pipe to Python that is slow? My machine has an i9 CPU, 64 GB of RAM, an RTX 2080 Ti, and an NVMe SSD.

Here is the piece of code that lemmatizes some French text content and filters out stop words:

use Inline Python => <<'END_OF_PYTHON';

import spacy
from spacy.lang.fr.stop_words import STOP_WORDS as fr_stop

# The French model is loaded once, when the Perl program starts
nlp = spacy.load('fr_core_news_md')
nlp.max_length = 40000000

def lemmatizer(words):
    doc = nlp(words)
    # keep the lemma of each token, dropping French stop words
    return list(filter(lambda x: x not in list(fr_stop), list(map(lambda token: token.lemma_ , doc))))

END_OF_PYTHON

Unfortunately, there is no good French lemmatizer in Perl, and lemmatization raises my accuracy at classifying text files into the right categories by about 5%. That matters when you already get 90% correct without it. After this block, I only call the lemmatizer function from Perl, so the French spaCy model should not be reloaded each time (I think?).

I thought about creating one thread per file: I have 15 big text files to lemmatize, one per category for the recent years, but in my opinion I/O is the problem. Do you have any ideas? I can't show more code because there are about 1500 lines. The automatic classification takes about 1000 seconds for the smallest category (50-60 files from the current year), and the biggest category is 10x bigger than the smallest.


Solution

  • There are a number of speed improvements that you could try:

    1. Using yield (actually yield from) instead of constructing the list in memory before returning it. Also, I don't think you need to create a list from the results of map:
    def lemmatizer(words):
        doc = nlp(words)
        yield from filter(lambda x: x not in list(fr_stop), map(lambda token: token.lemma_, doc))
    
    2. Using a set instead of a list for containment checking:
    fr_stop = set(fr_stop)
    def lemmatizer(words):
        doc = nlp(words)
        yield from filter(lambda x: x not in fr_stop, map(lambda token: token.lemma_ , doc))
    

    These should help reduce both processing time and memory pressure.
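
    Putting the two suggestions together, the embedded Python block might look like the sketch below. This is a rough illustration, not tested through Inline::Python; fr_stop and nlp are the same names already defined in the question's Python block:

    fr_stop = set(fr_stop)  # build the lookup set once, outside the hot path

    def lemmatizer(words):
        doc = nlp(words)
        # stream lemmas lazily instead of materialising a list,
        # checking membership against the set rather than a rebuilt list
        yield from filter(lambda lemma: lemma not in fr_stop,
                          map(lambda token: token.lemma_, doc))

    If Inline::Python turns out not to pass Python generators back to Perl cleanly, you can keep the set-based lookup and simply wrap the same filter/map expression in list(...) before returning it; most of the gain comes from no longer rebuilding list(fr_stop) on every membership test.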