Search code examples
pythonspacylemmatization

Spacy lemmatizer issue/consistency


I'm currently using spaCy for NLP purpose (mainly lemmatization and tokenization). The model used is en-core-web-sm (2.1.0).

The following code is run to retrieve a list of words "cleansed" from a query

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(query)
list_words = []
for token in doc:
    if token.text != ' ':
        list_words.append(token.lemma_)

However I face a major issue, when running this code. For example, when the query is "processing of tea leaves". The result stored in list_words can be either ['processing', 'tea', 'leaf'] or ['processing', 'tea', 'leave'].

It seems that the result is not consistent. I cannot change my input/query (adding another word for context is not possible) and I really need to find the same result every time. I think the loading of the model may be the issue.

Why the result differ ? Can I load the model the "same" way everytime ? Did I miss a parameter to obtain the same result for ambiguous query ?

Thanks for your help


Solution

  • The issue was analysed by the spaCy team and they've come up with a solution. Here's the fix : https://github.com/explosion/spaCy/pull/3646

    Basically, when the lemmatization rules were applied, a set was used to return a lemma. Since a set has no ordering, the returned lemma could change in between python session.


    For example in my case, for the noun "leaves", the potential lemmas were "leave" and "leaf". Without ordering, the result was random - it could be "leave" or "leaf".