
Add rules to spaCy lemmatization


I am using spaCy lemmatization to preprocess texts.

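import spacy

# assuming an English pipeline such as en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = 'ups'
for token in nlp(doc):
    print(token.lemma_)
>> up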

I understand why spaCy removes the 's', but in this case it is important to me that it doesn't. Is there a way to add specific rules to spaCy, or do I have to use if statements outside the pipeline (which I would rather not do)?


Solution

  • For spaCy v2:

    Depending on whether you have a tagger, you can customize the rule-based lemmatizer exceptions or the lookup table:

    import spacy
    
    # original
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    # may be "up" or "ups" depending on exact version of spacy/model because it
    # depends on the POS tag
    assert nlp("ups")[0].lemma_ in ("ups", "up")
    
    # 1. Exception for rule-based lemmatizer (with tagger)
    
    # reload to start with a clean lemma cache
    nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
    # add an exception for "ups" when it is tagged as NOUN or VERB
    nlp.vocab.lookups.get_table("lemma_exc")["noun"]["ups"] = ["ups"]
    nlp.vocab.lookups.get_table("lemma_exc")["verb"]["ups"] = ["ups"]
    assert nlp("ups")[0].lemma_ == "ups"
    
    # 2. New entry for lookup lemmatizer (without tagger)
    
    nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
    nlp.vocab.lookups.get_table("lemma_lookup")["ups"] = "ups"
    assert nlp("ups")[0].lemma_ == "ups"
    

    If you are processing words in isolation, the tagger is not going to be very reliable (you might get NOUN, PROPN, or VERB for something like "ups"), so it may be easier to customize the lookup lemmatizer instead. The quality of the rule-based lemmas is better overall, but you need at least full phrases, preferably full sentences, to get reasonable results.
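
To make the trade-off between the two tables concrete, here is a minimal toy sketch in plain Python. It is not spaCy's implementation: the function names and the single suffix rule are hypothetical stand-ins, and only the table names (`lemma_exc`, `lemma_lookup`) mirror the ones used above. It shows that an exception keyed by POS only applies when the tagger assigns that POS, whereas a lookup entry applies regardless of the tag.

```python
# Toy model only; NOT spaCy's code. strip_plural_s, rule_based_lemma and
# lookup_lemma are hypothetical helpers written for illustration.

def strip_plural_s(word):
    """Stand-in suffix rule: drop a trailing 's' (but not 'ss')."""
    if len(word) > 2 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def rule_based_lemma(word, pos, lemma_exc):
    """Check the POS-specific exception table first, then apply the rule."""
    exceptions = lemma_exc.get(pos, {})
    if word in exceptions:
        return exceptions[word][0]
    return strip_plural_s(word)

def lookup_lemma(word, lemma_lookup):
    """POS-independent: return the table entry, or the word itself if absent."""
    return lemma_lookup.get(word, word)

# Exception registered for NOUN only, to show what happens on a different tag.
lemma_exc = {"noun": {"ups": ["ups"]}}
lemma_lookup = {"ups": "ups"}

for pos in ("noun", "verb"):
    print(pos, rule_based_lemma("ups", pos, lemma_exc))

print("lookup", lookup_lemma("ups", lemma_lookup))
```

In this sketch the exception is honored for "noun" but the suffix rule still fires for "verb", which is why the answer above registers the exception under both tags. The lookup entry needs no tag at all.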