I am using spaCy lemmatization to preprocess texts.
import spacy
# assuming an English model such as en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = 'ups'
for i in nlp(doc):
    print(i.lemma_)
>> up
I understand why spaCy removes the 's', but in this case it is important to me that it doesn't. Is there a way to add specific rules to spaCy, or do I have to use if statements outside the pipeline (which I would rather not do)?
For spaCy v2:
Depending on whether you have a tagger, you can customize the rule-based lemmatizer exceptions or the lookup table:
import spacy
# original
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# may be "up" or "ups" depending on exact version of spacy/model because it
# depends on the POS tag
assert nlp("ups")[0].lemma_ in ("ups", "up")
# 1. Exception for rule-based lemmatizer (with tagger)
# reload to start with a clean lemma cache
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# add an exception for "ups" when it is tagged as NOUN or VERB
nlp.vocab.lookups.get_table("lemma_exc")["noun"]["ups"] = ["ups"]
nlp.vocab.lookups.get_table("lemma_exc")["verb"]["ups"] = ["ups"]
assert nlp("ups")[0].lemma_ == "ups"
# 2. New entry for lookup lemmatizer (without tagger)
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
nlp.vocab.lookups.get_table("lemma_lookup")["ups"] = "ups"
assert nlp("ups")[0].lemma_ == "ups"
If you are processing words in isolation, the tagger is not going to be very reliable (you might get NOUN, PROPN, or VERB for something like "ups"), so it might be easier to deal with customizing the lookup lemmatizer. The quality of the rule-based lemmas is better overall, but you need at least full phrases, preferably full sentences, to get reasonable results.
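To make the difference between the two approaches concrete, here is a minimal pure-Python sketch of the idea. It is a toy model, not spaCy's implementation: the names LEMMA_RULES, LEMMA_EXC, LEMMA_LOOKUP, rule_lemma and lookup_lemma are hypothetical, and the single suffix rule is a stand-in for the real rule tables. It only shows why a per-POS exception depends on the tag the tagger happens to assign, while a lookup entry does not.

```python
# Toy illustration only: these tables and functions are hypothetical and
# are NOT spaCy's actual data structures or implementation.
LEMMA_RULES = {
    "noun": [("s", "")],  # strip a plural "s"
    "verb": [("s", "")],  # strip a third-person "s"
}
LEMMA_EXC = {"noun": {}, "verb": {}}  # per-POS exceptions, checked first
LEMMA_LOOKUP = {"ups": "up"}          # one POS-independent table


def rule_lemma(word, pos):
    """Rule-based style: an exception for this POS wins over suffix rules."""
    exceptions = LEMMA_EXC.get(pos, {})
    if word in exceptions:
        return exceptions[word][0]
    for suffix, replacement in LEMMA_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def lookup_lemma(word):
    """Lookup style: a plain dictionary lookup, falling back to the word."""
    return LEMMA_LOOKUP.get(word, word)


# Before customization both approaches strip the "s".
before = (rule_lemma("ups", "noun"), lookup_lemma("ups"))

# Rule-based: one exception is needed for every POS the tagger might assign.
LEMMA_EXC["noun"]["ups"] = ["ups"]
LEMMA_EXC["verb"]["ups"] = ["ups"]

# Lookup: a single entry covers the word regardless of its tag.
LEMMA_LOOKUP["ups"] = "ups"

after = (rule_lemma("ups", "noun"), rule_lemma("ups", "verb"), lookup_lemma("ups"))
print(before, after)
```

In this sketch, forgetting one of the POS-specific exceptions would leave "ups" lemmatized as "up" whenever the tagger picks that tag, which is the practical reason the lookup table can be easier to customize for isolated words.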