Search code examples
nlpspacylemmatization

Adding a lemma for a new word and the concept of normalization/lemmatization in spaCy


Following the examples from documentation regarding tokenization I have the following code:

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_sm")
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

doc = nlp("gimme that. he gave me that. Going to someplace.")

Then I check the tokenization

doc[0].norm_  # 'give'  (as expected)

But the lemmatizer does not return the same output

lemmatizer = nlp.get_pipe("lemmatizer")
lemmatizer.lemmatize(doc[0])  # ['gim']  (expected ['give']

In other hand

lemmatizer.lemmatize(doc[5]) # ['give']
lemmatizer.lemmatize(doc[9]) # [go']

What I'm doing wrong? How to "fix"? In spaCy what is the difference between normalized tokens and lemmatized tokens? How can I "teach" the lemmatization of a single token (as this gim token in example) ?


Solution

  • In your code you've customized the tokenizer to handle the special case "gimme" and normalize it to "give.

    Here's how you can achieve consistent lemmatization results with your custom normalization

    import spacy
    from spacy.language import Language
    from spacy.symbols import ORTH, NORM
            
    nlp = spacy.load("en_core_web_sm")
    special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
    nlp.tokenizer.add_special_case("gimme", special_case)
            
    # Define a custom lemmatization function
    @Language.component(name="custom_lemmatizer")
    def custom_lemmatizer_function(doc):
        for token in doc:
            if token.norm_ == "give":
                token.lemma_ = "give"
        # Add more custom rules for other words if needed
        return doc
            
    # Add the custom lemmatizer to the pipeline
    nlp.add_pipe("custom_lemmatizer", name="custom_lemmatizer", after="lemmatizer")
            
    doc = nlp("gimme that. he gave me that. Going to someplace.")
    print(doc[0].lemma_)  # 'give' (as expected)
    print(doc[5].lemma_)  # 'give' (as expected)
    print(doc[9].lemma_)  # 'go' (as expected)