Tags: nlp, pipeline, spacy

How to use spaCy's built-in lemmatiser in a spaCy pipeline?


I want to use lemmatisation, but I can't see in the docs how to use spaCy's built-in lemmatisation in a pipeline.

In the docs for the lemmatiser, it says:

Initialize a Lemmatizer. Typically, this happens under the hood within spaCy when a Language subclass and its Vocab is initialized.

Does this mean the built-in lemmatisation process is an unmentioned part of the pipeline?

It's mentioned in the docs under the pipeline subheading, while the docs for pipeline usage only mention "custom lemmatisation" and how to use it.

This is all kind of conflicting information.


Solution

  • Does this mean the built-in lemmatisation process is an unmentioned part of the pipeline?

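    In short, yes. The Lemmatizer is loaded when the Language and Vocab are loaded, so token.lemma_ is available on every token without adding a lemmatisation step to the pipeline yourself.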

    Usage example:

    import spacy

    # Load a pretrained English pipeline; lemmatisation comes with it.
    nlp = spacy.load('en_core_web_sm')
    doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")
    print('\n')
    print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
    for token in doc:
        # Print the text, coarse POS tag, fine-grained tag, dependency label and lemma
        print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))
    
    

    Output:

    Token Attributes: 
     token.text, token.pos_, token.tag_, token.dep_, token.lemma_
    Apples      NOUN        NNS         nsubj       apple       
    and         CCONJ       CC          cc          and         
    oranges     NOUN        NNS         conj        orange      
    are         AUX         VBP         ROOT        be          
    similar     ADJ         JJ          acomp       similar     
    .           PUNCT       .           punct       .           
    Boots       NOUN        NNS         nsubj       boot        
    and         CCONJ       CC          cc          and         
    hippos      NOUN        NN          conj        hippos      
    are         AUX         VBP         ROOT        be          
    n't         PART        RB          neg         not         
    .           PUNCT       .           punct       .      
    

    Check out this thread as well; it has some interesting information on the speed of lemmatization.
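
To see for yourself whether lemmatisation shows up as a named pipeline step, you can print the component names of the loaded pipeline next to the lemmas. The following is only a minimal sketch: `lemma_pairs` and `demo` are hypothetical helper names, not part of spaCy, and `demo` assumes that spaCy and the `en_core_web_sm` model are installed. As far as I know, spaCy v2 pipelines do not list a lemmatizer in `nlp.pipe_names` because lemmas are filled in under the hood, while spaCy v3 pretrained pipelines list a separate `lemmatizer` component.

```python
from types import SimpleNamespace


def lemma_pairs(doc):
    """Return (text, lemma) pairs for every token in a spaCy Doc-like iterable."""
    return [(token.text, token.lemma_) for token in doc]


def demo(model_name="en_core_web_sm"):
    """Print the pipeline components and the lemmas for a short sentence.

    Assumes spaCy and the named model are installed.
    """
    import spacy  # imported here so lemma_pairs works without spaCy installed

    nlp = spacy.load(model_name)
    # pipe_names lists the explicitly registered pipeline components
    print(nlp.pipe_names)
    for text, lemma in lemma_pairs(nlp("Apples and oranges are similar.")):
        print(f"{text:<12}{lemma}")


# Stand-in tokens (not real spaCy objects) so the helper can be tried
# without downloading a model.
fake_doc = [
    SimpleNamespace(text="Apples", lemma_="apple"),
    SimpleNamespace(text="are", lemma_="be"),
]
print(lemma_pairs(fake_doc))  # [('Apples', 'apple'), ('are', 'be')]
```

Calling `demo()` in an environment with the model installed prints the component list first, which makes it easy to compare against what the docs describe for your spaCy version.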