I want to use lemmatisation, but I can't see in the docs how to use spaCy's built-in lemmatisation in a pipeline.
In the docs for the lemmatiser, it says:
Initialize a Lemmatizer. Typically, this happens under the hood within spaCy when a Language subclass and its Vocab is initialized.
Does this mean the built-in lemmatisation process is an unmentioned part of the pipeline?
It's mentioned in the docs under the pipeline subheading, while the docs for pipeline usage only mention "custom lemmatisation" and how to use it.
This is all somewhat conflicting information.
Does this mean the built-in lemmatisation process is an unmentioned part of the pipeline?
Simply, yes. The Lemmatizer is loaded when the Language and Vocab are loaded.
Usage example:
import spacy

# Load the small English model; the lemmatizer is set up as part of loading
nlp = spacy.load('en_core_web_sm')
doc = nlp("Apples and oranges are similar. Boots and hippos aren't.")

print('\n')
print("Token Attributes: \n", "token.text, token.pos_, token.tag_, token.dep_, token.lemma_")
for token in doc:
    # Print the text, coarse POS tag, fine-grained tag, dependency label and lemma
    print("{:<12}{:<12}{:<12}{:<12}{:<12}".format(token.text, token.pos_, token.tag_, token.dep_, token.lemma_))
Output:
Token Attributes:
token.text, token.pos_, token.tag_, token.dep_, token.lemma_
Apples      NOUN        NNS         nsubj       apple
and         CCONJ       CC          cc          and
oranges     NOUN        NNS         conj        orange
are         AUX         VBP         ROOT        be
similar     ADJ         JJ          acomp       similar
.           PUNCT       .           punct       .
Boots       NOUN        NNS         nsubj       boot
and         CCONJ       CC          cc          and
hippos      NOUN        NN          conj        hippos
are         AUX         VBP         ROOT        be
n't         PART        RB          neg         not
.           PUNCT       .           punct       .
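If you want to check whether lemmatisation shows up as a named pipeline component in your installed version, you can inspect nlp.pipe_names next to the lemmas. This is only a minimal sketch: inspect_lemmatisation is a hypothetical helper name (not part of spaCy), and it assumes the en_core_web_sm model from the example above. Depending on the spaCy version, a lemmatizer entry may or may not appear in the list, but token.lemma_ is filled in either way.

```python
import importlib.util


def inspect_lemmatisation(text, model="en_core_web_sm"):
    """Return (pipe_names, [(token, lemma), ...]).

    Hypothetical helper for illustration only. Returns None when spaCy
    or the requested model package is not installed.
    """
    if importlib.util.find_spec("spacy") is None:
        return None  # spaCy itself is not installed
    import spacy

    try:
        nlp = spacy.load(model)
    except OSError:
        return None  # the model package is not installed
    doc = nlp(text)
    # No explicit lemmatizer setup is needed: lemmas are available on each token
    return nlp.pipe_names, [(token.text, token.lemma_) for token in doc]


result = inspect_lemmatisation("Apples and oranges are similar.")
if result is not None:
    pipe_names, pairs = result
    print("Pipeline components:", pipe_names)
    for text, lemma in pairs:
        print("{:<12}{:<12}".format(text, lemma))
```

The helper returns None instead of raising, so it is safe to run on a machine where the model has not been downloaded yet.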
Check out this thread as well; it has some interesting information about the speed of lemmatization.