Does anyone have a chronological list of the operations performed by the following?
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
I can see the major components with nlp.pipe_names:
['tagger', 'parser', 'ner']
and an alphabetical list of factory operations with nlp.factories:
{'merge_entities': <function spacy.language.Language.<lambda>>,
'merge_noun_chunks': <function spacy.language.Language.<lambda>>,
'ner': <function spacy.language.Language.<lambda>>,
'parser': <function spacy.language.Language.<lambda>>,
'sbd': <function spacy.language.Language.<lambda>>,
'sentencizer': <function spacy.language.Language.<lambda>>,
'similarity': <function spacy.language.Language.<lambda>>,
'tagger': <function spacy.language.Language.<lambda>>,
'tensorizer': <function spacy.language.Language.<lambda>>,
'textcat': <function spacy.language.Language.<lambda>>,
'tokenizer': <function spacy.language.Language.<lambda>>}
but I can't figure out when the lemmatizer is invoked. Lemmatization has to happen after tokenization and POS tagging, and it still runs when the parser and ner are disabled, yet the spaCy pipeline docs don't mention it at all. Thanks!
For a more recent answer, the pipeline design as of spaCy v3.4 is explained in the spaCy documentation. I've reproduced the important parts below in case that page changes or moves.
The spaCy v3 trained pipelines are designed to be efficient and configurable. For example, multiple components can share a common “token-to-vector” model and it’s easy to swap out or disable the lemmatizer. The pipelines are designed to be efficient in terms of speed and size and work well when the pipeline is run in full.
When modifying a trained pipeline, it’s important to understand how the components depend on each other. Unlike spaCy v2, where the tagger, parser and ner components were all independent, some v3 components depend on earlier components in the pipeline. As a result, disabling or reordering components can affect the annotation quality or lead to warnings and errors.
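To see the order that actually applies to a loaded pipeline, and what happens when components are disabled, something like the following may help. It is only a sketch, assuming spaCy v3 with en_core_web_sm installed. The helper names pipeline_order and main are made up for this illustration and are not part of spaCy.

```python
def pipeline_order(nlp):
    """Return component names in the order nlp(text) applies them.

    The tokenizer always runs first and is not listed in nlp.pipe_names,
    so it is prepended here. Disabled components are not in pipe_names.
    """
    return ["tokenizer", *nlp.pipe_names]


def main(model="en_core_web_sm"):
    # Imported here so pipeline_order() can be used without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    print(pipeline_order(nlp))

    # Temporarily skip the parser and ner; the components the lemmatizer
    # relies on stay enabled.
    with nlp.select_pipes(disable=["parser", "ner"]):
        print(pipeline_order(nlp))
        doc = nlp("The cats were sitting on the mats.")
        print([(token.text, token.pos_, token.lemma_) for token in doc])
```

Calling main() prints the full order, then the reduced order and the lemmas produced with the parser and ner switched off. The components are restored when the with block exits.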
In the sm/md/lg models:
- The tagger, morphologizer and parser components listen to the tok2vec component. If the lemmatizer is trainable (v3.3+), lemmatizer also listens to tok2vec.
- The attribute_ruler maps token.tag to token.pos if there is no morphologizer. The attribute_ruler additionally makes sure whitespace is tagged consistently and copies token.pos to token.tag if there is no tagger. For English, the attribute ruler can improve its mapping from token.tag to token.pos if dependency parses from a parser are present, but the parser is not required.
- The lemmatizer component for many languages requires token.pos annotation from either tagger+attribute_ruler or morphologizer. The ner component is independent with its own internal tok2vec layer.
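The dependencies in the list above can be written down as a small lookup table and checked against a proposed component order. This is a plain-Python sketch of the rules as quoted, not spaCy API. The names REQUIRES and unmet are made up for illustration, and the table is a simplification (for example, it leaves out the trainable lemmatizer's link to tok2vec).

```python
# Requirements for the sm/md/lg pipelines, taken from the list above.
# Each value is a list of alternatives; one whole alternative must
# appear earlier in the pipeline for the component to be satisfied.
REQUIRES = {
    "tagger": [{"tok2vec"}],
    "morphologizer": [{"tok2vec"}],
    "parser": [{"tok2vec"}],
    "lemmatizer": [{"tagger", "attribute_ruler"}, {"morphologizer"}],
    "ner": [],  # independent: has its own internal tok2vec layer
}


def unmet(pipe_names):
    """Return {component: alternatives} for components whose
    requirements are not satisfied by the components before them."""
    seen = set()
    problems = {}
    for name in pipe_names:
        options = REQUIRES.get(name, [])
        # A component is fine if it has no requirements, or if at least
        # one alternative is fully contained in what already ran.
        if options and not any(option <= seen for option in options):
            problems[name] = options
        seen.add(name)
    return problems


# Lemmatizer placed without attribute_ruler or morphologizer before it:
print(unmet(["tok2vec", "tagger", "lemmatizer"]))
```

With this table, an order such as tok2vec, tagger, parser, attribute_ruler, lemmatizer, ner reports no problems, while dropping attribute_ruler flags the lemmatizer.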
In the transformer (trf) models, the tagger, parser and ner (if present) all listen to the transformer component. The attribute_ruler and lemmatizer have the same configuration as in the CNN models.