Tags: python, nlp, spacy, lemmatization

spaCy nlp pipeline order of operations


Does anyone have a chronological list of the operations performed by the following?

import spacy
nlp = spacy.load('en_core_web_sm')
text = 'The cats were running.'  # placeholder input; any string works
doc = nlp(text)

I can see the major components with nlp.pipe_names

['tagger', 'parser', 'ner']

and an alphabetical list of factory operations with nlp.factories

{'merge_entities': <function spacy.language.Language.<lambda>>,
 'merge_noun_chunks': <function spacy.language.Language.<lambda>>,
 'ner': <function spacy.language.Language.<lambda>>,
 'parser': <function spacy.language.Language.<lambda>>,
 'sbd': <function spacy.language.Language.<lambda>>,
 'sentencizer': <function spacy.language.Language.<lambda>>,
 'similarity': <function spacy.language.Language.<lambda>>,
 'tagger': <function spacy.language.Language.<lambda>>,
 'tensorizer': <function spacy.language.Language.<lambda>>,
 'textcat': <function spacy.language.Language.<lambda>>,
 'tokenizer': <function spacy.language.Language.<lambda>>}

but I can't figure out when the lemmatizer is invoked. Lemmatization has to happen after tokenization and POS tagging, and it still runs when the parser and ner are disabled. The spaCy pipeline docs don't mention it at all. Thanks!


Solution

  • For a more recent answer: the pipeline design as of spaCy v3.4 is explained on the spaCy site, in the "Pipeline design" section of the trained pipelines documentation. I've reproduced the important parts below in case the link becomes invalid.

    The spaCy v3 trained pipelines are designed to be efficient and configurable. For example, multiple components can share a common “token-to-vector” model and it’s easy to swap out or disable the lemmatizer. The pipelines are designed to be efficient in terms of speed and size and work well when the pipeline is run in full.

    When modifying a trained pipeline, it’s important to understand how the components depend on each other. Unlike spaCy v2, where the tagger, parser and ner components were all independent, some v3 components depend on earlier components in the pipeline. As a result, disabling or reordering components can affect the annotation quality or lead to warnings and errors.
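
To see the run order concretely, here is a minimal sketch, assuming spaCy v3 is installed. It uses a blank English pipeline (tokenizer only, no trained model) and two hypothetical components, demo_first and demo_second, that only record when they are called. The tokenizer always runs first and is not listed in nlp.pipe_names; the remaining components then run in the order nlp.pipe_names reports.

```python
import spacy
from spacy.language import Language

calls = []  # records (component name, number of tokens seen) in call order


@Language.component("demo_first")  # hypothetical name, for illustration only
def demo_first(doc):
    # The doc is already tokenized when the first component receives it.
    calls.append(("demo_first", len(doc)))
    return doc


@Language.component("demo_second")  # hypothetical name, for illustration only
def demo_second(doc):
    calls.append(("demo_second", len(doc)))
    return doc


nlp = spacy.blank("en")  # tokenizer only, no trained components
nlp.add_pipe("demo_second")
# Position is controlled explicitly; insertion order alone does not decide it.
nlp.add_pipe("demo_first", before="demo_second")

doc = nlp("Cats were running.")

print(nlp.pipe_names)  # ['demo_first', 'demo_second']
print(calls)           # [('demo_first', 4), ('demo_second', 4)]
```

The same applies to a trained pipeline: printing nlp.pipe_names after spacy.load shows where the lemmatizer sits relative to the tagger and attribute_ruler.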

    CNN/CPU pipeline design

    [Figure: spaCy v3 pipeline]

    In the sm/md/lg models:

    • The tagger, morphologizer and parser components listen to the tok2vec component. If the lemmatizer is trainable (v3.3+), lemmatizer also listens to tok2vec.
    • The attribute_ruler maps token.tag to token.pos if there is no morphologizer. The attribute_ruler additionally makes sure whitespace is tagged consistently and copies token.pos to token.tag if there is no tagger. For English, the attribute ruler can improve its mapping from token.tag to token.pos if dependency parses from a parser are present, but the parser is not required.
    • The lemmatizer component for many languages requires token.pos annotation from either tagger+attribute_ruler or morphologizer.
    • The ner component is independent with its own internal tok2vec layer.
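
The dependencies in the list above can be sketched as a small ordering check. This is plain Python, not spaCy's API: the REQUIRES table and the unmet function are hypothetical, distilled by hand from the bullets, and they assume the rule-based lemmatizer that needs token.pos. The attribute_ruler is left without an entry because the text describes it as adapting to whichever of tagger and morphologizer is present.

```python
# Hypothetical table distilled from the bullets above (sm/md/lg pipelines).
# Each value is a list of alternatives; a component is satisfied when every
# component in at least one alternative appears earlier in the pipeline.
REQUIRES = {
    "tagger": [["tok2vec"]],
    "morphologizer": [["tok2vec"]],
    "parser": [["tok2vec"]],
    "lemmatizer": [["tagger", "attribute_ruler"], ["morphologizer"]],
    "ner": [[]],  # independent: has its own internal tok2vec layer
}


def unmet(pipe_names):
    """Return {component: alternatives} for components whose needs are unmet."""
    problems = {}
    for i, name in enumerate(pipe_names):
        earlier = set(pipe_names[:i])
        alternatives = REQUIRES.get(name, [[]])  # unknown components need nothing
        if not any(set(alt) <= earlier for alt in alternatives):
            problems[name] = alternatives
    return problems


# Parser and ner removed: the lemmatizer's requirements are still met.
print(unmet(["tok2vec", "tagger", "attribute_ruler", "lemmatizer"]))
# {}

# Tagger and attribute_ruler removed: the lemmatizer has no token.pos source.
print(unmet(["tok2vec", "parser", "ner", "lemmatizer"]))
# {'lemmatizer': [['tagger', 'attribute_ruler'], ['morphologizer']]}
```

This matches the observation in the question that lemmatization still works with the parser and ner disabled. For a loaded pipeline, spaCy v3 also provides nlp.analyze_pipes(), which reports what each component assigns and requires.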

    Transformer pipeline design

    In the transformer (trf) models, the tagger, parser and ner (if present) all listen to the transformer component. The attribute_ruler and lemmatizer have the same configuration as in the CNN models.