Adding tagger to blank English spacy pipeline


I am having a hard time figuring out how to assemble spaCy pipelines bit by bit from built-in models in spaCy v3. I have downloaded the en_core_web_sm model and can load it with nlp = spacy.load("en_core_web_sm"). Processing sample text with this pipeline works just fine.

What I want, though, is to build an English pipeline from a blank one and add components bit by bit. I do NOT want to load the entire en_core_web_sm pipeline and exclude components. For the sake of concreteness, let's say I only want the spaCy default tagger in the pipeline. The documentation suggests to me that

import spacy

from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL
config = {"model": DEFAULT_TAGGER_MODEL}

nlp = spacy.blank("en")
nlp.add_pipe("tagger", config=config)
nlp("This is some sample text.")

should work. However I am getting this error related to hashembed:

Traceback (most recent call last):
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1000, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "spacy/pipeline/trainable_pipe.pyx", line 56, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/util.py", line 1507, in raise_error
    raise e
  File "spacy/pipeline/trainable_pipe.pyx", line 52, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__
  File "spacy/pipeline/tagger.pyx", line 111, in spacy.pipeline.tagger.Tagger.predict
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 315, in predict
    return self._func(self, X, is_train=False)[0]
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/with_array.py", line 30, in forward
    return _ragged_forward(
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/with_array.py", line 90, in _ragged_forward
    Y, get_dX = layer(Xr.dataXd, is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 44, in forward
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 44, in <listcomp>
    Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
    Y, inc_layer_grad = layer(X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
    return self._func(self, X, is_train=is_train)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/hashembed.py", line 61, in forward
    vectors = cast(Floats2d, model.get_param("E"))
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 216, in get_param
    raise KeyError(
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."


The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-8e2b4cf9fd33>", line 8, in <module>
    nlp("This is some sample text.")
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1003, in __call__
    raise ValueError(Errors.E109.format(name=name)) from e
ValueError: [E109] Component 'tagger' could not be run. Did you forget to call `initialize()`?

which hints that I should call initialize(). If I then run nlp.initialize(), I get this error:

Traceback (most recent call last):
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-eeec225a68df>", line 1, in <module>
    nlp.initialize()
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1273, in initialize
    proc.initialize(get_examples, nlp=self, **p_settings)
  File "spacy/pipeline/tagger.pyx", line 271, in spacy.pipeline.tagger.Tagger.initialize
  File "spacy/pipeline/pipe.pyx", line 104, in spacy.pipeline.pipe.Pipe._require_labels
ValueError: [E143] Labels for component 'tagger' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.

Now I am a bit at a loss. Which label examples? Where do I take them from? Why doesn't the default model config take care of that? Do I have to tell spacy to use en_core_web_sm somehow? If so, how can I do so without using spacy.load("en_core_web_sm") and excluding a whole bunch of stuff? Thanks for your hints!

EDIT: Ideally, I would like to be able to load only parts of the pipeline from a modified config file, like nlp = English.from_config(config). I cannot even use the config file shipped with en_core_web_sm as the resulting pipeline needs to be initialized as well, and upon nlp.initialize() I now receive

Traceback (most recent call last):
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-67-eeec225a68df>", line 1, in <module>
    nlp.initialize()
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1246, in initialize
    I = registry.resolve(config["initialize"], schema=ConfigSchemaInit)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 727, in resolve
    resolved, _ = cls._make(
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 776, in _make
    filled, _, resolved = cls._fill(
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 848, in _fill
    getter_result = getter(*args, **kwargs)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 98, in load_lookups_data
    lookups = load_lookups(lang=lang, tables=tables)
  File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/lookups.py", line 30, in load_lookups
    raise ValueError(Errors.E955.format(table=", ".join(tables), lang=lang))
ValueError: [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.

which suggests that it cannot find the required lookup tables.


Solution

  • nlp.add_pipe("tagger") adds a new blank/uninitialized tagger, not the tagger from en_core_web_sm or any other pretrained pipeline. If you add the tagger this way, you need to initialize and train it before you can use it.
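To illustrate what "initialize and train" means here, the following is a minimal sketch only. The two training sentences and their tags are made-up toy data (an assumption, not part of the question), and a real tagger needs a properly annotated corpus. It shows that passing examples to nlp.initialize() is what supplies the labels that error E143 complains about:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
tagger = nlp.add_pipe("tagger")  # blank, untrained tagger with no labels yet

# Hypothetical toy data: one tag per token, far too little for real use.
train_data = [
    ("I like cats", {"tags": ["PRP", "VBP", "NNS"]}),
    ("She eats fish", {"tags": ["PRP", "VBZ", "NN"]}),
]
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in train_data
]

# The labels are collected from the examples during initialization.
optimizer = nlp.initialize(get_examples=lambda: examples)

# A few update steps on the toy data, just to show the training call.
for _ in range(20):
    nlp.update(examples, sgd=optimizer)

doc = nlp("I like fish")
tags = [token.tag_ for token in doc]
print(tagger.labels)
print(tags)
```

After this, the pipeline runs without E109 or E143, but the predictions are only as good as the data it was trained on, which is why reusing a trained component is usually the more practical route.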

    You can add a component from an existing pipeline using the source option:

    nlp = spacy.blank("en")
    nlp.add_pipe("tagger", source=spacy.load("en_core_web_sm"))
    

    That said, it's possible that the tokenization from spacy.blank("en") is different from what the tagger in the source pipeline was trained on. In general (and especially once you move away from spaCy's pretrained pipelines), you should also make sure the tokenizer settings are the same, and loading the full pipeline while excluding components (the exclude argument of spacy.load) is an easy way to do this.

    Alternatively, you can copy the tokenizer settings in addition to using nlp.add_pipe(source=). This matters for models like scispacy's en_core_sci_sm, which is a good example of a pipeline whose tokenization differs from that of spacy.blank("en"):

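    import spacy

    nlp = spacy.blank("en")
    source_nlp = spacy.load("en_core_sci_sm")
    # Copy the source pipeline's tokenizer settings into the blank pipeline.
    nlp.tokenizer.from_bytes(source_nlp.tokenizer.to_bytes())
    # Reuse the trained tagger from the source pipeline.
    nlp.add_pipe("tagger", source=source_nlp)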
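
To see the effect of copying tokenizer settings without downloading anything, here is a small sketch. It assumes a second blank English pipeline with one custom special case as a stand-in for a source pipeline such as en_core_sci_sm. The "gimme" rule is purely illustrative:

```python
import spacy
from spacy.symbols import ORTH

# Stand-in for a source pipeline whose tokenizer differs from the default.
source_nlp = spacy.blank("en")
source_nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

nlp = spacy.blank("en")
before = [token.text for token in nlp("gimme that")]

# Copy the tokenizer settings (including special cases) from the source.
nlp.tokenizer.from_bytes(source_nlp.tokenizer.to_bytes())
after = [token.text for token in nlp("gimme that")]

print(before)
print(after)
```

The same text is split differently before and after the copy, which is exactly the kind of mismatch that can hurt a sourced tagger if the tokenizer settings are not carried over.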