
How to use the pretrained transformer model ("en_trf_bertbaseuncased_lg") in SpaCy?


I was wondering how I could use the pretrained transformer model en_trf_bertbaseuncased_lg from spaCy for future NLP tasks (NER, POS tagging, etc.). The documentation states that the model can only be used with the following pipeline components (https://spacy.io/models/en#en_trf_bertbaseuncased_lg):

  • sentencizer
  • trf_wordpiecer
  • trf_tok2vec

Can anyone explain to me what these components do and for which tasks they can be used? Or does anyone know a good source to read about them?

>>> import spacy
>>> nlp = spacy.load("en_trf_bertbaseuncased_lg")
>>> nlp.pipe_names
['sentencizer', 'trf_wordpiecer', 'trf_tok2vec']

Solution

  • trf_wordpiecer component

    • accessible via doc._.trf_alignment
    • performs the model’s wordpiece pre-processing

    Quote from the docs:

    Wordpiece is convenient for training neural networks, but it doesn't produce segmentations that match up to any linguistic notion of a "word". Most rare words will map to multiple wordpiece tokens, and occasionally the alignment will be many-to-many.

  • trf_tok2vec component

    • accessible via doc._.trf_last_hidden_state
    • stores the raw outputs of the transformer: one tensor with one row per wordpiece token
    • what you probably want, however, are the token-aligned features in doc.tensor
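The many-to-one (and occasionally many-to-many) segmentation the quote describes can be illustrated with a toy greedy longest-match-first WordPiece tokenizer. This is a simplified sketch of the scheme BERT-style models use, not spaCy's actual trf_wordpiecer component, and the vocabulary here is made up:

```python
# Toy greedy WordPiece tokenizer: longest-match-first, with "##" marking
# word-internal pieces. Simplified sketch only, not spaCy's implementation;
# the vocabulary is invented for illustration.
VOCAB = {"un", "##aff", "##able", "hello", "world"}

def wordpiece(word, vocab=VOCAB):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        prefix = "##" if start > 0 else ""  # mark word-internal pieces
        while end > start:
            piece = prefix + word[start:end]
            if piece in vocab:  # take the longest matching piece
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no piece matched at this position
        start = end
    return pieces

print(wordpiece("hello"))      # a common word maps to a single wordpiece
print(wordpiece("unaffable"))  # a rare word maps to several wordpieces
```

Note how a common word stays a single piece while a rare word is split into several, which is exactly why an alignment (doc._.trf_alignment) between linguistic tokens and wordpieces is needed.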

    See also this blog article introducing spaCy's transformer integration.
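To see how the token-aligned features in doc.tensor relate to the raw per-wordpiece tensor in doc._.trf_last_hidden_state, here is a minimal numpy sketch that pools wordpiece rows back onto the original tokens via an alignment. The shapes, the example alignment, and the sum-pooling are illustrative assumptions, not spaCy's exact code:

```python
import numpy as np

# Illustrative sketch: pool per-wordpiece hidden states back onto tokens
# using an alignment (token index -> list of wordpiece row indices).
# Shapes and sum-pooling are assumptions for illustration, not spaCy's code.
hidden = np.arange(12, dtype=float).reshape(4, 3)  # 4 wordpieces, dim 3
alignment = [[0], [1, 2], [3]]  # token 1 was split into 2 wordpieces

def token_aligned(hidden, alignment):
    # one row per original token: combine the rows of its wordpieces
    return np.stack([hidden[rows].sum(axis=0) for rows in alignment])

tensor = token_aligned(hidden, alignment)
print(tensor.shape)  # one vector per token, not per wordpiece
```

The result has one row per linguistic token, which is the shape downstream components (NER, POS tagging, etc.) expect, whereas the raw transformer output has one row per wordpiece.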