python nlp spacy

How to force a certain tag in spaCy?


I'm using spaCy '3.0.0rc2' with a custom model. Unfortunately my training data contains few hyphens (-), so the hyphen often gets tagged as NOUN.

Is there some way to force a certain tag or pos, to make sure that all the - tokens get tagged with PUNCT?

Basically I am looking for a solution like the one proposed in the answer to this question: How to force a pos tag in spacy before/after tagger?

Unfortunately this does not seem to work anymore (at least for spaCy 3) and raises an error:

ValueError: [E1005] Unable to set attribute 'POS' in tokenizer exception for '{G}'. Tokenizer exceptions are only allowed to specify ORTH and NORM.

(Same when trying to assign the TAG attribute)

I know that it would be possible to create a custom component with a Matcher that looks just for the hyphen and assigns the right tag. However, this seems like overkill given that I currently just want to handle one token.

Is there some way to force tags in spaCy 3, without re-tagging during processing using a custom component?

Ideally I would want to modify the TAG attribute and let spaCy assign the POS attribute automatically based on that TAG. As in the spaCy annotation scheme, TAG=HYPH should be mapped to POS=PUNCT.


Solution

  • In spaCy v3, exceptions like this can be implemented in the attribute_ruler component:

    ruler = nlp.add_pipe("attribute_ruler")
    patterns = [[{"ORTH": "-"}]]
    attrs = {"TAG": "HYPH", "POS": "PUNCT"}
    ruler.add(patterns=patterns, attrs=attrs)
    

    Be aware that the attribute ruler runs the pattern matching once based on the initial Doc state, so you can't use the output attrs of one rule as the input pattern for another. This comes up in pipelines like en_core_web_sm, where the included attribute ruler does the tag->pos mapping. So if you have another rule that should match on a pos pattern, you'd have to add a second attribute ruler component to handle those cases.

    See: https://nightly.spacy.io/api/attributeruler
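
The two-ruler setup described above can be sketched roughly as follows. This is only an illustration under some assumptions: it uses a blank English pipeline instead of a trained model, the component name "tag_to_pos" is made up for this example, and the example sentence is arbitrary. The first ruler forces TAG=HYPH on the hyphen, and a second ruler, which runs afterwards and therefore sees the updated Doc, maps TAG=HYPH to POS=PUNCT.

```python
import spacy

# Blank English pipeline (tokenizer only), so no trained model is required.
# In a real pipeline the rulers would be placed after the tagger.
nlp = spacy.blank("en")

# First ruler: force the fine-grained tag on every "-" token.
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "-"}]], attrs={"TAG": "HYPH"})

# Second ruler ("tag_to_pos" is a hypothetical name chosen for this sketch).
# It matches on the TAG set by the first ruler, which a rule inside the
# first ruler could not do, because each ruler matches on the Doc state
# it receives as input.
tag_to_pos = nlp.add_pipe("attribute_ruler", name="tag_to_pos")
tag_to_pos.add(patterns=[[{"TAG": "HYPH"}]], attrs={"POS": "PUNCT"})

doc = nlp("a well - known example")
hyphen = [token for token in doc if token.text == "-"][0]
print(hyphen.text, hyphen.tag_, hyphen.pos_)
```

If the pipeline already contains a component named "attribute_ruler", the first add_pipe call would need a different name as well, or the existing component could be fetched with nlp.get_pipe("attribute_ruler") and extended instead.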