Tags: spacy, training-data, named-entity-recognition, webanno

Does a named entity label in spaCy have to match the label used in the annotated training data?


I want to train spaCy's NER model on my own corpus, which was annotated with WebAnno. Unfortunately, the name of one NE category in spaCy does not match the corresponding name in WebAnno: in WebAnno the label is "OTH", whereas spaCy calls it "MISC" (semantically, they are the same). Would this mismatch hurt the training process or the test accuracy? Is it necessary to train an additional NE type "OTH" in this case? Thank you for your help!

spaCy version used: 2.2.5


Solution

  • Yes, you want to keep the annotations aligned: the model only matches label strings, so it treats "OTH" and "MISC" as two unrelated entity types. If it's a one-off operation, it might be easiest to brute-force the problem by replacing the label string in your data.

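    A more canonical option might be spaCy's tag map: https://spacy.io/usage/adding-languages#tag-map. Note, however, that the tag map is documented for part-of-speech tags rather than entity labels, so it may not carry over to NER. Quote: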

    [...] you need to define how [your tags] map down to the Universal Dependencies tag set.

    Their example:

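    # Relative import as used inside spaCy's own language packages;
    # from your own code, use: from spacy.symbols import POS, NOUN, VERB, DET
    from ..symbols import POS, NOUN, VERB, DET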
    
    TAG_MAP = {
        "NNS":  {POS: NOUN, "Number": "plur"},
        "VBG":  {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
        "DT":   {POS: DET}
    }
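
The brute-force replacement mentioned above could look roughly like the sketch below. It assumes the WebAnno export has already been converted to spaCy v2's simple training format, i.e. a list of `(text, {"entities": [(start, end, label)]})` tuples; the names `LABEL_MAP` and `relabel` and the example sentence are made up for illustration and are not part of spaCy.

```python
# Hypothetical mapping: WebAnno's "OTH" is renamed to spaCy's "MISC".
LABEL_MAP = {"OTH": "MISC"}


def relabel(train_data, label_map):
    """Return a copy of train_data with entity labels renamed via label_map."""
    relabeled = []
    for text, annotations in train_data:
        entities = [
            # Labels missing from label_map are kept unchanged.
            (start, end, label_map.get(label, label))
            for start, end, label in annotations.get("entities", [])
        ]
        # Keep any other annotation keys as they are; replace only "entities".
        relabeled.append((text, {**annotations, "entities": entities}))
    return relabeled


# Made-up example in the assumed format (character offsets, end exclusive).
TRAIN_DATA = [
    (
        "Angela Merkel watched the Oktoberfest parade.",
        {"entities": [(0, 13, "PER"), (26, 37, "OTH")]},
    ),
]

converted = relabel(TRAIN_DATA, LABEL_MAP)
print(converted[0][1]["entities"])
```

Because the function builds a new list rather than editing in place, the original annotations stay available for comparison, and the same mapping can be applied to the evaluation data so that both sets use identical labels.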