python nlp spacy spacy-3

Rename spaCy's POS tagger labels


I'm looking for something specific and haven't found an answer: I want to rename the POS tag labels that spaCy returns. For example, given this code:

import spacy

def eng(textstr):
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(textstr)
    for token in doc:
        print("Word: "+token.text+ " "+"POS: "+token.pos_)

I want token.pos_ to give me NIA instead of NOUN, BO instead of VERB, and so on. I'd rather not retrain anything if I can avoid it: the tagger's results are accurate enough for me, and I only want to rename each label. First, is this possible, and if so, how can I do it? The first thing that came to mind was a simple if statement:

if token.pos_ == "NOUN":
    print("Word: "+token.text+ " "+"POS: NIA")

But that isn't practical, because I would then have to change about 5000 functions. Is there another way? Thank you very much for your help!


Solution

  • This is not possible. The .pos attribute only holds Universal Dependencies (UD) tags, and will give an error if you try to set any other value. You can set any value in the .tag attribute if you want, though it's designed for language-specific fine-grained tags, which have more detail than UD tags.

    I am not really sure why you would want to do this instead of just getting used to the real tags, and I suspect that trying to change this will cause you a lot of headaches for little benefit, like redefining keywords in a programming language.

    That said, the easiest way to do this is probably to define a custom token extension, call it my_pos, that translates the real tags to your tags. That would look a little like this:

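    import spacy
    from spacy.tokens import Token
    
    nlp = spacy.load("en_core_web_sm")
    
    # Map UD tags to your own labels; add the remaining tags as needed.
    POS_MAP = {"NOUN": "NIA", "VERB": "BO"}
    
    def my_pos_getter(token):
        # Fall back to the original tag when no custom label is defined.
        return POS_MAP.get(token.pos_, token.pos_)
    
    Token.set_extension("my_pos", getter=my_pos_getter)
    
    doc = nlp("I have a pen")
    assert doc[3]._.my_pos == "NIA"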
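
As a rough sketch of how the question's eng function might use such an extension, the following keeps the renaming logic in a plain helper and registers the extension only once. The names rename_pos and register_my_pos, and passing nlp in as a parameter, are assumptions made for illustration; only the NOUN and VERB entries come from the question. The has_extension check is there because calling Token.set_extension twice with the same name raises an error unless force=True is passed.

```python
# Hypothetical mapping: only NOUN -> NIA and VERB -> BO come from the question;
# extend it with whatever other labels you need.
POS_MAP = {"NOUN": "NIA", "VERB": "BO"}


def rename_pos(pos):
    """Return the custom label for a UD tag, or the tag itself if unmapped."""
    return POS_MAP.get(pos, pos)


def register_my_pos():
    """Register token._.my_pos once; safe to call repeatedly."""
    # Imported lazily so the mapping helper works even without spaCy installed.
    from spacy.tokens import Token

    if not Token.has_extension("my_pos"):
        Token.set_extension("my_pos", getter=lambda token: rename_pos(token.pos_))


def eng(textstr, nlp):
    """Print each word with its renamed tag; `nlp` is an already loaded pipeline."""
    register_my_pos()
    for token in nlp(textstr):
        print("Word: " + token.text + " POS: " + token._.my_pos)
```

To try it, load the pipeline once with nlp = spacy.load("en_core_web_sm") and call eng("I have a pen", nlp). Existing functions would still need token.pos_ replaced by token._.my_pos, but that is a mechanical search-and-replace rather than a new if statement in each one.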