Tags: nlp, spacy, linguistics

Getting incorrect POS tagging


I am trying to get the POS tags for the sentence "dragon flies to rescue the princess" using the code below:

import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("dragon flies to rescue the princess")

# Print each token next to its coarse-grained POS tag, padded to 12 columns
for token in doc:
    print(f'{token.text:12} {token.pos_:12}')

Output of the above code:

dragon       NOUN         
flies        NOUN         
to           PART         
rescue       VERB         
the          DET          
princess     NOUN         

Here, 'flies' is tagged as NOUN although it is a VERB. Is that because spaCy is treating 'dragon flies' as a single word?

What should I do if I want to get "VERB" as the POS for 'flies'?


Solution

  • When running your example, there are two things to note:

    1. spaCy models are statistically trained, and each has a specific POS accuracy, in this case around 97%. Some mistakes are therefore always to be expected, especially when you are dealing with a corpus containing a wide variety of sentences.
    2. spaCy can only provide meaningful tags if the sentence is grammatically correct, which is not the case for your example: it has no determiner before "dragon" and no final punctuation.
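
To put the roughly 97% figure from point 1 in perspective, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that tagging errors are independent across tokens, which a real tagger does not guarantee; the function name is made up for this example.

```python
def p_all_tokens_correct(per_token_accuracy: float, n_tokens: int) -> float:
    """Chance that every token in a sentence is tagged correctly,
    ASSUMING errors are independent across tokens (a simplification)."""
    return per_token_accuracy ** n_tokens


# The question's sentence has 6 tokens; 0.97 is the approximate accuracy quoted above.
p = p_all_tokens_correct(0.97, 6)
print(f"P(all 6 tokens correct) = {p:.2f}")
```

Under that assumption, roughly one in six sentences of this length would contain at least one wrong tag, so an occasional error like the one in the question is not surprising.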

    When I run the corrected sentence "The dragon flies to rescue the princess.", the output is

    The          DET         
    dragon       NOUN        
    flies        VERB        
    to           PART        
    rescue       VERB        
    the          DET         
    princess     NOUN        
    .            PUNCT
    

    and is therefore exactly what we expected. Should your dataset contain sentences with such syntactic errors, the "easiest" solution would probably be to hand-annotate some of the examples and use spaCy's training functionality; details can be found in spaCy's training documentation. Even then, you are not guaranteed significantly better results unless you annotate a lot of data and can ensure that most of the samples contain "similar-looking" errors.
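
As a rough sketch of what hand-annotated data for this might look like, the snippet below pairs each token with a fine-grained tag and checks that the two lists line up before anything is handed to spaCy. The tag values (Penn Treebank style, e.g. VBZ for "flies"), the helper names and the single example are illustrative assumptions rather than data from the original post. The conversion step assumes spaCy v3's `Doc(vocab, words=...)` and `Example.from_dict`.

```python
# Hypothetical hand-annotated data: (tokens, fine-grained tags) pairs.
ANNOTATED = [
    (["dragon", "flies", "to", "rescue", "the", "princess"],
     ["NN", "VBZ", "TO", "VB", "DT", "NN"]),
]


def check_alignment(annotated):
    """Return the indices of examples whose token and tag counts differ."""
    return [i for i, (words, tags) in enumerate(annotated) if len(words) != len(tags)]


def to_spacy_examples(nlp, annotated):
    """Convert the pairs into spaCy v3 training examples.

    spaCy is imported lazily so the alignment check above works without it.
    """
    from spacy.tokens import Doc
    from spacy.training import Example

    examples = []
    for words, tags in annotated:
        doc = Doc(nlp.vocab, words=words)  # keep the hand-made tokenization as-is
        examples.append(Example.from_dict(doc, {"words": words, "tags": tags}))
    return examples


misaligned = check_alignment(ANNOTATED)
print("misaligned examples:", misaligned)
```

With spaCy and a model installed, `to_spacy_examples(spacy.load("en_core_web_md"), ANNOTATED)` would produce objects of the kind spaCy's training routines consume. How many such examples are needed to shift the tagger is not something this sketch can answer.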