Tags: nlp, pos-tagger, spacy-3

How can I enhance morphological information for English models in spaCy?


I am trying to detect verbs in the imperative mood using the English models in spaCy, but the morphological features I get are inconsistent with the examples in the Morphology documentation. This is similar to the unanswered question "Extracting English imperative mood from verb tags with spaCy". Specifically, there seem to be very few mood features identified.

I am not sure if I am missing some configuration or if I need to somehow train the model to better identify morphological features. Before I go down the path of training, I'd like to understand why what I am doing is not matching the documentation.

I have written a small example that demonstrates the discrepancy.

'''
Prerequisites

pip install spacy
python -m spacy download en_core_web_lg
'''
import spacy

nlp = spacy.load("en_core_web_lg")

def show_morph_as_markdown_table(doc):
    print("|Context|Token|Lemma|POS|TAG|MORPH|")
    print("|----|----|----|----|----|----|")
    for token in doc:
        print(f'|{doc}|{token.text}|{token.lemma_}|{token.pos_}|{token.tag_}|{token.morph.to_dict()}|')

def show_morph_for_sentences_as_markdown_table(sentences):
    sentence_docs = list(nlp.pipe(sentences))
    for sentence_doc in sentence_docs:
        show_morph_as_markdown_table(sentence_doc)

example_sentences = [
    "I was reading the paper",
    "I don’t watch the news, I read the paper",
    "I read the paper yesterday"
]

show_morph_for_sentences_as_markdown_table(example_sentences)

I have trimmed the output to include only rows shown in the Morphology documentation.

|Context|Token|Lemma|POS|TAG|MORPH|
|----|----|----|----|----|----|
|I was reading the paper|reading|read|VERB|VBG|{'Aspect': 'Prog', 'Tense': 'Pres', 'VerbForm': 'Part'}|
|I don’t watch the news, I read the paper|read|read|VERB|VBD|{'Tense': 'Past', 'VerbForm': 'Fin'}|
|I read the paper yesterday|read|read|VERB|VBP|{'Tense': 'Pres', 'VerbForm': 'Fin'}|

This is very different from the expected output of:

|Context|Token|Lemma|POS|TAG|MORPH|
|----|----|----|----|----|----|
|I was reading the paper|reading|read|VERB|VBG|{'VerbForm': 'Ger'}|
|I don’t watch the news, I read the paper|read|read|VERB|VBD|{'VerbForm': 'Fin', 'Mood': 'Ind', 'Tense': 'Pres'}|
|I read the paper yesterday|read|read|VERB|VBP|{'VerbForm': 'Fin', 'Mood': 'Ind', 'Tense': 'Past'}|

I've tried adding a morphologizer to the pipeline using the DEFAULT_MORPH_MODEL but was met with an initialization error. I don't know enough about the pipeline yet to understand why.

from spacy.pipeline.morphologizer import DEFAULT_MORPH_MODEL

config = {"model": DEFAULT_MORPH_MODEL}
nlp.add_pipe("morphologizer", config=config)

# ValueError: [E109] Component 'morphologizer' could not be run. Did you forget to call `initialize()`?

# trying to fix above error with the following
nlp.initialize()

# [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.

In researching further, it appears that spaCy version 3 manages tag_map and morph_rules with AttributeRuler. Could it be possible that the downloadable models aren't including the same information that the documentation is using?
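One way to check what a loaded pipeline actually emits, rather than what the documentation shows, is to list its components and tally every feature/value pair it produces. The following is only a minimal sketch: `count_morph_features` and `report` are hypothetical helper names, not part of spaCy, and the only spaCy calls assumed are `nlp.pipe_names`, `nlp.pipe` and `token.morph.to_dict()`.

```python
from collections import Counter


def count_morph_features(docs):
    """Count every Feature=Value pair that appears on any token in docs."""
    counts = Counter()
    for doc in docs:
        for token in doc:
            # Same call as in the table-printing example: a dict of features.
            for feature, value in token.morph.to_dict().items():
                counts[f"{feature}={value}"] += 1
    return counts


def report(nlp, sentences):
    """Print the pipeline components and the features they produce."""
    # pipe_names shows whether e.g. an attribute_ruler or morphologizer is present.
    print("Components:", nlp.pipe_names)
    counts = count_morph_features(nlp.pipe(sentences))
    for pair, n in counts.most_common():
        print(f"{pair}\t{n}")
    return counts
```

Calling `report(nlp, example_sentences)` with the `nlp` object and sentence list from the example above would show whether any `Mood=...` pair appears at all for a given model.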

I'm hoping for an easy configuration fix that I am missing or a pointer to the right rabbit hole (I've been down many).


Solution

  • This table in the docs is just meant to be a generic example of the kinds of annotation you might see. The exact annotation from each individual model may be different, and it can also change between releases/versions of the same model.

    You're not going to have much luck detecting imperatives using the en_core_web_* models because the training data doesn't distinguish imperatives from other forms. The rules that handle the tagset conversion are largely based on this table (note that there's no Mood=Imp for any PTB tag):

    https://universaldependencies.org/tagset-conversion/en-penn-uposf.html

    However, it does look like some of the UD English corpora do include Mood=Imp or use fine-grained tags that distinguish imperatives. To start out, you could test out a pretrained UD English EWT model from a tool like Stanza or Trankit to see if that works well enough for your task. It can be a difficult distinction to make, so I don't know how good the overall performance might be, though.

    If you'd like to keep working with spaCy, you could use spacy-stanza with the default "en" models, which are trained on UD English EWT.
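The spacy-stanza suggestion could be tried along these lines. This is a rough sketch, not a tested recipe: it assumes `pip install spacy-stanza` and that the Stanza "en" model can be downloaded, and `imperative_verbs` and `load_stanza_pipeline` are hypothetical helper names. Whether `Mood=Imp` is actually predicted, and how reliably, depends on the Stanza model.

```python
def imperative_verbs(doc):
    """Return verb tokens whose morphology includes Mood=Imp."""
    return [
        token
        for token in doc
        # morph.get returns a list of values for the feature (empty if absent).
        if token.pos_ in ("VERB", "AUX") and "Imp" in token.morph.get("Mood")
    ]


def load_stanza_pipeline(lang="en"):
    """Load a Stanza model wrapped as a spaCy pipeline via spacy-stanza."""
    # Imported here so imperative_verbs can be used without Stanza installed.
    import stanza
    import spacy_stanza

    stanza.download(lang)  # fetches the default models for the language
    return spacy_stanza.load_pipeline(lang)
```

With the packages installed, `nlp = load_stanza_pipeline()` followed by `imperative_verbs(nlp("Read the paper."))` returns the tokens the model marks as imperative, which can then be compared against hand-labelled sentences before relying on it.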