python dependencies spacy named-entity-recognition spacy-3

How to use a custom named entities dataset in spaCy's DependencyMatcher?


Suppose I have created a spaCy model or dataset containing all named entities from a certain text, each tagged as PERSON. How can I apply it in DependencyMatcher if I need to extract "person" - "root verb" pairs? In other words, I want DependencyMatcher to use my own, already prepared dataset of names rather than its own model for identifying people's names.

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_lg")


def on_match(matcher, doc, i, matches):
    # Callback invoked per match; i is the index of the current match in matches.
    # The matcher ignores the return value.
    return matches


patterns = [
    [  # pattern 1: (sur)name + verb, e.g. "Jack lived"
        {
            "RIGHT_ID": "person",
            "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"},
        },
        {
            # "person < verb": the person token is an immediate dependent of the verb
            "LEFT_ID": "person",
            "REL_OP": "<",
            "RIGHT_ID": "verb",
            "RIGHT_ATTRS": {"POS": "VERB"},
        },
    ],
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", patterns, on_match=on_match)

Solution

  • The DependencyMatcher does not have a "custom model of identifying people's names" - that's the NER component in the pipeline you loaded. In this case you should:

    1. Disable the NER component
    2. Use an EntityRuler to label names
    3. Use the DependencyMatcher as usual

    To disable a component you can just do this:

    nlp = spacy.load("en_core_web_lg", disable=["ner"])
    

    To match names from your list with an EntityRuler, see the rule-based matching docs.
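As a minimal sketch of step 2, the snippet below builds PERSON patterns from a list of names and adds them to an EntityRuler. The `names` list and the example sentence are hypothetical stand-ins for your own data, and a blank English pipeline is used only so the snippet runs without downloading a model.

```python
import spacy

# Hypothetical list of names standing in for your own dataset.
names = ["Jack", "Mary Smith"]

# A blank pipeline has no NER component, so only the ruler assigns entities.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

# String patterns are matched as exact phrases against the tokenized text.
ruler.add_patterns([{"label": "PERSON", "pattern": name} for name in names])

doc = nlp("Jack lived in London with Mary Smith.")
people = [(ent.text, ent.label_) for ent in doc.ents]
print(people)
```

With `en_core_web_lg` loaded and `ner` disabled, the same `add_pipe("entity_ruler")` and `add_patterns` calls should apply, and the parser in that pipeline supplies the dependencies the DependencyMatcher needs.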


    Note that the above assumes you have a list of names, rather than annotations in sentences marking exactly which tokens are names. If you have explicitly annotated names, then you can skip step 2 - disabling the NER component will be enough to leave only your existing annotations.
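For the explicitly annotated case, here is a rough sketch of how existing PERSON annotations and the DependencyMatcher could fit together. Everything in it is illustrative: the sentence, the hand-written parse (heads, dependency labels, POS tags) and the entity span are assumptions made so the example runs without a trained model. In practice the parse would come from the loaded pipeline with `ner` disabled.

```python
import spacy
from spacy.matcher import DependencyMatcher
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

# Hand-built parse for "Jack lived in London ." (an assumption for this sketch;
# normally the parser in en_core_web_lg would produce it).
doc = Doc(
    nlp.vocab,
    words=["Jack", "lived", "in", "London", "."],
    heads=[1, 1, 1, 2, 1],  # absolute index of each token's head
    deps=["nsubj", "ROOT", "prep", "pobj", "punct"],
    pos=["PROPN", "VERB", "ADP", "PROPN", "PUNCT"],
)

# Existing annotation: token 0 ("Jack") is a PERSON.
doc.ents = [Span(doc, 0, 1, label="PERSON")]

pattern = [
    {"RIGHT_ID": "person", "RIGHT_ATTRS": {"ENT_TYPE": "PERSON", "DEP": "nsubj"}},
    {
        "LEFT_ID": "person",
        "REL_OP": "<",  # person is an immediate dependent of the verb
        "RIGHT_ID": "verb",
        "RIGHT_ATTRS": {"POS": "VERB"},
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("PERVERB", [pattern])

# Each match is (match_id, token_ids), with token_ids in pattern order.
pairs = [(doc[ids[0]].text, doc[ids[1]].text) for _, ids in matcher(doc)]
print(pairs)
```

Here `pairs` holds the ("Jack", "lived") pair, because the only PERSON token in the sketch is the `nsubj` of a verb.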