python nlp spacy

Patterns with multi-term entries in the IN attribute


I am extending a spaCy model using rules. While looking through the documentation, I noticed the IN attribute, which matches a token attribute against a list of allowed values. This is great, but it only works on single tokens.

For example, this pattern: {"label":"EXAMPLE","pattern":[{"LOWER": {"IN": ["such as", "like", "for example"]}}]} only matches the term like, not the other two, because each of them spans more than one token.
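The behaviour can be reproduced with a small sketch. It assumes spaCy v3 and uses the plain Matcher rather than the EntityRuler, so the "label" wrapper is dropped; the example sentence is made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Same token pattern as above: a single dict, so it describes a single token
pattern = [{"LOWER": {"IN": ["such as", "like", "for example"]}}]
matcher.add("EXAMPLE", [pattern])

# Illustrative sentence (not from the original question)
doc = nlp("Fruits such as apples, or snacks like chips")

# "such" and "as" are separate tokens, so neither equals the entry "such as"
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

Only the single-token entry can ever be equal to one token's lowercase text, which is why the multi-word entries are silently ignored.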

What is the best way to achieve the same result for multi-token terms?


Solution

  • It depends on how complicated the intended patterns are, but the PhraseMatcher can handle cases like the one above by matching on the LOWER attribute:

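    import spacy
    from spacy.matcher import PhraseMatcher
    
    nlp = spacy.blank("en")
    # attr="LOWER" compares lowercased token text, so matching is case-insensitive
    pmatcher = PhraseMatcher(nlp.vocab, attr="LOWER")
    phrases = ["such as", "like", "for example"]
    # Patterns are Doc objects, so multi-token phrases are supported
    pmatcher.add("EXAMPLE", [nlp.make_doc(x) for x in phrases])
    # Each match is (match_id, start, end); match_id is the hash of "EXAMPLE"
    matches = pmatcher(nlp("Things Such As Books"))
    assert matches == [(nlp.vocab.strings["EXAMPLE"], 1, 3)]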
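Since the pattern in the question is written in EntityRuler format, the same idea may also be expressed there directly. The following is a minimal sketch, assuming spaCy v3: when a pattern is a plain string instead of a list of token dicts, the EntityRuler treats it as a phrase pattern, and the phrase_matcher_attr setting controls which attribute is compared. The example sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# phrase_matcher_attr="LOWER" makes string (phrase) patterns case-insensitive
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

phrases = ["such as", "like", "for example"]
# A string pattern is handled as a phrase pattern and may span several tokens
ruler.add_patterns([{"label": "EXAMPLE", "pattern": p} for p in phrases])

# Illustrative sentence (not from the original question)
doc = nlp("Things Such As Books, For Example novels")
spans = [(ent.text, ent.label_) for ent in doc.ents]
print(spans)
```

This keeps all terms in one list, much like IN does for single tokens. For patterns that need more than exact phrases (optional tokens, lemmas, and so on), writing one token-level pattern per phrase is probably the more flexible route.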