Tags: python, nlp, spacy, named-entity-recognition

Using spaCy's Matcher without a model


I want to use spaCy's Matcher class on a new language (Hebrew) for which spaCy does not yet have a working model.

I found a working tokenizer + POS tagger (from Stanford NLP), yet I would prefer spaCy as its Matcher can help me do some rule-based NER.

Can the rule-based Matcher be fed with POS-tagged text instead of the standard NLP pipeline?


Solution

  • You can set the words and tags of a spaCy Doc by hand from another source and then run the Matcher on it. Here's an example that uses English words and tags purely for demonstration:

    from spacy.lang.he import Hebrew
    from spacy.tokens import Doc
    from spacy.matcher import Matcher
    
    words = ["my", "words"]
    tags = ["PRP$", "NNS"]
    
    nlp = Hebrew()
    doc = Doc(nlp.vocab, words=words)
    for i in range(len(doc)):
        doc[i].tag_ = tags[i]
    
    # This is normally set by the tagger. The Matcher validates that
    # the Doc has been tagged when you use the `"TAG"` attribute.
    doc.is_tagged = True
    
    matcher = Matcher(nlp.vocab)
    pattern = [{"TAG": "PRP$"}]
    # spaCy v2 signature: add(key, on_match, *patterns); None means no callback.
    matcher.add("poss", None, pattern)
    print(matcher(doc))
    # [(440, 0, 1)]
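
The snippet above uses the spaCy v2 API. As a rough sketch of the same idea under spaCy v3 (assuming a 3.x install), the tags can be passed straight to the Doc constructor, `doc.is_tagged` is replaced by `doc.has_annotation("TAG")`, and `Matcher.add` takes a list of patterns. The words and tags are still placeholder English data standing in for the output of an external tagger:

```python
from spacy.lang.he import Hebrew
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Placeholder data standing in for the output of an external tagger.
words = ["my", "words"]
tags = ["PRP$", "NNS"]

nlp = Hebrew()  # blank language class; no trained pipeline is loaded

# In v3 the Doc constructor accepts the fine-grained tags directly.
doc = Doc(nlp.vocab, words=words, tags=tags)

# v3 replacement for the removed `doc.is_tagged` flag.
assert doc.has_annotation("TAG")

matcher = Matcher(nlp.vocab)
# v3 signature: a key plus a list of patterns.
matcher.add("poss", [[{"TAG": "PRP$"}]])

matches = matcher(doc)
for match_id, start, end in matches:
    # Map the numeric match ID back to its string key and show the span text.
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```

Each match is a `(match_id, start, end)` tuple, so `doc[start:end]` gives the matched span regardless of which tagger produced the tags.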