python nlp spacy named-entity-recognition

Using spaCy's Matcher without a model

I want to use spaCy's Matcher class on a new language (Hebrew) for which spaCy does not yet have a working model.

I found a working tokenizer and POS tagger (from Stanford NLP), but I would prefer to use spaCy because its Matcher would let me do some rule-based NER.

Can the rule-based Matcher be fed text that has already been POS-tagged elsewhere, instead of text processed by the standard spaCy pipeline?

Solution

Yes. You can build a spaCy Doc by hand from words and tags produced by another tool, and then run the Matcher on it. Here's an example that uses English words and tags just to demonstrate:

from spacy.lang.he import Hebrew
from spacy.tokens import Doc
from spacy.matcher import Matcher

words = ["my", "words"]
tags = ["PRP$", "NNS"]

nlp = Hebrew()
doc = Doc(nlp.vocab, words=words)
for i in range(len(doc)):
    doc[i].tag_ = tags[i]

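# This flag is normally set by the tagger. In spaCy v2.x the Matcher
# checks it before it will match on the "TAG" attribute. (spaCy v3
# dropped these is_* flags in favor of Doc.has_annotation, so this
# line applies to v2.x only.)
doc.is_tagged = True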

matcher = Matcher(nlp.vocab)
pattern = [{"TAG": "PRP$"}]
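# spaCy v2.x signature: add(key, on_match, *patterns).
# In v3 the call is matcher.add("poss", [pattern]) instead.
matcher.add("poss", None, pattern)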
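print(matcher(doc))
# Each match is a (match_id, start, end) tuple, where match_id is the
# hash of the "poss" key and start/end are token indices.
# [(440, 0, 1)]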
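
The snippet above targets spaCy v2.x. For reference, here is a sketch of how the same idea might look on spaCy v3, assuming the `tags` keyword of the `Doc` constructor and the list-based `Matcher.add` signature. The words, tags, and the "poss" key are the same illustrative values as above, not real Hebrew data:

```python
from spacy.lang.he import Hebrew
from spacy.tokens import Doc
from spacy.matcher import Matcher

# Tokens and fine-grained tags as they might come from an external tagger
# (English placeholders, as in the example above).
words = ["my", "words"]
tags = ["PRP$", "NNS"]

nlp = Hebrew()

# Assumption: spaCy v3, where Doc accepts per-token tags directly,
# so no manual loop and no is_tagged flag are needed.
doc = Doc(nlp.vocab, words=words, tags=tags)

matcher = Matcher(nlp.vocab)
# In v3 the second argument is a list of patterns.
matcher.add("poss", [[{"TAG": "PRP$"}]])

# Convert the hashed match_id back to its string key for readability.
matches = [
    (nlp.vocab.strings[match_id], start, end)
    for match_id, start, end in matcher(doc)
]
print(matches)
# [('poss', 0, 1)]
```

Looking the match ID up in `nlp.vocab.strings` is optional; it just makes the output easier to read than the raw hash.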