I want to use spaCy's Matcher class on a new language (Hebrew) for which spaCy does not yet have a working model.
I found a working tokenizer + POS tagger (from Stanford NLP), yet I would prefer spaCy as its Matcher can help me do some rule-based NER.
Can the rule-based Matcher be fed POS-tagged text produced by another tool, instead of relying on spaCy's standard NLP pipeline?
Yes. You can build a spaCy `Doc` by hand from words and tags that come from another source, and then run the Matcher on it. Here's an example that uses English words and tags just to demonstrate:
from spacy.lang.he import Hebrew
from spacy.tokens import Doc
from spacy.matcher import Matcher
words = ["my", "words"]
tags = ["PRP$", "NNS"]
nlp = Hebrew()
doc = Doc(nlp.vocab, words=words)
for i in range(len(doc)):
    doc[i].tag_ = tags[i]
# This flag is normally set by the tagger. In spaCy v2 the Matcher checks
# that the Doc has been tagged before it matches on the `"TAG"` attribute.
# (In v3 the flag is deprecated in favor of `doc.has_annotation("TAG")`.)
doc.is_tagged = True
matcher = Matcher(nlp.vocab)
pattern = [{"TAG": "PRP$"}]
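# spaCy v2 signature: add(key, on_match_callback, *patterns).
matcher.add("poss", None, pattern)
print(matcher(doc))
# [(440, 0, 1)]  i.e. (match_id, start, end): one match covering token 0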
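
The snippet above targets spaCy v2. If you are on spaCy v3, a rough equivalent might look like the sketch below. It assumes v3 is installed, where `Doc` accepts a `tags` argument and `Matcher.add` takes a list of patterns. The `tagged` list and the `poss_noun` key are made-up names for illustration; in practice the (word, tag) pairs would come from your external tokenizer and tagger.

```python
from spacy.lang.he import Hebrew
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Hypothetical output of an external tokenizer + POS tagger: (word, tag) pairs.
tagged = [("my", "PRP$"), ("words", "NNS")]
words = [word for word, _ in tagged]
tags = [tag for _, tag in tagged]

nlp = Hebrew()
# In v3 the tags can be passed to the constructor, so there is no
# per-token loop and no is_tagged flag to set.
doc = Doc(nlp.vocab, words=words, tags=tags)

matcher = Matcher(nlp.vocab)
# v3 signature: a key plus a list of patterns (each pattern is a list of dicts).
matcher.add("poss_noun", [[{"TAG": "PRP$"}, {"TAG": "NNS"}]])

# Resolve the numeric match id back to its string key for readability.
matches = [
    (nlp.vocab.strings[match_id], start, end)
    for match_id, start, end in matcher(doc)
]
print(matches)
```

The two-token pattern is only there to show that tag sequences work the same way as single tags, which is usually what rule-based NER needs.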