python nlp spacy matcher

spaCy: custom attributes not matching correctly?


I have been having problems using custom extension attributes with the recently improved Matcher (spaCy 2.0.12). Even a simple example (mostly copied from here) does not work as I expected:

import spacy
from spacy.tokens import Token
from spacy.matcher import Matcher

nlp = spacy.load('en')
text = 'I have apple. I have had nothing.'
doc = nlp(text)


def on_match(matcher, doc, id, matches):
    print('Matched!', matches)


Token.set_extension('is_fruit', getter=lambda token: token.text in ('apple', 'banana'))
pattern1 = [{'LEMMA': 'have'}, {'_': {'is_fruit': True}}]
matcher = Matcher(nlp.vocab)
matcher.add('HAVING_FRUIT', on_match, pattern1)
matches = matcher(doc)
print(matches)

This gives the following output:

[(13835066833201802823, 1, 2), (13835066833201802823, 5, 6), (13835066833201802823, 6, 7)]

In other words, the rule correctly matches on the span 'have' (1, 2), but incorrectly matches 'have' (5, 6) and 'had' (6, 7). Furthermore, the callback function is not called. The custom attribute appears to be ignored.

When I add a new pattern, as follows:

Token.set_extension('nope', default=False)
pattern2 = [{'LEMMA': 'nothing'}]
matcher.add('NADA', on_match, pattern2)

matches = matcher(doc)
print(matches)

I get the following output:

[(12682145344353966206, 1, 2), (12682145344353966206, 5, 6), (12682145344353966206, 6, 7)]
Matched! [(12682145344353966206, 1, 2), (12682145344353966206, 5, 6), (12682145344353966206, 6, 7), (5033951595686580046, 7, 8)]
[(12682145344353966206, 1, 2), (12682145344353966206, 5, 6), (12682145344353966206, 6, 7), (5033951595686580046, 7, 8)]

The first rule functions as above. Then the second rule triggers, along with the callback function (which prints the message). There is an additional correct match for the new pattern along with the correct and erroneous matches from the first rule.

So, I have a few questions:

  1. Why does pattern1 match incorrectly (i.e. why is the _ custom attribute constraint not applied)?
  2. Why is the callback function not called on the first call?
  3. Why is it called once a new rule is added?

In my own code, when using custom attributes as constraints in subsequent patterns, these patterns match on ALL tokens. I assume this is related to the behaviour exhibited by the code above.


Solution

  • Sorry if this was confusing, but the GitHub thread you're referring to is still only the spec and proposal, i.e. the planned implementation. The changes will hopefully ship with spaCy v2.1.0 rather than a 2.0.x patch release, since some of the changes to the Matcher internals are not fully backwards compatible.

    While the custom attribute matching isn't implemented yet, the basic improvements to the Matcher engine are already available on the develop branch and in the alpha version via spacy-nightly (pip install spacy-nightly). Those updates likely also resolve the inconsistent behaviour you observed with the callback function.
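
Until custom attribute matching is available, one possible workaround is to match on built-in attributes only and then filter the results in plain Python. The sketch below is an illustration, not part of the answer above: `keep_fruit_matches`, the stand-in tokens, and the match tuples are all hypothetical, and it assumes the pattern was written so that each match spans 'have' plus the following token. It uses stand-in objects so that it runs without spaCy installed.

```python
from types import SimpleNamespace

FRUITS = {"apple", "banana"}


def is_fruit(token):
    """Same check as the `is_fruit` getter in the question."""
    return token.text in FRUITS


def keep_fruit_matches(doc, matches):
    """Keep only (match_id, start, end) spans whose last token is a fruit."""
    return [
        (match_id, start, end)
        for match_id, start, end in matches
        if is_fruit(doc[end - 1])
    ]


# Stand-in tokens (assumption): a real spaCy Doc is indexed the same way.
words = "I have apple . I have had nothing .".split()
doc = [SimpleNamespace(text=word) for word in words]

# Hypothetical matcher output for a two-token pattern: 'have' + any token.
matches = [(1, 1, 3), (1, 5, 7), (1, 6, 8)]

kept = keep_fruit_matches(doc, matches)
print(kept)  # [(1, 1, 3)]
```

With a real Doc, the check inside the filter could read the extension directly, e.g. `doc[end - 1]._.is_fruit`, once `Token.set_extension('is_fruit', ...)` has been called as in the question.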