nlp spacy text-parsing matcher dependency-parsing

SpaCy Matcher - Restricting Potential Matches

Not too sure exactly how to word the problem, so thank you for indulging the title...

I'm using SpaCy's Matcher to parse clauses (adverbial/prepositional/etc.) as a part of pre-processing. Some of these clauses are fairly complex and it would be impossible to create strict rules for every instance. Consequently, I have used {'OP': '*'} in my Matcher to account for the tokens that I cannot manually create rules for. My issue is that each clause type cannot permit certain token types. I would like to create a rule within my Matcher pattern that permits all token types, except for particular tokens that I specify.

Simplified version of my current Matcher for Adjectival Clauses:

pattern = [{'TAG': ',', 'OP': '+'},
           {'DEP': 'det', 'OP': '*'},
           {'DEP': 'det', 'OP': '*'},
           {'DEP': 'amod', 'OP': '+'},
           {'OP': '*'},
           {'TAG': '.', 'OP': '+'}]

GOAL: Maintain the core structure of the pattern while being able to exclude "ROOT" dependencies, because the inclusion of "ROOT" dependency tokens creates false matches.

I have tried to add {'DEP': 'ROOT', 'OP': '!'} to create an exception for {'OP': '*'}. The resulting code looks like this:

pattern = [{'TAG': ',', 'OP': '+'},
           {'DEP': 'det', 'OP': '*'},
           {'DEP': 'det', 'OP': '*'},
           {'DEP': 'amod', 'OP': '+'},
           {'OP': '*'},
           {'DEP': 'ROOT', 'OP': '!'},
           {'TAG': '.', 'OP': '+'}]

I expected the matcher to initially parse the unwanted token and accept it in the Matcher, then reject it once it hit the {'DEP': 'ROOT', 'OP': '!'} rule. The goal is to be able to parse the clause from sentence (1) and not parse sentence (2):

(1) "It has started a revolution, this merry band." (2) "And yes, this merry band isn’t all happy or all dudes."

As far as I'm aware, {'OP': '*'} is the only rule that will accept all tokens and {'DEP': 'ROOT', 'OP': '!'} is the only rule to negate tokens. I've tried to mix the order but that hasn't helped either.

If anyone knows of a way to utilize the {'OP': '*'} rule while also being able to restrict specific token types that would be greatly appreciated. Thank you!

Solution

Okay friends, I found the answer. Here are two solutions:

(1) If you know the beginning and end rules for your span, as well as its token length, you can use the 'NOT_IN' attribute within the Matcher to accept all possible tokens except the ones you choose to prohibit. The Matcher pattern below defines the beginning, end, and middle tokens. The beginning and end should be clear. The middle token can be anything other than the DEP, TAG, POS, etc. values you list. In this case, we want to match any single token except one that is an 'nsubj' or a 'PUNCT'.

pattern01 = [{'POS': 'SCONJ', 'OP': '+'},
             {'DEP':{'NOT_IN': ['nsubj']}, 'POS':{'NOT_IN': ['PUNCT']}},
             {'POS': 'VERB', 'OP': '+'}]

This pattern (1) matches an 'SCONJ' in the first token, (2) matches any token that is not an 'nsubj' or 'PUNCT' in the second token, and (3) matches any 'VERB' in the third token. But what if there is a variable number of tokens between your desired beginning and end token?

(2) In order to accept indefinitely long matches, while specifying the beginning and end tokens, we combine the {'OP': '*'} operator with 'NOT_IN'. We modify the above code as follows:

pattern01 = [{'POS': 'SCONJ', 'OP': '+'},
             {'OP': '*', 'DEP':{'NOT_IN': ['nsubj']}, 'POS':{'NOT_IN': ['PUNCT']}},
             {'POS': 'VERB', 'OP': '+'}]

This pattern (1) matches an 'SCONJ' in the first token, (2) matches an indefinite string of tokens so long as they are not 'nsubj' or 'PUNCT', and (3) matches any 'VERB' as the final token.

The 'OP': '*' tells the Matcher that this token pattern may match zero or more times. The 'NOT_IN' lists the attribute values a token must not have in order to be accepted by that pattern. If one of the listed token types appears between the beginning and end tokens, the Matcher will not match that span.
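For anyone who wants to try this without loading a trained model, here is a minimal sketch that compares the two patterns on hand-annotated docs. It assumes spaCy v3 (where Doc accepts pos/deps/heads and Matcher.add takes a list of patterns). The example phrases, their annotations, and the helper name find_spans are made up for illustration and are not from my real data.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()

# Hand-annotated docs (hypothetical examples), so no trained pipeline is needed.
# heads are absolute token indices; the ROOT token points to itself.
doc_ok = Doc(vocab,
             words=["while", "very", "slowly", "walking"],
             pos=["SCONJ", "ADV", "ADV", "VERB"],
             deps=["mark", "advmod", "advmod", "ROOT"],
             heads=[3, 2, 3, 3])
doc_blocked = Doc(vocab,
                  words=["because", "she", "quickly", "left"],
                  pos=["SCONJ", "PRON", "ADV", "VERB"],
                  deps=["mark", "nsubj", "advmod", "ROOT"],
                  heads=[3, 3, 3, 3])

# Middle token: anything that is neither an nsubj nor punctuation.
middle = {'DEP': {'NOT_IN': ['nsubj']}, 'POS': {'NOT_IN': ['PUNCT']}}

# Solution (1): exactly one middle token.
fixed_pattern = [{'POS': 'SCONJ', 'OP': '+'},
                 middle,
                 {'POS': 'VERB', 'OP': '+'}]

# Solution (2): zero or more middle tokens.
open_pattern = [{'POS': 'SCONJ', 'OP': '+'},
                {'OP': '*', **middle},
                {'POS': 'VERB', 'OP': '+'}]


def find_spans(pattern, doc):
    """Return the sorted, de-duplicated (start, end) offsets matched by pattern."""
    matcher = Matcher(doc.vocab)
    matcher.add("CLAUSE", [pattern])
    return sorted({(start, end) for _, start, end in matcher(doc)})


fixed_on_ok = find_spans(fixed_pattern, doc_ok)
open_on_ok = find_spans(open_pattern, doc_ok)
open_on_blocked = find_spans(open_pattern, doc_blocked)

for start, end in open_on_ok:
    print(doc_ok[start:end].text)
```

The fixed-length pattern finds nothing in the first doc because two tokens sit between the SCONJ and the VERB, while the '*' version spans all four tokens. The second doc yields no match with either pattern, since the 'nsubj' token sits directly after the SCONJ and cannot be consumed by the middle rule.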

Best of luck everyone!