I'm learning spaCy and am playing with Matchers.
I just wanted to check how to handle the index-out-of-range exception I was expecting to get, because there's nothing before the match. I didn't expect it to work, but it did, and it returns 'dog', which comes after the match... and now I'm confused.
It looks like spaCy uses a circular list (or a deque, I think)?
This needs a language model to run; if you'd like to reproduce it, you can install the model with the following command line:
python -m spacy download en_core_web_md
And this is the code:
import spacy
from spacy.matcher import Matcher
# Load the language model
nlp = spacy.load("en_core_web_md")
# Initialise the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the match pattern (searching for "white shepherd")
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}])
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    # Get the previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_)  # I would expect the error here
I get the following:
>>> Matched span: white shepherd
>>> Previous token: dog PROPN
Can someone explain what's going on?
Thanks!
You are looking for the token at index 0 - 1, which evaluates to -1. A Doc supports negative indexing just like a regular Python list, so doc[-1] is the last token of the document; there is no circular list or deque involved, just standard Python indexing.
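You can convince yourself of this with a minimal sketch (a blank pipeline is enough here, since indexing doesn't need any statistical model):

import spacy

nlp = spacy.blank("en")  # tokenizer only; no model download required
doc = nlp("white shepherd dog")
start = 0                   # the match starts at the first token
print(doc[start - 1].text)  # 'dog' -- start - 1 == -1, i.e. the last token
print(doc[-1].text)         # 'dog' -- the same lookup written explicitly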
I recommend using the Token.nbor method to look up the token just before the span, and falling back to None or an empty string if no previous token exists (nbor raises an IndexError instead of wrapping around).
import spacy
from spacy.matcher import Matcher
# Load the language model
nlp = spacy.load("en_core_web_md")
# Initialise the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the match pattern (searching for "white shepherd")
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}])
doc = nlp("white shepherd dog")
for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    try:
        # nbor(-1) raises IndexError rather than wrapping around
        nbor_tok = span[0].nbor(-1)
        print("Previous token:", nbor_tok, nbor_tok.pos_)
    except IndexError:
        nbor_tok = ''  # no previous token to report
        print("Previous token: None None")
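If you'd rather avoid the exception entirely, you can also bounds-check the start offset yourself, since the match boundaries are plain integers; a sketch of the same loop (reusing the nlp, matcher and doc from above):

for match_id, start, end in matcher(doc):
    span = doc[start:end]
    print("Matched span: ", span.text)
    if start > 0:
        prev_tok = doc[start - 1]  # safe: start - 1 is a valid, non-negative index
        print("Previous token:", prev_tok.text, prev_tok.pos_)
    else:
        print("Previous token: None None")  # match starts at the first token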