python nlp spacy matcher pos-tagger

Accessing an out-of-range word in a spaCy Doc: why does it work?


I'm learning spaCy and am playing with Matchers.

I have:

  • a very basic sentence ("white shepherd dog")
  • a matcher object, searching for a pattern ("white shepherd")
  • a print to show the match, and the word and POS before that match

I wanted to see how to handle the index-out-of-range exception I expected to get, since there's nothing before the match. Instead, the code runs and returns 'dog', which comes after the match, and now I'm confused.

It looks like spaCy uses a circular list (or a deque, I think)?

This needs a language model to run. If you'd like to reproduce it, you can install the model with the following command:

python -m spacy download en_core_web_md

And this is the code:

import spacy
from spacy.matcher import Matcher 

# Loading language model
nlp = spacy.load("en_core_web_md")

# Initialising with shared vocab
matcher = Matcher(nlp.vocab)

# Adding the pattern to match (spaCy v2 signature; spaCy v3 expects matcher.add("DOG", [pattern]))
matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}])  # searching for white shepherd
doc = nlp("white shepherd dog")

for match_id, start, end in matcher(doc):
    span = doc[start:end]  
    print("Matched span: ", span.text)   
    # Get previous token and its POS
    print("Previous token: ", doc[start - 1].text, doc[start - 1].pos_) # I would expect the error here

I get the following:

>>> Matched span:  white shepherd
>>> Previous token:  dog PROPN

Can someone explain what's going on?

Thanks!


Solution

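  • The match starts at index 0, so doc[start - 1] is doc[0 - 1], which evaluates to doc[-1]. As with Python lists, a negative index on a Doc counts from the end, so -1 refers to the last token, 'dog'. This is ordinary negative indexing rather than a circular list.

    I recommend using the Token.nbor method to get the token just before the span. It raises an IndexError when there is no such token, so you can catch that and fall back to None or an empty string.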

    import spacy
    from spacy.matcher import Matcher 
    
    # Loading language model
    nlp = spacy.load("en_core_web_md")
    
    # Initialising with shared vocab
    matcher = Matcher(nlp.vocab)
    
    # Adding the pattern to match (spaCy v2 signature; spaCy v3 expects matcher.add("DOG", [pattern]))
    matcher.add("DOG", None, [{"LOWER": "white"}, {"LOWER": "shepherd"}])  # searching for white shepherd
    doc = nlp("white shepherd dog")
    
    for match_id, start, end in matcher(doc):
        span = doc[start:end]
        print("Matched span: ", span.text)
        try:
            # nbor(-1) raises IndexError when the span starts at the first token
            nbor_tok = span[0].nbor(-1)
            print("Previous token:", nbor_tok.text, nbor_tok.pos_)
        except IndexError:
            nbor_tok = None
            print("Previous token: None None")
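
If you would rather avoid the try/except, an explicit bounds check does the same job. The sketch below is only an illustration: previous_item is a hypothetical helper (not part of spaCy), and a plain list of strings stands in for the Doc so that it runs without a language model. It first reproduces the wrap-around from the question, then shows the guarded lookup.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def previous_item(seq: Sequence[T], start: int) -> Optional[T]:
    """Return the item just before position `start`, or None if there is none.

    Hypothetical helper for illustration; it is not a spaCy function.
    """
    if start <= 0:
        # Nothing precedes the first position, so do not index with -1.
        return None
    return seq[start - 1]


tokens = ["white", "shepherd", "dog"]  # stand-in for the tokens of the Doc
start = 0                              # where the match begins

print(tokens[start - 1])             # dog  (index -1 is the last item)
print(previous_item(tokens, start))  # None (no item before the match)
print(previous_item(tokens, 2))      # shepherd
```

Inside the matcher loop, the same helper should work with the Doc itself, assuming only that the Doc supports integer indexing: prev = previous_item(doc, start), then read prev.text and prev.pos_ when prev is not None.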