python python-3.x text spacy string-matching

Matching double punctuation with spaCy's Matcher


I am using spaCy's Matcher to detect certain words. When I want to find a word with a single punctuation mark, like -, it works:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Match "nice", one punctuation token, then "word"
pattern = [{"LOWER": "nice"}, {"IS_PUNCT": True}, {"LOWER": "word"}]
matcher.add("nice-word", [pattern])

doc = nlp("This is a nice-word also? Why is this a nice word")

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # the pattern name, e.g. "nice-word"
    span = doc[start:end]  # the matched span of tokens
    print(match_id, string_id, start, end, span.text)

Output:

1899655961849619838 nice-word 3 6 nice-word

This works great! But when a word contains two hyphens, I can't get it to work. For example, I would like to match nice-word-also. Here is some reproducible code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Intended to match nice-word-also
pattern = [{"LOWER": "nice"}, {"IS_PUNCT": True}, {"LOWER": "word"}, {"LOWER": "also"}]
matcher.add("nice-word-also", [pattern])

doc = nlp("This is a nice-word-also? Why is this a nice word")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end] 
    print(match_id, string_id, start, end, span.text)

This doesn't return anything. Does anyone know how to use the spaCy Matcher to detect words with double punctuation, like the example above?
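
For reference, this is how the document is tokenized: nice-word-also is split into five separate tokens (nice, -, word, -, also), so each hyphen is its own token as far as the matcher is concerned. A quick check:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a nice-word-also? Why is this a nice word")
# Print the individual tokens the Matcher will see
print([token.text for token in doc])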


Solution

  • You are missing a second {"IS_PUNCT": True} in your pattern, one for each hyphen:

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # One {"IS_PUNCT": True} token for each hyphen in nice-word-also
    pattern = [{"LOWER": "nice"}, {"IS_PUNCT": True}, {"LOWER": "word"}, {"IS_PUNCT": True}, {"LOWER": "also"}]
    matcher.add("nice-word-also", [pattern])
    
    doc = nlp("This is a nice-word-also? Why is this a nice word")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  
        span = doc[start:end] 
        print(match_id, string_id, start, end, span.text)
    
    # Output:
    # 9732713127922352434 nice-word-also 3 8 nice-word-also
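
  • If you don't want to hard-code the exact number of punctuation tokens, the Matcher also accepts the "OP" key. Here is a minimal sketch (not part of the original answer) using "OP": "*", so each punctuation slot matches zero or more tokens and the same pattern covers nice word also, nice-word also and nice-word-also; the example sentence is extended with a trailing "also" only to show the un-hyphenated variant matching as well:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # "OP": "*" lets each punctuation slot match zero or more tokens
    pattern = [
        {"LOWER": "nice"},
        {"IS_PUNCT": True, "OP": "*"},
        {"LOWER": "word"},
        {"IS_PUNCT": True, "OP": "*"},
        {"LOWER": "also"},
    ]
    matcher.add("nice-word-also", [pattern])

    # Sentence extended with a final "also" to exercise the un-hyphenated variant
    doc = nlp("This is a nice-word-also? Why is this a nice word also")
    for match_id, start, end in matcher(doc):
        print(nlp.vocab.strings[match_id], doc[start:end].text)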