python python-3.x text spacy string-matching

Matching double punctuation with spaCy's Matcher


I am using spaCy's Matcher to detect certain words. When I want to find a word with a single punctuation mark, like -, it works:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Match "nice", one punctuation token, then "word"
pattern = [{"LOWER": "nice"}, {"IS_PUNCT": True}, {"LOWER": "word"}]
matcher.add("nice-word", [pattern])

doc = nlp("This is a nice-word also? Why is this a nice word")

matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # the pattern name, e.g. "nice-word"
    span = doc[start:end]  # the matched span of tokens
    print(match_id, string_id, start, end, span.text)

Output:

1899655961849619838 nice-word 3 6 nice-word

This works great! But when a word contains two hyphens, I can't get it to work. For example, I would like to match nice-word-also. Here is some reproducible code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# Intended to match nice-word-also
pattern = [{"LOWER": "nice"}, {"IS_PUNCT": True}, {"LOWER": "word"}, {"LOWER": "also"}]
matcher.add("nice-word-also", [pattern])

doc = nlp("This is a nice-word-also? Why is this a nice word")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  
    span = doc[start:end] 
    print(match_id, string_id, start, end, span.text)

This doesn't return anything. Does anyone know how to use the spaCy Matcher to detect words with double punctuation, like the example above?
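
For reference, this is how the document is tokenized: nice-word-also is split into five separate tokens (nice, -, word, -, also), so each hyphen is its own token as far as the matcher is concerned. A quick check:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a nice-word-also? Why is this a nice word")
# Print the individual tokens the Matcher will see
print([token.text for token in doc])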


Solution

  • You are missing a second {"IS_PUNCT": True} in your pattern, one for each hyphen:

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # One {"IS_PUNCT": True} token for each hyphen in nice-word-also
    pattern = [{"LOWER": "nice"}, {"IS_PUNCT": True}, {"LOWER": "word"}, {"IS_PUNCT": True}, {"LOWER": "also"}]
    matcher.add("nice-word-also", [pattern])
    
    doc = nlp("This is a nice-word-also? Why is this a nice word")
    matches = matcher(doc)
    for match_id, start, end in matches:
        string_id = nlp.vocab.strings[match_id]  
        span = doc[start:end] 
        print(match_id, string_id, start, end, span.text)
    
    # Output:
    # 9732713127922352434 nice-word-also 3 8 nice-word-also
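
  • If you don't want to hard-code the exact number of punctuation tokens, the Matcher also accepts the "OP" key. Here is a minimal sketch (not part of the original answer) using "OP": "*", so each punctuation slot matches zero or more tokens and the same pattern covers nice word also, nice-word also and nice-word-also; the example sentence is extended with a trailing "also" only to show the un-hyphenated variant matching as well:

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)

    # "OP": "*" lets each punctuation slot match zero or more tokens
    pattern = [
        {"LOWER": "nice"},
        {"IS_PUNCT": True, "OP": "*"},
        {"LOWER": "word"},
        {"IS_PUNCT": True, "OP": "*"},
        {"LOWER": "also"},
    ]
    matcher.add("nice-word-also", [pattern])

    # Sentence extended with a final "also" to exercise the un-hyphenated variant
    doc = nlp("This is a nice-word-also? Why is this a nice word also")
    for match_id, start, end in matcher(doc):
        print(nlp.vocab.strings[match_id], doc[start:end].text)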