python nlp spacy

spaCy Matcher unable to identify the pattern beyond the first token


I can't figure out where my pattern goes wrong to produce this outcome.

The sentence I want to find: "#1 – January 31, 2015", and any date that follows this format.

The pattern: pattern1=[{'ORTH':'#'},{'is_digital':True},{'is_space':True},{'ORTH':'-'},{'is_space':True},{'is_alpha':True},{'is_space':True},{'is_digital':True},{'is_punct':True},{'is_space':True},{'is_digital':True}]

The print code: print("Matches1:", [doc[start:end].text for match_id, start, end in matches1])

The result: ['#', '#', '#']

Expected result: ['#1 – January 31, 2015','#5 – March 15, 2017','#177 – November 22, 2019']


Solution

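  • spaCy's Matcher operates on tokens, and the single spaces between words do not produce tokens, so the 'is_space' entries in the pattern never match. The token attribute for digits is 'IS_DIGIT', not 'is_digital'. Also, several characters resemble a hyphen (en dash, em dash, minus sign, and so on), so be careful which one you use: the sentence contains an en dash (–), while the pattern looks for a hyphen-minus (-). The following code works: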

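    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load('en_core_web_lg')

    # One dict per token: '#', number, en dash (U+2013), month, day, comma, year.
    # Whitespace is not a token, so the pattern has no IS_SPACE entries.
    pattern1 = [{'ORTH': '#'}, {'IS_DIGIT': True}, {'ORTH': '–'}, {'IS_ALPHA': True},
                {'IS_DIGIT': True}, {'IS_PUNCT': True}, {'IS_DIGIT': True}]

    doc = nlp("#1 – January 31, 2015")

    matcher = Matcher(nlp.vocab)
    # spaCy v3 signature; on spaCy v2 use matcher.add("p1", None, pattern1)
    matcher.add("p1", [pattern1])

    matches1 = matcher(doc)
    print("Matches1:", [doc[start:end].text for match_id, start, end in matches1])
    # Matches1: ['#1 – January 31, 2015']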
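
Because the hyphen lookalikes are hard to tell apart by eye, it can help to check which character the text actually contains before writing the 'ORTH' entry. The following is a minimal sketch using only the standard library's unicodedata module; the helper name describe_dashes is hypothetical and chosen just for illustration.

```python
import unicodedata

def describe_dashes(text):
    """Return (character, code point, Unicode name) for each dash-like character in text."""
    # Unicode category "Pd" covers dash punctuation (hyphen-minus, en dash, em dash, ...).
    # U+2212 MINUS SIGN is a math symbol (category "Sm"), so it is checked explicitly.
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for ch in text
        if unicodedata.category(ch) == "Pd" or ch == "\u2212"
    ]

# The sentence from the question uses an en dash (written here as an escape).
print(describe_dashes("#1 \u2013 January 31, 2015"))
# [('–', 'U+2013', 'EN DASH')]

# The original pattern looked for a plain hyphen-minus instead.
print(describe_dashes("#1 - January 31, 2015"))
# [('-', 'U+002D', 'HYPHEN-MINUS')]
```

If the input may contain more than one kind of dash, one option (assuming spaCy 2.1 or later, which supports the extended pattern syntax) is to write the dash token as {'ORTH': {'IN': ['-', '–']}} so that either character matches.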