Python spaCy pattern: How to tag a word based on another word?


I'm trying to write a pattern that tags a whole word as a unit based on one substring. Here's an example:

terms = [{'ent': "UNIT",
         'patterns':[
            [{'lemma':'liter'}]]}]

text = "There were 46 kiloliters of juice available"

I want to tag 'kiloliters' as UNIT based on this pattern. I tried using 'lemma', but it doesn't work in this case.


Solution

  • You haven't said which model you're using, so I'll use en_core_web_sm:

    import spacy
    from spacy.matcher import Matcher
    
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    doc = nlp("There were 46 kiloliters of juice available")
    

    The first thing to note is that none of these tokens has an ent_type of UNIT:

    for tok in doc:
        print(f"'{tok}': ent_type: '{tok.ent_type_}', lemma: '{tok.lemma_}'")
    
    'There': ent_type: '', lemma: 'there'
    'were': ent_type: '', lemma: 'be'
    '46': ent_type: 'CARDINAL', lemma: '46'
    'kiloliters': ent_type: '', lemma: 'kiloliter'
    'of': ent_type: '', lemma: 'of'
    'juice': ent_type: '', lemma: 'juice'
    'available': ent_type: '', lemma: 'available'
    

    Also, as you can see, the lemma of kiloliters is kiloliter, not liter, which is why the lemma pattern doesn't match. This is a bit annoying, as you don't want to have to specify milliliters, liters, etc. separately. One alternative is to look for a CARDINAL token (which also covers number words, e.g. "two" in "two liters") followed by a token whose text matches a regex:

    doc = nlp("""
              There were 46 kiloliters of juice available.
              I could not drink more than two liters a day.
              I would only give a child 500 milliliters.
              """
    )
    pattern = [{'ENT_TYPE': 'CARDINAL'},
               {"TEXT": {"REGEX": "^.*(liter)s?$"}}]
    
    matcher.add("unit", [pattern])
    
    matches = matcher(doc, as_spans=True)
    for span in matches:
        print(span[-1].text)
    

    Output:

    kiloliters
    liters
    milliliters
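
    To see what the regex in the pattern above accepts without loading a spaCy model, here is a minimal sketch using only Python's re module. It assumes the REGEX operator behaves like re.search on the token text. The helper unit_tokens and its naive whitespace tokenization are hypothetical stand-ins for spaCy's tokenizer, not part of the spaCy API:

```python
import re

# Same regex as in the Matcher pattern above. Assumption: spaCy applies it
# with re.search, so the ^ and $ anchors force a match on the whole token.
UNIT_RE = re.compile(r"^.*(liter)s?$")


def unit_tokens(text):
    """Return the tokens accepted by UNIT_RE (hypothetical helper).

    Splitting on whitespace and stripping trailing punctuation is only a
    rough stand-in for spaCy's tokenizer.
    """
    tokens = [tok.strip(".,") for tok in text.split()]
    return [tok for tok in tokens if UNIT_RE.search(tok)]


found = unit_tokens(
    "There were 46 kiloliters of juice. I drink two liters, not 500 milliliters."
)
print(found)
```

    Note that the leading .* accepts any prefix, and a TEXT match is case-sensitive, so "Liters" at the start of a sentence would be skipped. If that matters, listing the prefixes explicitly or matching on LOWER instead of TEXT may be a safer choice.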