I'm trying to write a pattern that tags a whole word as a unit based on a substring. Here's an example:
terms = [{'ent': "UNIT",
'patterns':[
[{'lemma':'liter'}]]}]
text = "There were 46 kiloliters of juice available"
I want to tag 'kiloliters' as UNIT based on this pattern. I tried using 'lemma', but it doesn't work in this case.
You haven't said which model you're using, so I'll use en_core_web_sm:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp("There were 46 kiloliters of juice available")
The first thing to note is that none of these tokens has an ent_type of UNIT:
for tok in doc:
print(f"'{tok}': ent_type: '{tok.ent_type_}', lemma: '{tok.lemma_}'")
'There': ent_type: '', lemma: 'there'
'were': ent_type: '', lemma: 'be'
'46': ent_type: 'CARDINAL', lemma: '46'
'kiloliters': ent_type: '', lemma: 'kiloliter'
'of': ent_type: '', lemma: 'of'
'juice': ent_type: '', lemma: 'juice'
'available': ent_type: '', lemma: 'available'
Also, as you can see, the lemma of kiloliters is kiloliter, not liter, so a pattern on the lemma 'liter' never matches it. This is a bit annoying, as you don't want to have to specify milliliters, liters, etc. separately. One alternative is to look for a CARDINAL token (which also covers numbers written as words, e.g. "two liters") followed by a token that matches a regex:
doc = nlp("""
There were 46 kiloliters of juice available.
I could not drink more than two liters a day.
I would only give a child 500 milliliters.
"""
)
pattern = [{'ENT_TYPE': 'CARDINAL'},
{"TEXT": {"REGEX": "^.*(liter)s?$"}}]
matcher.add("unit", [pattern])
matches = matcher(doc, as_spans=True)
for span in matches:
print(span[-1].text)
Output:
kiloliters
liters
milliliters
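If the goal is for the token itself to carry a UNIT label, one option is to wrap each match in a Span and add it to doc.ents. The following is only a minimal sketch under a few assumptions of mine: it uses spacy.blank("en") so that no trained model is needed, it drops the CARDINAL condition (a blank pipeline has no entity recognizer), and it matches on LOWER rather than TEXT so that capitalized forms are caught as well. The label name "UNIT" is taken from the question.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span

# Assumption: a blank English pipeline. The pattern only looks at token text,
# so no trained model is required for this sketch.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Any single token whose lowercased text ends in "liter" or "liters".
matcher.add("UNIT", [[{"LOWER": {"REGEX": r"liters?$"}}]])

doc = nlp("There were 46 kiloliters of juice available")

# Wrap each match in a Span labelled UNIT and append the spans to the entities.
unit_spans = [Span(doc, start, end, label="UNIT") for _, start, end in matcher(doc)]
doc.ents = list(doc.ents) + unit_spans

units = [(ent.text, ent.label_) for ent in doc.ents]
print(units)
```

With a trained model such as en_core_web_sm, doc.ents already contains entities, and assigning overlapping spans raises an error, so you may need to filter out overlaps before the assignment.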