I am using spaCy to match a particular expression in some Italian text. The text can appear in multiple forms, and I am trying to work out the best way to write a general rule. I have the 4 cases below, and I would like to write a general pattern that works for all of them. Something like:
import spacy
from spacy.matcher import Matcher

# case 1
text = 'Superfici principali e secondarie: 90 mq'
# case 2
# text = 'Superfici principali e secondarie di 90 mq'
# case 3
# text = 'Superfici principali e secondarie circa 90 mq'
# case 4
# text = 'Superfici principali e secondarie di circa 90 mq'
nlp = spacy.load('it_core_news_sm')
doc = nlp(text)
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "superfici"}, {"LOWER": "principali"}, {"LOWER": "e"}, {"LOWER": "secondarie"}, << "some token here that allows max 3 tokens or a IS_PUNCT or nothing at all" >>, {"IS_DIGIT": True}, {"LOWER": "mq"}]
matcher.add("Superficie", [pattern])  # spaCy v3 signature; older v2 releases used matcher.add("Superficie", None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
You can add an optional {"IS_PUNCT": True, "OP": "?"} token followed by three optional IS_ALPHA tokens:
pattern = [
    {"LOWER": "superfici"},
    {"LOWER": "principali"},
    {"LOWER": "e"},
    {"LOWER": "secondarie"},
    {"IS_PUNCT": True, "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_DIGIT": True},
    {"LOWER": "mq"}
]
The "OP": "?" operator makes a token optional: it matches zero or one time, so the token may appear once or be absent.
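To check the pattern against all four cases from the question, something like the following sketch may help. It assumes spaCy v3 (hence the list form of matcher.add) and, to avoid downloading a model, swaps it_core_news_sm for a blank Italian pipeline, which should be enough because the pattern only relies on lexical attributes. The helper matched_spans is a hypothetical name introduced for illustration only.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank Italian pipeline (tokenizer only) replaces it_core_news_sm,
# since LOWER, IS_PUNCT, IS_ALPHA and IS_DIGIT need no trained components.
nlp = spacy.blank("it")
matcher = Matcher(nlp.vocab)

pattern = [
    {"LOWER": "superfici"},
    {"LOWER": "principali"},
    {"LOWER": "e"},
    {"LOWER": "secondarie"},
    {"IS_PUNCT": True, "OP": "?"},  # optional punctuation, e.g. ":"
    {"IS_ALPHA": True, "OP": "?"},  # up to three optional words
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_ALPHA": True, "OP": "?"},
    {"IS_DIGIT": True},
    {"LOWER": "mq"},
]
matcher.add("Superficie", [pattern])  # spaCy v3 signature

texts = [
    "Superfici principali e secondarie: 90 mq",
    "Superfici principali e secondarie di 90 mq",
    "Superfici principali e secondarie circa 90 mq",
    "Superfici principali e secondarie di circa 90 mq",
]


def matched_spans(text):
    """Return the set of matched span texts for one input string (hypothetical helper)."""
    doc = nlp(text)
    return {doc[start:end].text for _, start, end in matcher(doc)}


for text in texts:
    print(text, "->", matched_spans(text))
```

Note that the three optional IS_ALPHA tokens accept any words, not just "di" and "circa"; if that is too permissive, the optional tokens could be restricted, for example with {"LOWER": {"IN": ["di", "circa"]}, "OP": "?"}.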