I'm trying to identify entities by passing a regular expression (regex) to a spaCy model through the EntityRuler, but spaCy does not identify anything with the regex below.
However, I tested the regex here and it works.
import model_training
import spacy
nlp = spacy.load('en_core_web_trf')
nlp.add_pipe("spacytextblob")
nlp = model_training.train_model_with_regex(nlp)
model_training.py
def train_model_with_regex(nlp):
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    patterns = [
        {
            "label": "VOLUME",
            "pattern": [{"LOWER": {'REGEX': "(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+"}}]
        }
    ]
    ruler.add_patterns(patterns)
    return nlp
This is what I want to achieve for the example below:
text = "I have spent 5 million to buy house and 70 thousand for the furniture"
expected output:
{'result': [
{'label': 'VOLUME', 'text': '5 million'},
{'label': 'VOLUME', 'text': '70 thousand'}
]}
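The problem is that your pattern is meant to span at least two tokens, while the REGEX operator is applied to a single token at a time. No single token here contains both the digits and the whitespace that your regex requires, so it never matches.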
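To see the difference, here is a minimal sketch using only Python's re module. It assumes that splitting on whitespace approximates spaCy's tokens for this particular sentence, and that the REGEX operator behaves like a search against each token's text; the variable names are just for illustration.

    ```python
    import re

    # The regex from the question, written as a raw string.
    question_regex = re.compile(r"(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+")

    text = "I have spent 5 million to buy house and 70 thousand for the furniture"

    # Assumption: whitespace splitting stands in for spaCy's tokenizer here.
    tokens = text.lower().split()

    # Token by token, nothing matches: no token contains digits plus whitespace.
    per_token_hits = [tok for tok in tokens if question_regex.search(tok)]

    # Against the whole string (as in an online regex tester), it does match.
    whole_text_hits = [m.group().strip() for m in question_regex.finditer(text.lower())]

    print(per_token_hits)
    print(whole_text_hits)
    ```

This is why the regex works in a tester but not inside a token pattern.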
A solution can look like this:
"pattern": [
    {"TEXT": {"REGEX": r"^\d+(?:[,.]\d+)*$"}},
    {"TEXT": {"REGEX": r"^(?:million|hundred|thousand|billion)s?$"}}
]
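As a rough illustration of how the two-token pattern lines up with the expected output, the sketch below emulates it with plain re on adjacent whitespace-separated tokens. It is not spaCy's matcher: find_volumes is a hypothetical helper, and whitespace splitting is assumed to be close enough to spaCy's tokenization for this sentence.

    ```python
    import re

    # One regex per token, mirroring the two dicts in the pattern.
    number_re = re.compile(r"^\d+(?:[,.]\d+)*$")
    scale_re = re.compile(r"^(?:million|hundred|thousand|billion)s?$")


    def find_volumes(text):
        """Hypothetical helper: pair each token with the next one and test both."""
        tokens = text.split()  # assumption: stands in for spaCy tokens
        results = []
        for first, second in zip(tokens, tokens[1:]):
            if number_re.search(first) and scale_re.search(second):
                results.append({"label": "VOLUME", "text": f"{first} {second}"})
        return {"result": results}


    text = "I have spent 5 million to buy house and 70 thousand for the furniture"
    print(find_volumes(text))
    ```

Note that TEXT is case-sensitive, so "5 Million" would not match; switching both keys to LOWER should make it case-insensitive.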
The LIKE_NUM attribute is defined in the spaCy source code mostly as a string of digits with all dots and commas removed, so the ^\d+(?:[,.]\d+)*$ pattern looks good enough. It matches a token that starts with one or more digits, followed by zero or more groups of a comma or dot plus one or more digits, up to the end of the token.
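A quick way to check what the number regex accepts is to run it over a few sample strings. The samples below are made up for illustration:

    ```python
    import re

    number_re = re.compile(r"^\d+(?:[,.]\d+)*$")

    # Made-up sample tokens: the first five are numeric, the rest are not.
    samples = ["5", "70", "1,000", "3.5", "1,000,000.25", "5,", ".5", "5m", "ten"]

    matched = [s for s in samples if number_re.match(s)]
    rejected = [s for s in samples if not number_re.match(s)]

    print(matched)
    print(rejected)
    ```

Unlike this regex, LIKE_NUM in the English language data also appears to accept number words such as "ten", so the two are not exact equivalents.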