python regex nlp spacy named-entity-recognition

Complex Regex not working in Spacy entity ruler


I'm trying to identify entities by passing a regular expression (regex) to a spaCy model through the entity ruler, but spaCy does not recognize any entities with the regex below.

However, I tested the regex here and it works.

import model_training
import spacy

nlp = spacy.load('en_core_web_trf')
nlp.add_pipe("spacytextblob")

nlp = model_training.train_model_with_regex(nlp)

model_training.py

def train_model_with_regex(nlp):
    ruler = nlp.add_pipe("entity_ruler", before="ner")
    patterns = [
        {
            "label": "VOLUME",
            "pattern": [{"LOWER": {'REGEX': "(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+"}}]
        }
    ]

    ruler.add_patterns(patterns)
    return nlp

This is what I want to achieve for the example below:

text = "I have spent 5 million to buy house and 70 thousand for the furniture"

expected output:

{'result': [
    {'label': 'VOLUME', 'text': '5 million'},
    {'label': 'VOLUME', 'text': '70 thousand'}
]}

Solution

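  • The problem is that your pattern is meant to match at least two tokens, while the REGEX operator is applied to a single token at a time. A token such as 5 or million never contains the whitespace that \s requires, so the pattern cannot match.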

    A solution could look like this:

    "pattern": [
        {"TEXT": {"REGEX": r"^\d+(?:[,.]\d+)*$"}},
        {"TEXT": {"REGEX": r"^(?:million|hundred|thousand|billion)s?$"}}
    ]
    

    The LIKE_NUM token attribute is defined in the spaCy source code mostly as a string of digits once all dots and commas are removed, so the ^\d+(?:[,.]\d+)*$ pattern looks good enough. It matches a token that starts with one or more digits, followed by zero or more groups of a comma or dot and one or more digits, up to the end of the token.
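
To see the single-token behaviour without loading a model, here is a minimal sketch that uses only Python's re module. It is a simulation, not spaCy's matcher: text.split() stands in for the tokenizer (an assumption that happens to hold for this sentence), and find_volumes is a hypothetical helper that checks each pair of consecutive tokens against the two regexes from the pattern above.

```python
import re

# Token-level regexes taken from the two-token pattern above.
NUMBER = re.compile(r"^\d+(?:[,.]\d+)*$")
UNIT = re.compile(r"^(?:million|hundred|thousand|billion)s?$")

# The regex from the question, which expects whitespace inside the match.
ORIGINAL = re.compile(r"(?:\d+\s(?:million|hundred|thousand|billion)*\s*)+")


def find_volumes(tokens):
    """Return 'number unit' spans formed by two consecutive tokens.

    Hypothetical helper: it mimics a two-token pattern, one regex per token.
    """
    spans = []
    for first, second in zip(tokens, tokens[1:]):
        if NUMBER.search(first) and UNIT.search(second):
            spans.append(f"{first} {second}")
    return spans


text = "I have spent 5 million to buy house and 70 thousand for the furniture"
tokens = text.split()  # assumption: approximates the tokenizer for this text

# Apply the original regex to each lowercased token, one token at a time.
# No token contains whitespace, so nothing can satisfy the \s in the regex.
single_token_hits = [t for t in tokens if ORIGINAL.search(t.lower())]

print(single_token_hits)
print(find_volumes(tokens))
```

Run as written, the first list is empty and the second contains '5 million' and '70 thousand'. In spaCy itself the two dictionaries from the pattern above play the role of NUMBER and UNIT, one per token.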