How can I make spaCy matches case-insensitive?


How can I make spaCy case-insensitive?

Is there a code snippet I should add? With the code below, I can't get entities whose case differs from the case used in my patterns.

import spacy
import pandas as pd

# Disable the statistical NER so that only the rule-based entity ruler sets entities
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler")


flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])



result = {}
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    result[ent.label_] = ent.text
df = pd.DataFrame([result])
print(df)

Solution

  • As long as it's okay to use LOWER for all patterns, you can keep using phrase patterns and add the phrase_matcher_attr option to the entity ruler. Then you don't have to worry about tokenizing the phrases, and if you have a lot of patterns to match, it will also be faster than using token patterns:

    import spacy
    
    nlp = spacy.load('en_core_web_sm', disable=['ner'])
    ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
    
    flowers = ["rose", "tulip", "african daisy"]
    for f in flowers:
        ruler.add_patterns([{"label": "flower", "pattern": f}])
    animals = ["cat", "dog", "artic fox"]
    for a in animals:
        ruler.add_patterns([{"label": "animal", "pattern": a}])
    
    doc = nlp("CAT and Artic fox, plant african daisy")
    for ent in doc.ents:
        print(ent, ent.label_)
    

    Output:

    CAT animal
    Artic fox animal
    african daisy flower
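
For comparison, here is a rough sketch of the token-pattern alternative mentioned above, which may be useful if only some patterns should ignore case. It makes a few assumptions that are not part of the original answer: it uses `spacy.blank("en")` instead of `en_core_web_sm` (the entity ruler only needs a tokenizer), and `lower_token_pattern` is a hypothetical helper that splits each phrase on whitespace. That split only lines up with spaCy's tokenization for simple phrases like these; phrases containing punctuation or hyphens would need their tokens written out by hand.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough for the ruler;
# the original answer loads en_core_web_sm with NER disabled instead.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")


def lower_token_pattern(phrase):
    """Hypothetical helper: build one {"LOWER": ...} token spec per word."""
    return [{"LOWER": word} for word in phrase.lower().split()]


terms = {
    "flower": ["rose", "tulip", "african daisy"],
    "animal": ["cat", "dog", "artic fox"],
}

# Build every pattern first, then register them with a single add_patterns call.
patterns = [
    {"label": label, "pattern": lower_token_pattern(phrase)}
    for label, phrases in terms.items()
    for phrase in phrases
]
ruler.add_patterns(patterns)

doc = nlp("CAT and Artic fox, plant african daisy")
matches = [(ent.text, ent.label_) for ent in doc.ents]
print(matches)
```

With these patterns the same three entities should be found as in the output above. The trade-off is the one already described: you have to tokenize each phrase yourself, and matching is likely to be slower than phrase patterns when the pattern list is large.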