Tags: spacy, similarity, named-entity-recognition

Is it possible to improve spaCy's similarity results with custom named entities?


I've found that spaCy's similarity does a decent job of comparing my documents using "en_core_web_lg" out-of-the-box.

I'd like to tighten up the relationships in some areas and thought adding custom NER labels to the model would help, but my results before and after show no improvement, even though I've been able to create a test set of custom entities.

Now I'm wondering, was my theory completely wrong, or could I simply be missing something in my pipeline?

If I'm wrong, what's the best approach to improve the results? It seems like some sort of custom labeling should help.

Here's an example of what I've tested so far:

# Note: this example uses the spaCy v2 API (in v2, GoldParse is imported from spacy.gold)
import spacy
from spacy.pipeline import EntityRuler
from spacy.tokens import Doc
from spacy.gold import GoldParse

nlp = spacy.load("en_core_web_lg")

docA = nlp("Add fractions with like denominators.")
docB = nlp("What does one-third plus one-third equal?")

sim_before = docA.similarity(docB)
print(sim_before)

0.5949629181460099

^^ Not too shabby, but I'd like to see results closer to 0.85 in this example.
So, I use EntityRuler and add some patterns to try and tighten up the relationships:

ruler = EntityRuler(nlp)
patterns = [
    {"label": "ADDITION", "pattern": "Add"},
    {"label": "ADDITION", "pattern": "plus"},
    {"label": "FRACTION", "pattern": "one-third"},
    {"label": "FRACTION", "pattern": "fractions"},
    {"label": "FRACTION", "pattern": "denominators"},

]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler, before='ner')
print(nlp.pipe_names)

['tagger', 'parser', 'entity_ruler', 'ner']

Adding GoldParse seems to be important, so I added the following and updated NER:

# Gold annotations use BILUO tags ('O' = outside any entity, 'U-' = single-token entity)
doc1 = Doc(nlp.vocab, words=[u'What', u'does', u'one-third', u'plus', u'one-third', u'equal'])
gold1 = GoldParse(doc1, entities=[u'O', u'O', u'U-FRACTION', u'U-ADDITION', u'U-FRACTION', u'O'])

doc2 = Doc(nlp.vocab, words=[u'Add', u'fractions', u'with', u'like', u'denominators'])
gold2 = GoldParse(doc2, entities=[u'U-ADDITION', u'U-FRACTION', u'O', u'O', u'U-FRACTION'])

ner = nlp.get_pipe("ner")
losses = {}
optimizer = nlp.begin_training()
ner.update([doc1, doc2], [gold1, gold2], losses=losses, sgd=optimizer)

{'ner': 0.0}

You can see my custom entities are working, but the test results show zero improvement:

test1 = nlp("Add fractions with like denominators.")
test2 = nlp("What does one-third plus one-third equal?")

print([(ent.text, ent.label_) for ent in test1.ents])
print([(ent.text, ent.label_) for ent in test2.ents])

sim = test1.similarity(test2)
print(sim)

[('Add', 'ADDITION'), ('fractions', 'FRACTION'), ('denominators', 'FRACTION')]
[('one-third', 'FRACTION'), ('plus', 'ADDITION'), ('one-third', 'FRACTION')]
0.5949629181460099

Any tips would be greatly appreciated!


Solution

  • I found my solution nestled in this tutorial: Text Classification in Python Using spaCy, which builds a bag-of-words (BoW) matrix from spaCy's text data using scikit-learn's CountVectorizer.

    I avoided sentiment-analysis tutorials because they focus on binary classification, since I need support for multiple categories. The trick was to set multi_class='auto' on the LogisticRegression linear model and to use average='micro' on the precision and recall scores, so all of my text data, including entities, was leveraged (a minimal end-to-end sketch of the pipeline follows at the end of this answer):

    classifier = LogisticRegression(solver='lbfgs', multi_class='auto')
    

    and...

    print("Logistic Regression Accuracy:",metrics.accuracy_score(y_test, predicted))
    print("Logistic Regression Precision:",metrics.precision_score(y_test, predicted,average='micro'))
    print("Logistic Regression Recall:",metrics.recall_score(y_test, predicted,average='micro'))
    

    Hope this helps save someone some time!
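
    Putting the pieces together, here's a minimal, self-contained sketch of how that pipeline could look, assuming a spaCy-based tokenizer feeding scikit-learn's CountVectorizer and the LogisticRegression settings above. It is not the tutorial's exact code; the spacy_tokenizer helper and the tiny question/label dataset are hypothetical stand-ins, so the printed scores aren't meaningful:

    import spacy
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn import metrics

    nlp = spacy.load("en_core_web_lg")

    def spacy_tokenizer(text):
        # Tokenize and lemmatize with spaCy, dropping stop words and punctuation.
        return [tok.lemma_.lower() for tok in nlp(text)
                if not tok.is_stop and not tok.is_punct]

    # Hypothetical toy data: short math questions labelled by topic.
    texts = [
        "Add fractions with like denominators.",
        "What does one-third plus one-third equal?",
        "Multiply both sides of the equation by two.",
        "Solve for x in the equation 2x + 3 = 7.",
    ]
    labels = ["fractions", "fractions", "equations", "equations"]

    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.5, random_state=42, stratify=labels)

    # Bag-of-words matrix built from the spaCy tokenizer.
    vectorizer = CountVectorizer(tokenizer=spacy_tokenizer)
    X_train_bow = vectorizer.fit_transform(X_train)
    X_test_bow = vectorizer.transform(X_test)

    classifier = LogisticRegression(solver='lbfgs', multi_class='auto')
    classifier.fit(X_train_bow, y_train)
    predicted = classifier.predict(X_test_bow)

    print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, predicted))
    print("Logistic Regression Precision:", metrics.precision_score(y_test, predicted, average='micro'))
    print("Logistic Regression Recall:", metrics.recall_score(y_test, predicted, average='micro'))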