python nlp spacy named-entity-recognition

Create an unknown label for spaCy when returning list of text and label

I'm trying to write a function that returns the text and label for each item in a list passed to it. Here's the code:

def get_label(text: list):
    doc = nlp('. '.join(text) + '.')
    keywords = []
    for ent in doc.ents:
        keywords.append((ent.text, ent.label_))
    return keywords

The input is:

['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

The output is:

[('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

However, the output should include the entities that were not labelled, assigning them the "UNKNOWN" label like this:

[('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('YouTube', 'UNKNOWN'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

I've tried using:

for token in doc.sents:
    keywords.append((token.text, token.label_))

Which returns:

[('Kaggle.', ''), ('Google.', ''), ('San Francisco.', ''), ('this week.', ''), ('as early as tomorrow.', ''), ('Kag-ingle.', ''), ('about half a million.', ''), ('Ben Hamner. 2010.', ''), ('Earlier this month.', ''), ('YouTube.', ''), ('Google Cloud Platform.', ''), ('Crunchbase.', ''), ('$12.5 to $13 million.', ''), ('Index Ventures.', ''), ('SV Angel.', ''), ('Hal Varian.', ''), ('Khosla Ventures.', ''), ('Yuri Milner.', '')]

I assume this is because the period at the end of each item prevents any label from being returned.

If anyone has an idea of how I can fix this, I'd really appreciate the help.

Solution

Iterate over the items passed in and check whether they match one of the returned entities after spaCy has performed the labelling (see solution below).

Notes:

The output labels vary depending on the spaCy version and pipeline/pipeline version being used. I used spaCy 3.5.3 and the en_core_web_trf==3.5.0 pipeline to produce the following results.
spaCy returned "Ben Hamner" with the trailing period included ("Ben Hamner.") as the labelled entity, hence the extra condition in the if statement to catch such edge cases.

Solution

import spacy

txt = ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

nlp = spacy.load("en_core_web_trf")


def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    keywords = []
    for item in text:
        found_label = False
        for ent in doc.ents:
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords


for kw in get_label(txt):
    print(kw)

Output:

('Kaggle', 'UNKNOWN')
('Google', 'ORG')
('San Francisco', 'GPE')
('this week', 'DATE')
('as early as tomorrow', 'DATE')
('Kag-ingle', 'UNKNOWN')
('about half a million', 'CARDINAL')
('Ben Hamner', 'PERSON')
('2010', 'DATE')
('Earlier this month', 'DATE')
('YouTube', 'ORG')
('Google Cloud Platform', 'UNKNOWN')
('Crunchbase', 'ORG')
('$12.5 to $13 million', 'MONEY')
('Index Ventures', 'ORG')
('SV Angel', 'UNKNOWN')
('Hal Varian', 'PERSON')
('Khosla Ventures', 'ORG')
('Yuri Milner', 'PERSON')

Here is a (possibly premature) optimization of the get_label function, which may be faster when the spaCy pipeline returns very large documents (i.e. a very large tuple of labelled entities in doc.ents). I'll leave it up to you to time the difference and see whether it's worth using this variation in your end application:

def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    ents = list(doc.ents)
    keywords = []
    for item in text:
        found_label = False
        for idx, ent in enumerate(ents):
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                ents.pop(idx)  # reduce size of list to make subsequent searches faster
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords
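
If the nested loop ever turns out to be a bottleneck, another option is to build a dictionary from the entities once and look each item up in it. The following is only a minimal sketch of that idea, not something I have timed: label_items is a hypothetical helper name, and it takes plain (text, label) pairs rather than a spaCy Doc so that it can be tried without loading a pipeline. The pairs in the demo are hand-written stand-ins, not real spaCy output.

```python
from typing import Iterable, List, Tuple


def label_items(items: List[str], ents: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Pair each item with its entity label, or "UNKNOWN" if no entity matches.

    `ents` is assumed to be an iterable of (entity text, entity label) pairs.
    """
    lookup = {}
    for ent_text, ent_label in ents:
        # Register the entity text as-is; keep the first label seen for a text,
        # mirroring the loop-and-break behaviour of the versions above.
        lookup.setdefault(ent_text, ent_label)
        # Also register it without a trailing period, so that an entity such as
        # "Ben Hamner." still matches the item "Ben Hamner".
        if ent_text.endswith("."):
            lookup.setdefault(ent_text[:-1], ent_label)
    return [(item, lookup.get(item, "UNKNOWN")) for item in items]


# Hand-written stand-in pairs for illustration only (not real spaCy output).
demo_ents = [("Google", "ORG"), ("Ben Hamner.", "PERSON")]
for kw in label_items(["Kaggle", "Google", "Ben Hamner"], demo_ents):
    print(kw)
```

With spaCy, the pairs would come from the document, e.g. label_items(text, [(ent.text, ent.label_) for ent in doc.ents]). Note that, unlike the ents.pop(idx) variation, a repeated item in the input gets the same label every time it appears.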