
Create an unknown label for spaCy when returning list of text and label


I'm trying to write a conditional for a function that returns the text and label for each item in a passed list. Here's the code:

def get_label(text: list):
    doc = nlp('. '.join(text) + '.')
    keywords = []
    for ent in doc.ents:
        keywords.append((ent.text, ent.label_))
    return keywords

The input is:

['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']

The output is:

[('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

However, the output should include the entities that were not labelled, assigning them the "UNKNOWN" label like this:

[('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('YouTube', 'UNKNOWN'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]

I've tried using:

for token in doc.sents:
    keywords.append((token.text, token.label_))

Which returns:

[('Kaggle.', ''), ('Google.', ''), ('San Francisco.', ''), ('this week.', ''), ('as early as tomorrow.', ''), ('Kag-ingle.', ''), ('about half a million.', ''), ('Ben Hamner. 2010.', ''), ('Earlier this month.', ''), ('YouTube.', ''), ('Google Cloud Platform.', ''), ('Crunchbase.', ''), ('$12.5 to $13 million.', ''), ('Index Ventures.', ''), ('SV Angel.', ''), ('Hal Varian.', ''), ('Khosla Ventures.', ''), ('Yuri Milner.', '')]

This is (I assume) because there is a period at the end of each token, which prevents any label from being returned.

If anyone has an idea of how I can fix this, I'd really appreciate the help.


Solution

  • Iterate over the items passed in and check whether they match one of the returned entities after spaCy has performed the labelling (see solution below).

    Notes:

    • The output labels vary depending on the spaCy version and pipeline/pipeline version being used. I used spaCy 3.5.3 and the en_core_web_trf==3.5.0 pipeline to produce the following results.
    • spaCy returned "Ben Hamner." (with the trailing period) as the labelled entity for "Ben Hamner", hence the extra condition in the if statement to handle this edge case.
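
    The edge case in the second note can be isolated as a small predicate. This is only a sketch: the helper name matches_entity is hypothetical (it is not part of spaCy), and it works on plain strings so it can be checked without loading a pipeline.

```python
def matches_entity(item: str, ent_text: str) -> bool:
    """Return True if ent_text equals item, optionally followed by one period.

    Hypothetical helper for illustration; ent_text stands in for ent.text.
    """
    if item == ent_text:
        return True
    # Tolerate exactly one trailing period picked up from the ". " separator.
    return ent_text.endswith(".") and item == ent_text[:-1]


print(matches_entity("Ben Hamner", "Ben Hamner."))        # True
print(matches_entity("Ben Hamner", "Ben Hamner"))         # True
print(matches_entity("Google", "Google Cloud Platform"))  # False
```

    Using str.endswith instead of indexing ent_text[-1] also avoids an IndexError if an empty string is ever passed in.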

    Solution

    import spacy
    
    txt = ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']
    
    nlp = spacy.load("en_core_web_trf")
    
    
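    def get_label(text: list):
        # Join the items into one period-separated string so spaCy processes them as a single document.
        doc = nlp(". ".join(text) + ".")
        keywords = []
        for item in text:
            found_label = False
            for ent in doc.ents:
                # Match the item exactly, or allow a trailing period on the entity text.
                if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                    found_label = True
                    keywords.append((item, ent.label_))
                    break
            if not found_label:
                # No entity matched this item, so fall back to the placeholder label.
                keywords.append((item, "UNKNOWN"))
        return keywords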
    
    
    for kw in get_label(txt):
        print(kw)
    

    Output:

    ('Kaggle', 'UNKNOWN')
    ('Google', 'ORG')
    ('San Francisco', 'GPE')
    ('this week', 'DATE')
    ('as early as tomorrow', 'DATE')
    ('Kag-ingle', 'UNKNOWN')
    ('about half a million', 'CARDINAL')
    ('Ben Hamner', 'PERSON')
    ('2010', 'DATE')
    ('Earlier this month', 'DATE')
    ('YouTube', 'ORG')
    ('Google Cloud Platform', 'UNKNOWN')
    ('Crunchbase', 'ORG')
    ('$12.5 to $13 million', 'MONEY')
    ('Index Ventures', 'ORG')
    ('SV Angel', 'UNKNOWN')
    ('Hal Varian', 'PERSON')
    ('Khosla Ventures', 'ORG')
    ('Yuri Milner', 'PERSON')
    

    A (possibly premature) optimization of the get_label function, which may be faster when the spaCy pipeline returns a very large document (i.e. a very large tuple of labelled entities in doc.ents). Each matched entity is removed from the list so that later searches have fewer entries to scan. I'll leave it up to you to time the difference and see whether it's worth using this variation in your end application:

    def get_label(text: list):
        doc = nlp(". ".join(text) + ".")
        ents = list(doc.ents)
        keywords = []
        for item in text:
            found_label = False
            for idx, ent in enumerate(ents):
                if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                    found_label = True
                    keywords.append((item, ent.label_))
                    ents.pop(idx)  # reduce size of list to make subsequent searches faster
                    break
            if not found_label:
                keywords.append((item, "UNKNOWN"))
        return keywords
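
    If the entity list is large, another option is to build a dictionary from entity text to label once and then look each item up directly, instead of scanning the list for every item. The following is a minimal sketch under stated assumptions: the function name label_items is hypothetical, and it takes plain (text, label) pairs, which with the code above could be built as [(ent.text, ent.label_) for ent in doc.ents]. The sample pairs are illustrative only, and I have not timed this against the versions above.

```python
def label_items(items, ent_pairs, default="UNKNOWN"):
    """Label each item from (entity_text, entity_label) pairs.

    Hypothetical helper for illustration; ent_pairs stands in for
    [(ent.text, ent.label_) for ent in doc.ents].
    """
    lookup = {}
    for ent_text, ent_label in ent_pairs:
        # setdefault keeps the first label seen for a key, mirroring the
        # "first match wins" behaviour of the break in the loops above.
        lookup.setdefault(ent_text, ent_label)
        if ent_text.endswith("."):
            # Also register the text without its single trailing period.
            lookup.setdefault(ent_text[:-1], ent_label)
    return [(item, lookup.get(item, default)) for item in items]


# Illustrative pairs, not real pipeline output.
pairs = [("Google", "ORG"), ("Ben Hamner.", "PERSON"), ("2010", "DATE")]
for kw in label_items(["Kaggle", "Google", "Ben Hamner", "2010"], pairs):
    print(kw)
```

    Unlike the ents.pop variation, a matched entity stays in the dictionary, so an item that appears twice in the input gets the same label both times.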