I'm trying to write a conditional statement for a function that returns the text and label for each item in a passed list. Here's the code:
def get_label(text: list):
    doc = nlp('. '.join(text) + '.')
    keywords = []
    for ent in doc.ents:
        keywords.append((ent.text, ent.label_))
    return keywords
The input is:
['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']
The output is:
[('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]
However, the output should include the entities that were not labelled, assigning them the "UNKNOWN" label like this:
[('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE'), ('this week', 'DATE'), ('as early as tomorrow', 'DATE'), ('Kag-ingle', 'PERSON'), ('about half a million', 'CARDINAL'), ('Ben Hamner', 'PERSON'), ('2010', 'DATE'), ('Earlier this month', 'DATE'), ('YouTube', 'UNKNOWN'), ('Google Cloud Platform', 'ORG'), ('Crunchbase', 'ORG'), ('$12.5 to $13 million', 'MONEY'), ('Index Ventures', 'ORG'), ('Hal Varian', 'PERSON'), ('Khosla Ventures', 'ORG'), ('Yuri Milner', 'PERSON')]
I've tried using:
for token in doc.sents:
    keywords.append((token.text, token.label_))
Which returns:
[('Kaggle.', ''), ('Google.', ''), ('San Francisco.', ''), ('this week.', ''), ('as early as tomorrow.', ''), ('Kag-ingle.', ''), ('about half a million.', ''), ('Ben Hamner. 2010.', ''), ('Earlier this month.', ''), ('YouTube.', ''), ('Google Cloud Platform.', ''), ('Crunchbase.', ''), ('$12.5 to $13 million.', ''), ('Index Ventures.', ''), ('SV Angel.', ''), ('Hal Varian.', ''), ('Khosla Ventures.', ''), ('Yuri Milner.', '')]
I assume this is because the period at the end of each token prevents any label from being returned.
If anyone has an idea of how I can fix this, I'd really appreciate the help.
Iterate over the items passed in and check whether they match one of the returned entities after spaCy has performed the labelling (see solution below).
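Notes:
- I used the en_core_web_trf==3.5.0 pipeline to produce the following results.
- spaCy can include the trailing period in an entity's text, so the function uses an if statement to check for these edge cases.
import spacy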
txt = ['Kaggle', 'Google', 'San Francisco', 'this week', 'as early as tomorrow', 'Kag-ingle', 'about half a million', 'Ben Hamner', '2010', 'Earlier this month', 'YouTube', 'Google Cloud Platform', 'Crunchbase', '$12.5 to $13 million', 'Index Ventures', 'SV Angel', 'Hal Varian', 'Khosla Ventures', 'Yuri Milner']
nlp = spacy.load("en_core_web_trf")
def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    keywords = []
    for item in text:
        found_label = False
        for ent in doc.ents:
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords

for kw in get_label(txt):
    print(kw)
Output:
('Kaggle', 'UNKNOWN')
('Google', 'ORG')
('San Francisco', 'GPE')
('this week', 'DATE')
('as early as tomorrow', 'DATE')
('Kag-ingle', 'UNKNOWN')
('about half a million', 'CARDINAL')
('Ben Hamner', 'PERSON')
('2010', 'DATE')
('Earlier this month', 'DATE')
('YouTube', 'ORG')
('Google Cloud Platform', 'UNKNOWN')
('Crunchbase', 'ORG')
('$12.5 to $13 million', 'MONEY')
('Index Ventures', 'ORG')
('SV Angel', 'UNKNOWN')
('Hal Varian', 'PERSON')
('Khosla Ventures', 'ORG')
('Yuri Milner', 'PERSON')
Some premature optimization for the get_label function, which may be faster when the spaCy pipeline returns a very large document (i.e. a very large tuple of labelled entities in doc.ents). I'll leave it up to you to time the difference and see whether it's worth using this variation in your end application:
def get_label(text: list):
    doc = nlp(". ".join(text) + ".")
    ents = list(doc.ents)
    keywords = []
    for item in text:
        found_label = False
        for idx, ent in enumerate(ents):
            if item == ent.text or (ent.text[-1] == "." and item == ent.text[:-1]):
                found_label = True
                keywords.append((item, ent.label_))
                ents.pop(idx)  # reduce size of list to make subsequent searches faster
                break
        if not found_label:
            keywords.append((item, "UNKNOWN"))
    return keywords
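If the nested loop ever becomes a bottleneck, another option is to build a dictionary of the entity texts once and look each item up in it. The sketch below is only an illustration: label_items is a hypothetical helper name (not part of spaCy), and it takes plain (text, label) tuples so it can run without a model; in practice you would pass something like [(ent.text, ent.label_) for ent in doc.ents]. The entity tuples in the usage lines are hand-written stand-ins, not real pipeline output.

```python
def label_items(items, entities):
    """Pair each item with an entity label, or "UNKNOWN" if nothing matched.

    `entities` is an iterable of (text, label) tuples; with spaCy this could be
    built as [(ent.text, ent.label_) for ent in doc.ents] (an assumption here).
    """
    labels = {}
    for ent_text, ent_label in entities:
        # setdefault keeps the first label seen for a given text,
        # mirroring the loop-and-break behaviour of the versions above.
        labels.setdefault(ent_text, ent_label)
        if ent_text.endswith("."):
            # Also register the text without its trailing period,
            # so an entity "Google." still matches the item "Google".
            labels.setdefault(ent_text[:-1], ent_label)
    return [(item, labels.get(item, "UNKNOWN")) for item in items]


# Hand-written stand-in for doc.ents, for illustration only.
stand_in_ents = [("Google.", "ORG"), ("San Francisco", "GPE")]
print(label_items(["Kaggle", "Google", "San Francisco"], stand_in_ents))
# [('Kaggle', 'UNKNOWN'), ('Google', 'ORG'), ('San Francisco', 'GPE')]
```

Unlike the pop-based variation, this does not consume entities as they are matched, so two identical items in the input both receive the same label. Whether it is actually faster for your data is something to measure rather than assume.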