nlp, huggingface-transformers, named-entity-recognition

How do I extract full entity names from a Hugging Face model without IOB tags?


I am using a model from Hugging Face, specifically Davlan/distilbert-base-multilingual-cased-ner-hrl. However, I am not able to extract full entity names from the result.

If I run the following code:

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
model = AutoModelForTokenClassification.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)

example = "My name is Johnathan Smith and I work at Apple"
ner_results = nlp(example, aggregation_strategy="max")
print(ner_results)

Then I get output:

[{'entity': 'B-PER', 'score': 0.9998949, 'index': 4, 'word': 'Johna', 'start': 11, 'end': 16}, {'entity': 'I-PER', 'score': 0.999726, 'index': 5, 'word': '##tha', 'start': 16, 'end': 19}, {'entity': 'I-PER', 'score': 0.9997751, 'index': 6, 'word': '##n', 'start': 19, 'end': 20}, {'entity': 'I-PER', 'score': 0.99974835, 'index': 7, 'word': 'Smith', 'start': 21, 'end': 26}, {'entity': 'B-ORG', 'score': 0.99870986, 'index': 12, 'word': 'Apple', 'start': 41, 'end': 46}]

It looks like I could post-process this so that Johnathan Smith comes out as a single entity, but ideally I would like this done for me, with no partial words in the output.


Solution

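  • There is a bug in the code: aggregation_strategy is passed in the wrong place. It should be an argument to pipeline(...) when the pipeline is created, not to nlp(...) when it is called. The code should read: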

    from transformers import AutoTokenizer, AutoModelForTokenClassification
    from transformers import pipeline
    
    tokenizer = AutoTokenizer.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
    model = AutoModelForTokenClassification.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
    nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="max")
    
    example = "My name is Johnathan Smith and I work at Apple"
    ner_results = nlp(example)
    print(ner_results)
    

    Which gives:

    [{'entity_group': 'PER', 'score': 0.99982166, 'word': 'Johnathan Smith', 'start': 11, 'end': 26}, {'entity_group': 'ORG', 'score': 0.99870986, 'word': 'Apple', 'start': 41, 'end': 46}]
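
If you are stuck with the token-level output from the question (for example, results that were already saved), the post-processing mentioned there can be sketched by hand. This is only a rough illustration, not what the pipeline does internally: merge_wordpieces is a hypothetical helper name, it assumes the tokens arrive in text order with the 'entity', 'word', 'start', 'end' and 'score' keys shown above, and the merged score is a plain mean of the token scores, so it will not match the score the pipeline reports.

```python
def merge_wordpieces(text, tokens):
    """Merge token-level NER results into whole entities (hypothetical helper)."""
    groups = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        if not label:  # skip "O" or any tag without a B-/I- prefix
            continue
        last = groups[-1] if groups else None
        same_entity = (
            last is not None
            and last["entity_group"] == label
            and (prefix == "I" or tok["word"].startswith("##"))
            # only whitespace (or nothing) may separate the two tokens
            and text[last["end"]:tok["start"]].strip() == ""
        )
        if same_entity:
            last["end"] = tok["end"]
            last["scores"].append(tok["score"])
        else:
            groups.append({
                "entity_group": label,
                "start": tok["start"],
                "end": tok["end"],
                "scores": [tok["score"]],
            })
    # Recover the surface form from the original text via character offsets,
    # which avoids having to strip the "##" word-piece markers by hand.
    return [
        {
            "entity_group": g["entity_group"],
            "score": sum(g["scores"]) / len(g["scores"]),  # simplification: mean
            "word": text[g["start"]:g["end"]],
            "start": g["start"],
            "end": g["end"],
        }
        for g in groups
    ]


text = "My name is Johnathan Smith and I work at Apple"
tokens = [
    {"entity": "B-PER", "score": 0.9998949, "index": 4, "word": "Johna", "start": 11, "end": 16},
    {"entity": "I-PER", "score": 0.999726, "index": 5, "word": "##tha", "start": 16, "end": 19},
    {"entity": "I-PER", "score": 0.9997751, "index": 6, "word": "##n", "start": 19, "end": 20},
    {"entity": "I-PER", "score": 0.99974835, "index": 7, "word": "Smith", "start": 21, "end": 26},
    {"entity": "B-ORG", "score": 0.99870986, "index": 12, "word": "Apple", "start": 41, "end": 46},
]
merged = merge_wordpieces(text, tokens)
print([e["word"] for e in merged])  # ['Johnathan Smith', 'Apple']
```

Slicing the original string by start and end is what keeps the names intact here. Setting aggregation_strategy on the pipeline, as in the answer above, is still the simpler route whenever you can re-run the model.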