Tags: nlp, tokenize, transformer-model, named-entity-recognition, huggingface-transformers

How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?


I've been looking to use Hugging Face's Pipelines for NER (named entity recognition). However, it returns the entity labels in inside-outside-beginning (IOB) format but without the IOB labels, so I'm not able to map the output of the pipeline back to my original text. Moreover, the output is split into BERT's WordPiece subword tokens (the default model is BERT-large).

For example:

from transformers import pipeline
nlp_bert_lg = pipeline('ner')
print(nlp_bert_lg('Hugging Face is a French company based in New York.'))

The output is:

[{'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
{'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
{'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'}]

As you can see, New York is broken up into two tags.

How can I map Hugging Face's NER Pipeline back to my original text?

Transformers version: 2.7


Solution

  • EDIT 12/2023: As pointed out, the grouped_entities parameter has been deprecated. The correct way is to use the aggregation_strategy parameter, as documented in the source code. For instance:

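    from transformers import pipeline
    import pandas as pd

    text = 'Hugging Face is a French company based in New York.'
    tagger = pipeline(task='ner', aggregation_strategy='simple')
    named_ents = tagger(text)  # list of dicts, one per merged entity
    pd.DataFrame(named_ents)   # optional: view the entities as a table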
    

    This gives the following output (the contents of named_ents):

    [
       {
          "entity_group":"ORG",
          "score":0.96934015,
          "word":"Hugging Face",
          "start":0,
          "end":12
       },
       {
          "entity_group":"MISC",
          "score":0.9981816,
          "word":"French",
          "start":18,
          "end":24
       },
       {
          "entity_group":"LOC",
          "score":0.9982121,
          "word":"New York",
          "start":42,
          "end":50
       }
    ]
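
Because each grouped entity carries start and end character offsets, it can be mapped back to the original string by slicing. The snippet below is a minimal sketch of that idea. It reuses the output shown above as literal data, so it runs without downloading a model. The helper name entity_spans is hypothetical and not part of transformers.

```python
# Entities copied from the pipeline output shown above.
text = 'Hugging Face is a French company based in New York.'
named_ents = [
    {"entity_group": "ORG", "score": 0.96934015, "word": "Hugging Face", "start": 0, "end": 12},
    {"entity_group": "MISC", "score": 0.9981816, "word": "French", "start": 18, "end": 24},
    {"entity_group": "LOC", "score": 0.9982121, "word": "New York", "start": 42, "end": 50},
]


def entity_spans(text, entities):
    """Hypothetical helper: return (label, surface_text) pairs.

    The surface text is sliced from the original string using the
    'start' and 'end' character offsets of each entity.
    """
    return [(ent["entity_group"], text[ent["start"]:ent["end"]]) for ent in entities]


spans = entity_spans(text, named_ents)
for label, surface in spans:
    print(f"{label}: {surface}")
```

Slicing by offsets, rather than relying on the word field, returns the text exactly as it appears in the input, which is useful if the tokenizer alters casing or spacing.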
    

    ORIGINAL ANSWER: On the 17th of May, pull request https://github.com/huggingface/transformers/pull/3957, which adds what you are asking for, was merged, so this is now much easier. You can use it in the pipeline like this:

    ner = pipeline('ner', grouped_entities=True)
    

    and your output will be as expected. At the time of writing, you have to install from the master branch since there is no new release yet. You can do it via:

    pip install git+https://github.com/huggingface/transformers.git@48c3a70b4eaedab1dd9ad49990cfaa4d6cb8f6a0
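
If upgrading is not an option (the question mentions version 2.7), the token-level output can be merged by hand. The following is only a rough sketch, not how the library does it. It glues '##' continuation pieces onto the previous word and joins consecutive tokens that share a label. It assumes that neighbouring list items with the same label belong to the same entity. That assumption fails when two separate entities of the same type are divided only by untagged words, because the output in the question carries no positions. The function name merge_tokens is hypothetical.

```python
# Token-level output copied from the question (no grouping applied).
tokens = [
    {'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
    {'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
    {'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
    {'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
    {'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
    {'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'},
]


def merge_tokens(tokens):
    """Hypothetical helper: naively merge WordPiece tokens into entities.

    Assumption: consecutive items with the same label form one entity.
    """
    merged = []
    for tok in tokens:
        label = tok['entity'].split('-', 1)[-1]  # 'I-ORG' -> 'ORG'
        word = tok['word']
        is_continuation = word.startswith('##')
        piece = word[2:] if is_continuation else word
        if is_continuation and merged:
            # Subword piece: append to the previous word without a space.
            merged[-1]['word'] += piece
        elif merged and merged[-1]['label'] == label:
            # New word with the same label: extend the current entity.
            merged[-1]['word'] += ' ' + piece
        else:
            # Different label (or first token): start a new entity.
            merged.append({'label': label, 'word': piece})
    return merged


merged = merge_tokens(tokens)
for ent in merged:
    print(ent['label'], ent['word'])
```

For the example sentence this yields one entity each for Hugging Face, French, and New York. For anything beyond a quick workaround, the built-in aggregation is the safer choice because it has access to token positions.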