python nlp nltk spacy named-entity-recognition

IOB encoding when an entity is nested inside another entity


I have a dataset on which I need to do named entity recognition. I want to convert the dataset, which is in JSON, to IOB format, but I have an issue: the dataset contains entities nested inside other entities. For example:

Sentence:

Traitement des manifestations neurologiques progressives des patients adultes et des enfants atteints de maladie de Niemann-Pick type C

Entities:

"manifestations neurologiques progressives des patients atteints de maladie de Niemann-Pick type C", "adultes", and "enfants"

How should I encode the larger entity, which has other entities nested inside it?

I thought of something like this:

word tag
Traitement O
des O
manifestations B-Cible
neurologiques I-Cible
progressives I-Cible
des I-Cible
patients I-Cible
adultes B-Caracteristique_du_sujet
et O
des O
enfants B-Caracteristique_du_sujet
atteints I-Cible
de I-Cible
maladie I-Cible
de I-Cible
Niemann-Pick I-Cible
type I-Cible
C I-Cible

But I'm not sure this is correct, or that an algorithm could make sense of it.


Solution

  • You can either take just the outermost layer and use that as NER training data, or you can use all your labels to create spans (including nested spans) and train a span categorizer. You might also want to look at the spancat example project.

    You can't represent nested spans in IOB format, so if you go the span categorizer route, you'll need to manually create Doc objects with the spans saved in Doc.spans.
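The first option (keep only the outermost layer, then emit IOB tags) can be sketched in plain Python as below. This is a minimal sketch rather than spaCy's own conversion: the helper names `outermost_spans` and `to_iob` are hypothetical, spans are assumed to be `(start_token, end_token_exclusive, label)` triples over whitespace tokens, and the large "Cible" entity is assumed to be one contiguous span running from "manifestations" to "C".

```python
from typing import List, Tuple

# (start token index, end token index (exclusive), label) -- an assumed layout
Span = Tuple[int, int, str]


def outermost_spans(spans: List[Span]) -> List[Span]:
    """Drop every span that lies entirely inside an already kept span."""
    kept: List[Span] = []
    # Sort by start, longest first, so a container is seen before its contents.
    for start, end, label in sorted(spans, key=lambda s: (s[0], -(s[1] - s[0]))):
        if any(k_start <= start and end <= k_end for k_start, k_end, _ in kept):
            continue  # nested inside a span we already kept
        kept.append((start, end, label))
    return kept


def to_iob(n_tokens: int, spans: List[Span]) -> List[str]:
    """Turn non-overlapping token spans into one IOB tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = (
    "Traitement des manifestations neurologiques progressives des patients "
    "adultes et des enfants atteints de maladie de Niemann-Pick type C"
).split()

spans = [
    (2, 18, "Cible"),                      # "manifestations ... type C"
    (7, 8, "Caracteristique_du_sujet"),    # "adultes"
    (10, 11, "Caracteristique_du_sujet"),  # "enfants"
]

outer = outermost_spans(spans)
tags = to_iob(len(tokens), outer)

for token, tag in zip(tokens, tags):
    print(token, tag)
```

With this input only the "Cible" span survives, so the two nested "Caracteristique_du_sujet" labels are lost; that is the trade-off of the outermost-only route. Spans that partially overlap without being nested are not handled by this sketch. For the span categorizer route, the full list of triples (nested ones included) is what would be stored in Doc.spans instead.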