I have a dataset that I need to use for named entity recognition. I want to convert the dataset, which is JSON, to IOB format, but I have an issue: the dataset contains entities nested inside other entities. For example:
Sentence:
Traitement des manifestations neurologiques progressives des patients adultes et des enfants atteints de maladie de Niemann-Pick type C
Entities:
"manifestations neurologiques progressives des patients atteints de maladie de Niemann-Pick type C" and "adultes" and "enfants"
How should I encode the larger entity when it has other entities nested inside it?
I thought about:
word | tag |
---|---|
Traitement | O |
des | O |
manifestations | B-Cible |
neurologiques | I-Cible |
progressives | I-Cible |
des | I-Cible |
patients | I-Cible |
adultes | B-Caracteristique_du_sujet |
et | O |
des | O |
enfants | B-Caracteristique_du_sujet |
atteints | I-Cible |
de | I-Cible |
maladie | I-Cible |
de | I-Cible |
Niemann-Pick | I-Cible |
type | I-Cible |
C | I-Cible |
But I'm not sure this is correct, or that an algorithm could make sense of it.
You can either take just the outermost layer and use that as NER training data, or you can use all your labels to create spans (including nested spans) and train a span categorizer. You might also want to look at the spancat example project.
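For the first option, here is a minimal sketch of keeping only the outermost layer and writing it out as IOB tags. It is plain Python, not a spaCy API: `outermost_iob` is a hypothetical helper name, and the token-index spans are my assumption about how your JSON maps onto the sentence (I treat the "Cible" entity as one contiguous span from "manifestations" to "C").

```python
def outermost_iob(n_tokens, spans):
    """Return IOB tags that keep only the outermost spans.

    spans: list of (start, end, label) token offsets, end exclusive.
    """
    tags = ["O"] * n_tokens
    # Place longer spans first so nested (shorter) spans get skipped.
    for start, end, label in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        if any(tag != "O" for tag in tags[start:end]):
            continue  # overlaps a span that was already placed
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


words = ("Traitement des manifestations neurologiques progressives des patients "
         "adultes et des enfants atteints de maladie de Niemann-Pick type C").split()

# Assumed token offsets for the three annotated entities.
spans = [
    (2, 18, "Cible"),
    (7, 8, "Caracteristique_du_sujet"),
    (10, 11, "Caracteristique_du_sujet"),
]

tags = outermost_iob(len(words), spans)
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

With this approach "adultes" and "enfants" are tagged I-Cible and their own labels are dropped, which is the cost of staying in IOB.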
You can't represent nested spans in IOB format, so if you go the span categorizer route, you'll need to create the Doc objects manually, with the spans saved in Doc.spans.
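As a rough sketch of what that could look like for your sentence: the token offsets are again my assumption about your annotations, the tokens are supplied by hand rather than produced by a tokenizer, and "sc" is used as the spans key because it is the span categorizer's default.

```python
import spacy
from spacy.tokens import Doc, DocBin, Span

nlp = spacy.blank("fr")

# Pre-tokenized words; in practice these would come from your JSON.
words = ("Traitement des manifestations neurologiques progressives des patients "
         "adultes et des enfants atteints de maladie de Niemann-Pick type C").split()
doc = Doc(nlp.vocab, words=words)

# Token offsets are end-exclusive. Overlapping and nested spans are
# allowed in doc.spans, unlike in doc.ents.
doc.spans["sc"] = [
    Span(doc, 2, 18, label="Cible"),
    Span(doc, 7, 8, label="Caracteristique_du_sujet"),
    Span(doc, 10, 11, label="Caracteristique_du_sujet"),
]

for span in doc.spans["sc"]:
    print(span.label_, "|", span.text)

# Collect the docs in a DocBin; call doc_bin.to_disk(...) to save
# them as training data.
doc_bin = DocBin(docs=[doc])
```

You would build one Doc like this per example in your dataset and add them all to the DocBin.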