Tags: deep-learning, nlp, pytorch, named-entity-recognition

NER tagging schema for non-contiguous tokens


The most common tagging scheme for NER is IOB. But this kind of tagging seems to be limited to cases where the tokens of an entity are contiguous.

So for instance,

Jane Smith is walking in the park would be tagged as: B-PER I-PER O O O O O

And here my PER entity is the concatenation of [Jane, Smith]

If we tweak the example:

Jane and James Smith are walking in the park

B-PER O B-PER I-PER O O O O O

Now the issue is that the entities we would get are [Jane] and [James, Smith], because IOB tagging gives us no way to link Jane to Smith.

Is there any tagging schema that would allow marking both [Jane, Smith] and [James, Smith] as entities?


Solution

  • First, about doing this without a new data format:

    There is a paper and an accompanying repo about doing this with TextAE:

    paper

    repo

    However, looking at their examples and yours, it seems you could improve on what they did by using dependency parsing. In the dependency parse of "Jane and James Smith are walking in the park", spaCy recognizes that Jane is conjoined with Smith. So after running entity extraction, you could add a dependency-parse step and then merge or edit your entities based on it, along the lines of the sketch below.
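
    A minimal sketch of that post-processing idea, assuming spaCy's en_core_web_sm model is installed; the exact dependency labels can vary between models and spaCy versions:

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Jane and James Smith are walking in the park")

    # Inspect the parse: "Smith" typically attaches to "Jane" via the "conj"
    # relation, and "James" attaches to "Smith" as a "compound" modifier.
    for token in doc:
        print(f"{token.text:10} {token.dep_:10} -> {token.head.text}")

    # Pairs of conjoined tokens that a post-processing step could use to
    # copy modifiers (like the shared surname) across coordinated entities.
    conjoined = [(tok.head.text, tok.text) for tok in doc if tok.dep_ == "conj"]
    print(conjoined)  # e.g. [("Jane", "Smith")], depending on the parse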

    Now, to answer the real question: I have seen multi-dimensional labels that work in the following way (assume you have a maximum of ten entities per sentence):

    # One row per possible entity; each row has one slot per token.
    empty = [0, 0, 0, 0, 0, 0, 0, 0, 0]  # "no entity" row used for padding

    tokens = ["Jane", "and", "James", "Smith", "are", "walking", "in", "the", "park"]
    labels = [
        [1, 0, 0, 1, 0, 0, 0, 0, 0],  # entity 1: [Jane, Smith]
        [0, 0, 1, 1, 0, 0, 0, 0, 0],  # entity 2: [James, Smith]
    ]
    # Pad to a fixed number of entity rows (10 here) so every sentence
    # yields a label matrix of the same shape.
    labels = labels + [empty] * (10 - len(labels))
    

    If you have more than one entity type, you can put the type's index in the row instead of just 1.
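
    For example, a short sketch with hypothetical type indices (assume 1 = PER and 2 = LOC; the numbering is just an illustration):

    # Hypothetical type indices: 1 = PER, 2 = LOC.
    tokens = ["Jane", "Smith", "visited", "Paris"]
    labels = [
        [1, 1, 0, 0],  # PER entity: [Jane, Smith]
        [0, 0, 0, 2],  # LOC entity: [Paris]
    ]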

    This format also works better with BERT, since BIO tags are a pain to keep aligned once tokens get split into BPE subwords.
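
    To illustrate that point, here is a hedged sketch using the Hugging Face tokenizers API (assuming a fast tokenizer such as bert-base-cased, which exposes word_ids()): each wordpiece simply inherits its word's column value, with no B-/I- repair needed.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    tokens = ["Jane", "and", "James", "Smith", "are", "walking", "in", "the", "park"]
    row = [1, 0, 0, 1, 0, 0, 0, 0, 0]  # one entity row from the format above

    enc = tokenizer(tokens, is_split_into_words=True)
    # word_ids() maps each wordpiece back to its source word (None for
    # special tokens like [CLS]/[SEP]), so each subword just copies its
    # word's value; no B-/I- bookkeeping is required.
    wp_row = [row[i] if i is not None else 0 for i in enc.word_ids()]
    print(enc.tokens())
    print(wp_row)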