pos-tagger, named-entity-recognition

Can POS tagging or NER differentiate between the same word being used in multiple contexts?


I have a collection of over 1 million bodies of text. Within those bodies are multiple entities whose names mimic common stop words and phrases.

This has created issues when tokenizing the data, as there are ~50 entities with the same problem. To counteract this, I've excluded the matching stop words from stop-word removal. This works, but ideally I'd have a way to tell when a token is actually meant as a stop word versus an entity, since I only care about the cases where it's used as an entity.

Here's a sample excerpt:

A determined somebody slept. Prior to this, A could never be comfortable with the idea of responsibility. It was foreign, something heard about through a story passed down by words of U. As slow as it could be, A began to find meaning in the words of a story.

A and U are entities/nouns in most of their usages here. POS tagging so far has only labelled A as a determiner, and NER won't tag any instances of the word. Adding the target tokens to the NER list results in every instance being tagged as an entity, which is not correct either.

So far I've primarily used the Stanford POS Tagger, and spaCy for NER.


Solution

  • I think you should try to train your own NER model.
    You can do this in three steps, as follows:

    1. Label a number of documents in your corpus. You can do this using the spacy-annotator.
    2. Train your spaCy NER model from scratch. You can follow the instructions in the spaCy docs.
    3. Use the trained model to predict entities in your corpus.

    If you label enough entities in step 1, the model should learn from context to differentiate between a determiner and an entity.
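
The three steps above could look roughly like the sketch below. It assumes the spaCy v3 training API (`spacy.blank`, `Example.from_dict`, `nlp.initialize`, `nlp.update`). The bracket markup, the `parse_marked` helper, the `CHARACTER` label, and the training sentences are all hypothetical stand-ins for real annotations exported from a labelling tool; the sentences are only loosely modelled on the excerpt in the question. Four sentences are far too few to train a useful model, so this only illustrates the mechanics.

```python
import random
import re

import spacy
from spacy.training import Example

LABEL = "CHARACTER"  # hypothetical label name


def parse_marked(marked, label=LABEL):
    """Convert hypothetical '[A]' markup into (text, {'entities': [(start, end, label)]})."""
    text, entities, last = "", [], 0
    for m in re.finditer(r"\[([^\]]+)\]", marked):
        text += marked[last:m.start()]
        start = len(text)
        text += m.group(1)
        entities.append((start, len(text), label))
        last = m.end()
    text += marked[last:]
    return text, {"entities": entities}


# Step 1: labelled data. Only name-like uses are bracketed; the determiner
# use in the last sentence is deliberately left unlabelled.
MARKED = [
    "Prior to this, [A] could never be comfortable with the idea of responsibility.",
    "It was something heard about through a story passed down by [U] long ago.",
    "As slow as it could be, [A] began to find meaning in the words of a story.",
    "A story was passed down through the years.",
]
TRAIN_DATA = [parse_marked(m) for m in MARKED]

# Step 2: train an NER component from scratch on a blank English pipeline.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label(LABEL)

# Character spans must line up with token boundaries; spaCy warns otherwise.
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
optimizer = nlp.initialize(lambda: examples)
for _ in range(20):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

# Step 3: predict entities on unseen text.
doc = nlp("After a while, A told a story to U again.")
print([(ent.text, ent.start_char, ent.label_) for ent in doc.ents])
```

With a realistic amount of labelled text, it is usually worth holding out some annotated documents to check how often the model confuses the determiner and entity uses before running it over the full corpus.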