I am a NLP novice trying to learn, and would like to better understand how Named Entity Recognition (NER) is implemented in practice, for example in popular python libraries such as spaCy.
I understand the basic concept behind it, but I suspect I am missing some details. From the documentation, it is not clear to me for example how much preprocessing is done on the text and annotation data; and what statistical model is used.
Do you know if:
Apologies if this is all trivial, I am having some trouble finding easy to read documentation on NER implementations.
In https://spacy.io/models/en#en_core_web_md they say English multi-task CNN trained on OntoNotes
. So I imagine that's how they obtain the NEs. You can see that the pipeline is
tagger, parser, ner
and read more here: https://spacy.io/usage/processing-pipelines. I would try to remove the different components and see what happens. This way you could see what depends on what. I'm pretty sure NER depends on tagger, but not sure whether requires the parser. All of them of course require the tokenizer
I don't understand your second point. If an entity is at the beginning or middle of a sentence is just fine, the NER system should be able to catch it. I don't see how you're using the word normalize
in a position of text context.
Regarding the model, they mention multi-task CNN, so I guess the CNN is the model for NER. Sometimes people use a CRF on top, but they don't mention it so probably is just that. According to their performance figures, it's good enough