Named Entity Recognition in practice


I am an NLP novice trying to learn, and I would like to better understand how Named Entity Recognition (NER) is implemented in practice, for example in popular Python libraries such as spaCy.

I understand the basic concept behind it, but I suspect I am missing some details. From the documentation, it is not clear to me, for example, how much preprocessing is done on the text and annotation data, or what statistical model is used.

Specifically:

  • Does the text have to go through chunking before the model is trained? Without it, would the model be unable to do anything useful?
  • Are the text and annotations typically normalized before the model is trained, so that a named entity is recognized whether it appears at the beginning or in the middle of a sentence?
  • In spaCy specifically, how is this implemented concretely? Is the model an HMM, a CRF, or something else?

Apologies if this is all trivial; I am having some trouble finding easy-to-read documentation on NER implementations.


Solution

  • The model page at https://spacy.io/models/en#en_core_web_md describes the model as an English multi-task CNN trained on OntoNotes, so I imagine that is how the named entities are obtained. The page lists the pipeline as

    tagger, parser, ner

    and you can read more at https://spacy.io/usage/processing-pipelines. I would try disabling the individual components and see what happens; that way you can see what depends on what. As far as I know, ner does not actually need the tagger or the parser in spaCy's pretrained English pipelines, so it should keep working with both disabled. All components, of course, require the tokenizer.

    I don't understand your second point. An entity at the beginning or in the middle of a sentence is fine; the NER system should be able to catch it either way. I don't see what you mean by "normalize" in the context of a position in the text.

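    Regarding the model, they mention a multi-task CNN, so I guess the CNN is the model for NER. Sometimes people put a CRF on top, but they don't mention one, so it is probably just the CNN. spaCy's API documentation also describes the entity recognizer as a transition-based component, which suggests it is neither an HMM nor a CRF. According to their performance figures, it's good enough.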