Tags: spacy, named-entity-recognition, spacy-transformers

Why don't spaCy transformer models do NER for non-English languages?


Why don't spaCy transformer models for languages like Spanish (es_dep_news_trf) do named entity recognition?

The English model (en_core_web_trf), however, does.

In code:

import spacy

# English transformer pipeline: entities are found
nlp = spacy.load("en_core_web_trf")
doc = nlp("my name is John Smith and I work at Apple and I like visiting the Eiffel Tower")
print(doc.ents)
# (John Smith, Apple, the Eiffel Tower)

# Spanish transformer pipeline: doc.ents is empty
nlp = spacy.load("es_dep_news_trf")
doc = nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
print(doc.ents)
# ()

Why does the English model extract entities while the Spanish one does not?


Solution

  • It comes down to the available training data. The trf models only include ner when the training data has NER annotation on exactly the same data used for tagging and parsing.

    Training trf models on partial annotation does not work well in practice. An independent NER component (as in the CNN pipelines) would require an additional transformer component in the pipeline, which would make the pipeline much larger and slower.
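One way to confirm this is to check whether a loaded pipeline actually has an ner component before relying on doc.ents. The sketch below is only illustrative: has_ner and compare are hypothetical helper names, and compare assumes both model packages from the question have already been downloaded.

```python
from typing import Any, Iterable


def has_ner(nlp: Any) -> bool:
    """Return True if the pipeline has a component named "ner"."""
    # Language.pipe_names lists the active component names in order.
    return "ner" in nlp.pipe_names


def compare(model_names: Iterable[str] = ("en_core_web_trf", "es_dep_news_trf")) -> None:
    """Print the components of each pipeline and whether "ner" is among them."""
    # Imported lazily so has_ner() stays usable without spaCy installed.
    import spacy

    for name in model_names:
        # Assumption: the model package was installed beforehand,
        # e.g. with `python -m spacy download <name>`.
        nlp = spacy.load(name)
        print(name, nlp.pipe_names, "has ner:", has_ner(nlp))
```

If has_ner returns False, doc.ents stays empty because no component ever sets it. For Spanish entities, the CNN pipelines (for example es_core_news_lg) ship with an ner component and can be loaded instead.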