Tags: python, nlp, spacy, named-entity-recognition

SpaCy 3: how to get the raw data used to train en_core_web_sm?


I am new to SpaCy. I noticed that there are a number of NER categories listed in the documentation for all of the en_core_web models:

'CARDINAL', 
'DATE', 
'EVENT', 
'FAC', 
'GPE', 
'LANGUAGE', 
'LAW', 
'LOC', 
'MONEY', 
'NORP', 
'ORDINAL', 
'ORG', 
'PERCENT', 
'PERSON', 
'PRODUCT', 
'QUANTITY', 
'TIME', 
'WORK_OF_ART'

I need to access the raw data used to assign each word the correct category. In other words, what's the list of words labelled as 'WORK_OF_ART', and is this list available?

The reason I ask this question is that I want to build a custom model that uses some of the default NER categories, as well as my own.


Solution

  • The data varies depending on which en_core_web variant you use; the table below lists the sources, and a sketch for checking them from the model's own metadata follows the table:

    | Dataset                                       | License         | URL                                      | web_sm | web_md | web_lg | web_trf |
    |-----------------------------------------------|-----------------|------------------------------------------|--------|--------|--------|---------|
    | OntoNotes 5                                   | LDC Non-Members | https://catalog.ldc.upenn.edu/LDC2013T19 | ✓      | ✓      | ✓      | ✓       |
    | WordNet 3.0                                   | WordNet License | https://wordnet.princeton.edu/download   | ✓      | ✓      | ✓      | ✓       |
    | ClearNLP Constituent-to-Dependency Conversion | Apache 2.0      | dependency_conversion.md                 | ✓      | ✓      | ✓      | ✓       |
    | GloVe Common Crawl                            | Apache 2.0      | https://nlp.stanford.edu/projects/glove/ | –      | ✓      | ✓      | –       |
    | RoBERTa base                                  | ???             | Fairseq Roberta                          | –      | –      | –      | ✓       |
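
    Rather than relying on the table alone, you can read the data sources straight out of the installed model package's metadata. This is a minimal sketch; the "sources" key and its fields reflect current en_core_web_* releases and may differ across versions, hence the defensive .get() calls.

    ```python
    # Print the datasets a spaCy model package reports it was built from.
    # Assumes the package metadata contains a "sources" list (true for current
    # en_core_web_* releases, but treated here as an assumption).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    for source in nlp.meta.get("sources", []):
        print(source.get("name"), "|", source.get("license"), "|", source.get("url"))
    ```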

    The NER labelling scheme described at https://spacy.io/models/en comes from OntoNotes, which contains NER tags; see Section 2.6 of https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
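
    You can also confirm the label set directly from a loaded pipeline instead of the raw corpus; a minimal sketch using spaCy's pipeline API:

    ```python
    # List the OntoNotes-style NER labels shipped with a pretrained pipeline,
    # i.e. the 18 categories quoted in the question.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    ner = nlp.get_pipe("ner")
    print(ner.labels)  # ('CARDINAL', 'DATE', 'EVENT', ..., 'WORK_OF_ART')
    ```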

    The NER tags follow the CoNLL BIO format, see https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO; when read properly, each sentence should be a list of tuples, e.g. Get Stanford NER result through NLTK with IOB format
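
    Once you have BIO files, reading them into that tuple form takes only a few lines of plain Python. This sketch assumes a whitespace-separated file with the token in the first column, the BIO tag in the last, and a blank line between sentences; adjust the column indices to your data.

    ```python
    # Read a CoNLL-style BIO file into per-sentence lists of (token, tag) tuples.
    def read_bio(path):
        sentences, current = [], []
        with open(path, encoding="utf8") as f:
            for line in f:
                line = line.strip()
                if not line:              # blank line ends a sentence
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                cols = line.split()
                current.append((cols[0], cols[-1]))  # (token, BIO tag)
        if current:
            sentences.append(current)
        return sentences

    # e.g. [[('John', 'B-PERSON'), ('lives', 'O'), ('in', 'O'), ('London', 'B-GPE')], ...]
    ```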

    Also take a look at https://github.com/flairNLP/flair/ for training NER on OntoNotes; it might help.
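
    If flair turns out to be a better fit, getting an OntoNotes-trained tagger running is short. A hedged sketch, assuming the "ner-ontonotes" model name from flair's model zoo (it downloads on first use):

    ```python
    # Tag a sentence with flair's OntoNotes NER model and print the entity spans.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("ner-ontonotes")
    sentence = Sentence("George Washington went to Washington.")
    tagger.predict(sentence)
    for entity in sentence.get_spans("ner"):
        print(entity)
    ```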