Tags: python, nlp, spacy, named-entity-recognition

SpaCy 3: how to get the raw data used to train en_core_web_sm?


I am new to SpaCy. I noticed that there are a number of NER categories listed in the documentation for all of the en_core_web models:

'CARDINAL', 
'DATE', 
'EVENT', 
'FAC', 
'GPE', 
'LANGUAGE', 
'LAW', 
'LOC', 
'MONEY', 
'NORP', 
'ORDINAL', 
'ORG', 
'PERCENT', 
'PERSON', 
'PRODUCT', 
'QUANTITY', 
'TIME', 
'WORK_OF_ART'

I need to access the raw data used to assign each word the correct category. In other words, what's the list of words labelled as 'WORK_OF_ART', and is this list available?

The reason I ask this question is that I want to build a custom model that uses some of the default NER categories, as well as my own.


Solution

  • The data varies depending on which en_core_web variant you use; the table below lists the sources, and a sketch for checking them from the model's own metadata follows the table:

    | Dataset                                       | License         | URL                                      | web_sm | web_md | web_lg | web_trf |
    |-----------------------------------------------|-----------------|------------------------------------------|--------|--------|--------|---------|
    | OntoNotes 5                                   | LDC Non-Members | https://catalog.ldc.upenn.edu/LDC2013T19 | ✓      | ✓      | ✓      | ✓       |
    | WordNet 3.0                                   | WordNet License | https://wordnet.princeton.edu/download   | ✓      | ✓      | ✓      | ✓       |
    | ClearNLP Constituent-to-Dependency Conversion | Apache 2.0      | dependency_conversion.md                 | ✓      | ✓      | ✓      | ✓       |
    | GloVe Common Crawl                            | Apache 2.0      | https://nlp.stanford.edu/projects/glove/ | –      | ✓      | ✓      | –       |
    | RoBERTa base                                  | ???             | Fairseq Roberta                          | –      | –      | –      | ✓       |
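
    Rather than relying on the table alone, you can read the data sources straight out of the installed model package's metadata. This is a minimal sketch; the "sources" key and its fields reflect current en_core_web_* releases and may differ across versions, hence the defensive .get() calls.

    ```python
    # Print the datasets a spaCy model package reports it was built from.
    # Assumes the package metadata contains a "sources" list (true for current
    # en_core_web_* releases, but treated here as an assumption).
    import spacy

    nlp = spacy.load("en_core_web_sm")
    for source in nlp.meta.get("sources", []):
        print(source.get("name"), "|", source.get("license"), "|", source.get("url"))
    ```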

    The NER labelling scheme described at https://spacy.io/models/en comes from OntoNotes, which contains NER tags; see Section 2.6 of https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
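
    You can also confirm the label set directly from a loaded pipeline instead of the raw corpus; a minimal sketch using spaCy's pipeline API:

    ```python
    # List the OntoNotes-style NER labels shipped with a pretrained pipeline,
    # i.e. the 18 categories quoted in the question.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    ner = nlp.get_pipe("ner")
    print(ner.labels)  # ('CARDINAL', 'DATE', 'EVENT', ..., 'WORK_OF_ART')
    ```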

    The NER tags follow the CoNLL BIO format, see https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO; when read properly, each sentence should be a list of tuples, e.g. Get Stanford NER result through NLTK with IOB format
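
    Once you have BIO files, reading them into that tuple form takes only a few lines of plain Python. This sketch assumes a whitespace-separated file with the token in the first column, the BIO tag in the last, and a blank line between sentences; adjust the column indices to your data.

    ```python
    # Read a CoNLL-style BIO file into per-sentence lists of (token, tag) tuples.
    def read_bio(path):
        sentences, current = [], []
        with open(path, encoding="utf8") as f:
            for line in f:
                line = line.strip()
                if not line:              # blank line ends a sentence
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                cols = line.split()
                current.append((cols[0], cols[-1]))  # (token, BIO tag)
        if current:
            sentences.append(current)
        return sentences

    # e.g. [[('John', 'B-PERSON'), ('lives', 'O'), ('in', 'O'), ('London', 'B-GPE')], ...]
    ```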

    Also take a look at https://github.com/flairNLP/flair/ for training NER on OntoNotes; it might help.
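
    If flair turns out to be a better fit, getting an OntoNotes-trained tagger running is short. A hedged sketch, assuming the "ner-ontonotes" model name from flair's model zoo (it downloads on first use):

    ```python
    # Tag a sentence with flair's OntoNotes NER model and print the entity spans.
    from flair.data import Sentence
    from flair.models import SequenceTagger

    tagger = SequenceTagger.load("ner-ontonotes")
    sentence = Sentence("George Washington went to Washington.")
    tagger.predict(sentence)
    for entity in sentence.get_spans("ner"):
        print(entity)
    ```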