I am new to spaCy. I noticed that the documentation for all en_core_web models lists a number of NER categories:
'CARDINAL',
'DATE',
'EVENT',
'FAC',
'GPE',
'LANGUAGE',
'LAW',
'LOC',
'MONEY',
'NORP',
'ORDINAL',
'ORG',
'PERCENT',
'PERSON',
'PRODUCT',
'QUANTITY',
'TIME',
'WORK_OF_ART'
I need to access the raw data used to assign each word its category. In other words, what is the list of words labelled 'WORK_OF_ART', and is this list available?
The reason I ask this question is that I want to build a custom model that uses some of the default NER categories, as well as my own.
The training data varies depending on which variant of en_core_web you use:
Dataset | License | URL | web_sm | web_md | web_lg | web_trf |
---|---|---|---|---|---|---|
OntoNotes 5 | LDC Non-Members | https://catalog.ldc.upenn.edu/LDC2013T19 | ✓ | ✓ | ✓ | ✓ |
Wordnet 3.0 | WordNet License | https://wordnet.princeton.edu/download | ✓ | ✓ | ✓ | ✓ |
ClearNLP Constituent-to-Dependency Conversion | Apache 2.0 | dependency_conversion.md | ✓ | ✓ | ✓ | ✓ |
GloVe Common Crawl | Apache 2.0 | https://nlp.stanford.edu/projects/glove/ | ✕ | ✓ | ✓ | ✕ |
Roberta Base | ??? | Fairseq Roberta |
The NER labelling scheme described at https://spacy.io/models/en comes from OntoNotes, which includes NER annotations; see Section 2.6 of https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
The NER tags follow the CoNLL BIO format (see https://github.com/yuchenlin/OntoNotes-5.0-NER-BIO). When read properly, each sentence should be a list of tuples, as in the question "Get Stanford NER result through NLTK with IOB format".
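To make the "list of tuples" idea concrete, here is a minimal sketch of reading BIO-tagged lines and collecting the entity strings per label, which is roughly how you could recover the examples annotated as 'WORK_OF_ART' once you have the data. It assumes whitespace-separated columns with the token first and the BIO tag last, and blank lines between sentences; the actual column layout of the OntoNotes files may differ, so check it before relying on this. The function names `read_bio_sentences` and `collect_entities` and the sample text are made up for illustration.

```python
from collections import defaultdict


def read_bio_sentences(lines):
    """Group CoNLL-style lines into sentences of (token, tag) tuples.

    Assumption: the token is the first column and the BIO tag is the last.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


def collect_entities(sentences):
    """Map each entity label to the list of entity strings tagged with it."""
    entities = defaultdict(list)
    for sentence in sentences:
        span, label = [], None
        # The trailing ("", "O") sentinel flushes a span that ends the sentence.
        for token, tag in sentence + [("", "O")]:
            prefix, _, tag_label = tag.partition("-")
            if prefix == "I" and tag_label == label:
                span.append(token)  # continue the open entity
                continue
            if span:
                entities[label].append(" ".join(span))
            if prefix in ("B", "I"):
                span, label = [token], tag_label  # start a new entity
            else:
                span, label = [], None
    return dict(entities)


# Hypothetical sample data, not taken from OntoNotes.
sample = """\
He O
read O
War B-WORK_OF_ART
and I-WORK_OF_ART
Peace I-WORK_OF_ART
in O
Paris B-GPE
""".splitlines()

sentences = read_bio_sentences(sample)
entities = collect_entities(sentences)
print(entities.get("WORK_OF_ART"))
```

Running the same two functions over the real training file (for example via `open(path, encoding="utf-8")`) should give you every annotated span per label, which you can then mix with your own labelled examples.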
Also take a look at https://github.com/flairNLP/flair/ for training NER on OntoNotes; it might help.