I am writing a program that uses the spacy model en_core_web_md for Named Entity Recognition. It was not identifying all my entities correctly: for instance, there were some names of people and organisations that were not being recognised as such.
I looked up how to train the model and found this script: https://github.com/explosion/spaCy/blob/master/examples/training/train_ner.py
I downloaded the script, put it in the same folder as my program, replaced their training data with my own (containing the names I wanted it to recognise) and ran it with model="en_core_web_md" and output_dir="model" instead of None.
My project involves video game characters so my training data was:
TRAIN_DATA = [
("Who is Cave Johnson?", {"entities": [(7, 19, "PERSON")]}),
("I work for Aperture Science.", {"entities": [(11, 27, "ORG")]}),
("Wallace Breen is CEO of Black Mesa.", {"entities": [(0, 13, "PERSON"), (24, 34, "ORG")]}),
]
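Since the entity offsets are plain character indices into each sentence, it is easy to be off by one when counting by hand. A minimal sketch for checking them, using only the standard library: the helper names show_spans and find_span are hypothetical and are not part of spaCy or of train_ner.py.

```python
def show_spans(text, entities):
    """Return the substring that each (start, end, label) annotation points at."""
    return [(text[start:end], label) for start, end, label in entities]


def find_span(text, phrase, label):
    """Compute the (start, end, label) tuple for the first occurrence of phrase."""
    start = text.index(phrase)  # raises ValueError if the phrase is not in the text
    return (start, start + len(phrase), label)


text = "Wallace Breen is CEO of Black Mesa."
entities = [
    find_span(text, "Wallace Breen", "PERSON"),
    find_span(text, "Black Mesa", "ORG"),
]
print(entities)                    # [(0, 13, 'PERSON'), (24, 34, 'ORG')]
print(show_spans(text, entities))  # [('Wallace Breen', 'PERSON'), ('Black Mesa', 'ORG')]
```

Slicing each sentence with its annotated offsets shows exactly what text the model is being told is an entity, so a span that is shifted by a character shows up immediately.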
The train_ner script outputs the expected results. However, when I run my other program, it still does not recognise "Cave Johnson" as a PERSON or "Black Mesa" as an ORG. Why is the script not working?
Update: still haven't got it working. I ran the script again, to no apparent effect.
Looking in more detail at the GitHub issues, it turns out that even though the example script only trains on a couple of sentences, you are expected to use hundreds of examples when you run it yourself.
This is not clear from the documentation to someone with no NLP experience. Hopefully this question and answer will now come up in searches, so other people don't have to spend weeks wondering what they were doing wrong.
Basically, I just need more sentences.
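One way to get more sentences without counting offsets by hand is to fill names into sentence templates. This is only a rough sketch under my own assumptions: the templates, the helper names make_example and build_train_data, and the idea that templated sentences are varied enough are all hypothetical, and real, varied sentences are probably better training material.

```python
import random

# Hypothetical templates; each must contain exactly one "{}" placeholder.
PERSON_TEMPLATES = ["Who is {}?", "I spoke to {} yesterday.", "{} founded the company."]
ORG_TEMPLATES = ["I work for {}.", "She is CEO of {}.", "{} built the facility."]

PEOPLE = ["Cave Johnson", "Wallace Breen"]
ORGS = ["Aperture Science", "Black Mesa"]


def make_example(template, name, label):
    """Fill one name into a template and compute its character offsets."""
    start = template.index("{}")  # the name begins where the placeholder was
    text = template.format(name)
    return (text, {"entities": [(start, start + len(name), label)]})


def build_train_data(seed=0):
    """Build a shuffled list in the same shape as TRAIN_DATA."""
    examples = [make_example(t, n, "PERSON") for t in PERSON_TEMPLATES for n in PEOPLE]
    examples += [make_example(t, n, "ORG") for t in ORG_TEMPLATES for n in ORGS]
    random.Random(seed).shuffle(examples)  # fixed seed keeps runs repeatable
    return examples


TRAIN_DATA = build_train_data()
print(len(TRAIN_DATA))  # 12
```

With more templates and more names the list grows quickly, and every offset is computed rather than typed in.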