machine-learning nlp stanford-nlp spacy named-entity-recognition

How to train NER to recognize that a word is not an entity?

I may have worded my question poorly, but basically I have been training new NER models with spaCy. I have trained some custom entities, and the model does a really great job when I test it on them. However, when I send it something that shouldn't be recognized as an entity, it still guesses one of the entity labels. I am guessing this is because I never gave it any examples labeled O (I think that's how Stanford NER marks non-entities).

Here is a sample of my training data. Does this look right? Do I need to just add trash values and set the entity as O?

[ "644663" , {"entities": [[0,6, "CARDINAL"]]}],
[ "871448" , {"entities": [[0,6, "CARDINAL"]]}],
[ "6/26/1967" , {"entities": [[0,9, "DATE"]]}],
[ "1/21/1969" , {"entities": [[0,9, "DATE"]]}],
[ "GORDON GARDIN" , {"entities": [[0,13, "PERSON"]]}],
[ "CANDRA CARDINAL" , {"entities": [[0,15, "PERSON"]]}],
[ "FIAT" , {"entities": [[0,4, "CARMAKE"]]}],
[ "FORD" , {"entities": [[0,4, "CARMAKE"]]}]

Solution

You're correct that the problem is that you haven't shown the system anything that's not an entity. However, you don't want to add "trash values". spaCy expects your training strings to contain entities in context, not just isolated examples of entities. So one training example should look more like:

[ "My uncle drives a Ford" , {"entities": [(18,22, "CARMAKE")]}]
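Counting character offsets by hand is error-prone, so here is a minimal sketch of building in-context examples in the same (text, {"entities": [...]}) shape as above. The helpers `make_example` and `token_labels` are hypothetical names, not part of spaCy, and the whitespace tokenization is only an approximation of what spaCy's tokenizer does. The sentences are invented for illustration, reusing values from the question.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]


def make_example(text: str, mentions: List[Tuple[str, str]]) -> Example:
    """Build one (text, annotations) pair, computing character offsets.

    `mentions` lists (surface string, label) pairs in order of appearance.
    """
    entities = []
    search_from = 0
    for surface, label in mentions:
        start = text.find(surface, search_from)
        if start == -1:
            raise ValueError(f"{surface!r} not found in {text!r}")
        end = start + len(surface)
        entities.append((start, end, label))
        search_from = end
    return (text, {"entities": entities})


def token_labels(example: Example) -> List[Tuple[str, str]]:
    """Show, per whitespace token, the entity label it falls under, or "O"."""
    text, annotations = example
    labelled = []
    position = 0
    for token in text.split():
        start = text.find(token, position)
        end = start + len(token)
        position = end
        label = "O"
        for ent_start, ent_end, ent_label in annotations["entities"]:
            if start >= ent_start and end <= ent_end:
                label = ent_label
        labelled.append((token, label))
    return labelled


TRAIN_DATA = [
    make_example("My uncle drives a Ford", [("Ford", "CARMAKE")]),
    make_example(
        "GORDON GARDIN was born on 6/26/1967",
        [("GORDON GARDIN", "PERSON"), ("6/26/1967", "DATE")],
    ),
    # A sentence with no entities at all: every token is a non-entity example.
    make_example("Please send the form back to the office", []),
]

for example in TRAIN_DATA:
    print(token_labels(example))
```

The printed token view makes the point of the answer visible: the surrounding words in each sentence are the "O" examples, so no separate trash values are needed.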

The words outside the annotated span ("My uncle drives a") act as your non-entity examples. This lets the system learn to recognize entities in context, and to recognize more entities than just the specific training examples you give it (e.g. a well-trained system would be able to recognize "Chrysler" and "Toyota" as car makes in addition to Ford and Fiat). The spaCy documentation has more in-depth examples of training custom entities, so I'd recommend checking those out.
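If all you have right now is a list of isolated values like the ones in the question, one possible stopgap is to drop them into sentence templates so that each value appears in context. This is only a sketch: `TEMPLATES`, `fill_template`, and `expand` are hypothetical names, the template sentences are made up, and real text from your own documents will almost certainly train a better model than a handful of templates.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]

# Hypothetical templates; "{}" marks where the entity value goes.
TEMPLATES: Dict[str, List[str]] = {
    "CARMAKE": ["My uncle drives a {}", "She traded in her old {} last year"],
    "PERSON": ["The policy holder is {}", "{} signed the form"],
}


def fill_template(template: str, value: str, label: str) -> Example:
    """Insert `value` into `template` and record its character span."""
    start = template.index("{}")
    text = template.replace("{}", value, 1)
    return (text, {"entities": [(start, start + len(value), label)]})


def expand(values: Dict[str, List[str]]) -> List[Example]:
    """Combine every value with every template for its label."""
    examples = []
    for label, items in values.items():
        for value in items:
            for template in TEMPLATES[label]:
                examples.append(fill_template(template, value, label))
    return examples


examples = expand({"CARMAKE": ["FIAT", "FORD"], "PERSON": ["GORDON GARDIN"]})
for text, annotations in examples:
    print(text, annotations)
```

Varying the templates matters more than varying the values: if every car make appears after "drives a", the model may learn that phrase rather than anything general about car makes.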