I've trained a spaCy NER model in which text was mapped to the Company entity during training, e.g.:
John & Doe & One pvt ltd -> Company
Now in some cases I find that, at prediction time, a sentence like the one below is categorized as Other:
John and Doe and One pvt ltd -> Other
What should be done to overcome this problem, so that cases like "& == and" and "v == vs == versus" are understood by the model as having the same meaning?
For these kinds of cases you want to add lexeme norms or token norms.
# lexeme norm
nlp.vocab["and"].norm_ = "&"
# token norm
doc[1].norm_ = "&"
The statistical models all use token.norm instead of token.orth as a feature by default. You can set token.norm_ for an individual token in a doc (sometimes you might want normalizations that depend on the context), or set nlp.vocab["word"].norm_ as the default for any token that doesn't have an individual token.norm set.
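As a minimal sketch of that fallback behavior, using a blank English pipeline rather than your trained model (the norm lookup works the same way either way):

```python
import spacy

# Blank pipeline just for illustration; your trained model's vocab behaves the same
nlp = spacy.blank("en")

# Lexeme norm: becomes the default norm for every "and" token
nlp.vocab["and"].norm_ = "&"

doc = nlp("John and Doe and One pvt ltd")
print(doc[1].norm_)  # "&" - no token-level norm is set, so the lexeme norm applies

# Token norm: overrides the default for this one token in this doc only
doc[3].norm_ = "and"
print(doc[3].norm_)  # "and" - the token-level norm wins
print(doc[1].norm_)  # still "&" - other tokens keep the lexeme default
```

Note that doc[1] and doc[3] are both "and" here, which shows the token norm overriding only the token it was set on.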
If you add lexeme norms to the vocab and save the model with nlp.to_disk, the lexeme norms are included in the saved model.