python, nlp, spacy, named-entity-recognition

How to train a spaCy model that treats '&' and 'and' as equivalent for accurate prediction


I've trained a spaCy NER model in which text like the following was mapped to the Company entity during training:

John & Doe & One pvt ltd -> Company

Now I find that in some cases, when the equivalent sentence below is given during prediction, it is categorized as Other:

John and Doe and One pvt ltd -> Other

What should be done to overcome this problem, so that the model understands cases like "& == and" and "v == vs == versus" as having the same meaning?


Solution

  • For these kinds of cases you want to add lexeme norms or token norms.

    import spacy

    nlp = spacy.blank("en")  # or load your trained pipeline
    doc = nlp("John and Doe and One pvt ltd")
    # lexeme norm (the default for every token with this text)
    nlp.vocab["and"].norm_ = "&"
    # token norm (just this one token in this doc)
    doc[1].norm_ = "&"


    The statistical models all use token.norm instead of token.orth as a feature by default. You can set token.norm_ for an individual token in a doc (sometimes you might want normalizations that depend on the context), or set nlp.vocab["word"].norm_ as the default for any token that doesn't have an individual token.norm set.
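
    If the normalization depends on context, one way to apply it consistently at runtime is a small custom pipeline component that rewrites token.norm_ before the NER component runs. A minimal sketch, assuming spaCy v3 and a hypothetical component name normalize_variants:

    import spacy
    from spacy.language import Language

    @Language.component("normalize_variants")  # hypothetical component name
    def normalize_variants(doc):
        # collapse surface variants onto a single norm before the NER sees them
        mapping = {"and": "&", "v": "vs", "versus": "vs"}
        for token in doc:
            if token.lower_ in mapping:
                token.norm_ = mapping[token.lower_]
        return doc

    nlp = spacy.load("en_core_web_sm")  # or your trained pipeline
    nlp.add_pipe("normalize_variants", before="ner")

    doc = nlp("John and Doe and One pvt ltd")
    print([(t.text, t.norm_) for t in doc])

    The component only changes the norms the NER features look at; the original text stays untouched.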

    If you add lexeme norms to the vocab and save the model with nlp.to_disk, the lexeme norms are included in the saved model.
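
    As a quick sketch of that (the output path here is just an example):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # or your trained pipeline

    # register the defaults in the vocab
    for variant, norm in {"and": "&", "v": "vs", "versus": "vs"}.items():
        nlp.vocab[variant].norm_ = norm

    # the lexeme norms live in the vocab, so they travel with the saved model
    nlp.to_disk("company_ner_with_norms")

    nlp2 = spacy.load("company_ner_with_norms")
    print(nlp2.vocab["and"].norm_)  # "&"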