python, nlp, spacy, named-entity-recognition

How to train a spaCy model that treats '&' and 'and' as equivalent for accurate prediction


I've trained a spaCy NER model in which text like the following was mapped to the Company entity during training:

John & Doe & One pvt ltd -> Company

Now I find that in some cases, when the equivalent sentence below is given during prediction, it is categorized as Other:

John and Doe and One pvt ltd -> Other

What should be done to overcome this problem, so that the model understands cases like "& == and" and "v == vs == versus" as having the same meaning?


Solution

  • For these kinds of cases you want to add lexeme norms or token norms.

    import spacy

    nlp = spacy.blank("en")  # or load your trained pipeline
    doc = nlp("John and Doe and One pvt ltd")
    # lexeme norm (the default for every token with this text)
    nlp.vocab["and"].norm_ = "&"
    # token norm (just this one token in this doc)
    doc[1].norm_ = "&"


    The statistical models all use token.norm instead of token.orth as a feature by default. You can set token.norm_ for an individual token in a doc (sometimes you might want normalizations that depend on the context), or set nlp.vocab["word"].norm_ as the default for any token that doesn't have an individual token.norm set.
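
    If the normalization depends on context, one way to apply it consistently at runtime is a small custom pipeline component that rewrites token.norm_ before the NER component runs. A minimal sketch, assuming spaCy v3 and a hypothetical component name normalize_variants:

    import spacy
    from spacy.language import Language

    @Language.component("normalize_variants")  # hypothetical component name
    def normalize_variants(doc):
        # collapse surface variants onto a single norm before the NER sees them
        mapping = {"and": "&", "v": "vs", "versus": "vs"}
        for token in doc:
            if token.lower_ in mapping:
                token.norm_ = mapping[token.lower_]
        return doc

    nlp = spacy.load("en_core_web_sm")  # or your trained pipeline
    nlp.add_pipe("normalize_variants", before="ner")

    doc = nlp("John and Doe and One pvt ltd")
    print([(t.text, t.norm_) for t in doc])

    The component only changes the norms the NER features look at; the original text stays untouched.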

    If you add lexeme norms to the vocab and save the model with nlp.to_disk, the lexeme norms are included in the saved model.
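
    As a quick sketch of that (the output path here is just an example):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # or your trained pipeline

    # register the defaults in the vocab
    for variant, norm in {"and": "&", "v": "vs", "versus": "vs"}.items():
        nlp.vocab[variant].norm_ = norm

    # the lexeme norms live in the vocab, so they travel with the saved model
    nlp.to_disk("company_ner_with_norms")

    nlp2 = spacy.load("company_ner_with_norms")
    print(nlp2.vocab["and"].norm_)  # "&"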