I am working with the CoNLL-2003 NER (English) dataset and want to use pretrained embeddings for it, specifically the SENNA pretrained embeddings. My vocabulary has around 20k words, but embeddings are available for only about 9.5k of them.
My current approach is to initialize a 20k × embedding_size array with zeros, fill in the rows for the 9.5k words whose embeddings are known to me, and make all the embeddings learnable.
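Roughly, this is what I have in mind (a sketch, assuming PyTorch and a dict `pretrained` holding the known SENNA vectors; the names are placeholders):

```python
import numpy as np
import torch
import torch.nn as nn

def build_embedding_layer(vocab, pretrained, embedding_size):
    # vocab: list of ~20k words; pretrained: dict word -> vector for ~9.5k words
    weights = np.zeros((len(vocab), embedding_size), dtype=np.float32)
    for i, word in enumerate(vocab):
        if word in pretrained:
            weights[i] = pretrained[word]  # copy the known SENNA embedding
    # freeze=False keeps every row, known and zero-initialized, learnable
    return nn.Embedding.from_pretrained(torch.from_numpy(weights), freeze=False)
```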
My question is: what is the best way to do this? Any references to relevant research would be very helpful.
I would suggest three ways to tackle this problem, each with different strengths:
For an OOV word, e.g., `apple`, choose the closest (according to Levenshtein distance) word that you have an embedding for, e.g., `apples`. In my experience, this can work remarkably well. On the other hand, semantic similarity would suggest using, for instance, synonyms obtained from resources like WordNet, or even averaging the embeddings of the words that the OOV word frequently co-occurs with.
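A minimal sketch of the Levenshtein-based fallback, assuming the known embeddings live in a dict `pretrained` mapping words to vectors (an illustrative name, not a library API):

```python
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_embedding(oov_word, pretrained):
    # pick the known word with the smallest edit distance to the OOV word
    best = min(pretrained, key=lambda w: levenshtein(oov_word, w))
    return pretrained[best]
```

With this, `closest_embedding("apple", pretrained)` would return the vector for `apples` if that happens to be the nearest known word; the same lookup could be swapped for a WordNet synonym query or a co-occurrence average to trade string similarity for semantic similarity.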