Tags: machine-learning, nlp, deep-learning, word-embedding

How to initialize word-embeddings for Out of Vocabulary Word?


I am working with the CoNLL-2003 NER (English) dataset and want to use pretrained embeddings for it, specifically the SENNA embeddings. My vocabulary contains around 20k words, but embeddings are available for only about 9.5k of them.
My current approach is to initialize a 20k x embedding_size array with zeros, copy in the vectors for the 9.5k words whose embeddings I have, and make all the embeddings learnable (a sketch of this setup is shown below).
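A minimal sketch of that setup (the vocabulary and pretrained vectors here are placeholders standing in for the real CoNLL-2003 vocabulary and the SENNA file):

```python
import numpy as np

embedding_size = 50
vocab = ["the", "company", "apple", "apples"]              # hypothetical vocabulary
pretrained = {"the": np.random.rand(embedding_size),       # stand-in for SENNA vectors
              "apples": np.random.rand(embedding_size)}

# Zero-initialize the full matrix, then copy in the vectors we do have.
embedding_matrix = np.zeros((len(vocab), embedding_size), dtype=np.float32)
for idx, word in enumerate(vocab):
    if word in pretrained:
        embedding_matrix[idx] = pretrained[word]

# The matrix can then seed a trainable embedding layer, e.g. in PyTorch:
# torch.nn.Embedding.from_pretrained(torch.tensor(embedding_matrix), freeze=False)
```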

My question is: what is the best way to do this? Any reference to relevant research would be very helpful.


Solution

  • I would suggest three ways to tackle this problem, each with different strengths:

    • Instead of using the SENNA embeddings, try using FastText embeddings. The advantage is that FastText can infer embeddings for OOV words from character n-grams; for the exact methodology, see the associated paper. Gensim implements all the functionality needed (a minimal sketch follows this list). This greatly reduces the problem, and you can further fine-tune the induced embeddings as you describe. The inconvenience is that you have to switch from SENNA to FastText.
    • Try using morphological or semantic similarity to initialize the OOV words. By morphological, I mean using a distance such as Levenshtein to select an embedding: for an OOV word like apple, choose the closest word (according to Levenshtein distance) that you do have an embedding for, e.g., apples (see the sketch after this list). In my experience, this can work remarkably well. Semantic similarity, on the other hand, would mean using synonyms obtained from resources like WordNet, or averaging the embeddings of words that the OOV word frequently co-occurs with.
    • After having reduced the sparsity in the ways described above, proceed with zero or random initialization as discussed in other responses.
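For the first suggestion, here is a minimal sketch using Gensim's FastText implementation (the toy corpus and hyperparameters are illustrative; in practice you would train on a large corpus or load pretrained vectors, e.g. with gensim.models.fasttext.load_facebook_vectors):

```python
from gensim.models import FastText

# Toy corpus for demonstration only.
sentences = [
    ["the", "company", "reported", "strong", "earnings"],
    ["apples", "and", "oranges", "are", "fruit"],
]

# min_n/max_n set the character n-gram range used to build subword vectors.
model = FastText(sentences, vector_size=50, window=3, min_count=1,
                 min_n=3, max_n=5, epochs=10)

# "apple" never appears in the corpus, but FastText composes a vector for it
# from its character n-grams, so the lookup still succeeds.
print("apple" in model.wv.key_to_index)   # False: genuinely OOV
print(model.wv["apple"].shape)            # (50,)
```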
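For the morphological variant of the second suggestion, a small sketch (the init_oov helper, the distance threshold, and the toy pretrained dict are illustrative, not from any library):

```python
import numpy as np

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                     # deletion
                           cur[j - 1] + 1,                  # insertion
                           prev[j - 1] + (ca != cb)))       # substitution
        prev = cur
    return prev[-1]

def init_oov(word, pretrained, dim, max_dist=3):
    """Copy the vector of the closest in-vocabulary word, falling back to
    small random values if nothing is within max_dist edits."""
    closest = min(pretrained, key=lambda w: levenshtein(word, w))
    if levenshtein(word, closest) <= max_dist:
        return pretrained[closest].copy()
    return np.random.uniform(-0.25, 0.25, dim).astype(np.float32)

# pretrained maps known words to their vectors (e.g. loaded from SENNA).
pretrained = {"apples": np.random.rand(50).astype(np.float32),
              "banana": np.random.rand(50).astype(np.float32)}
vec = init_oov("apple", pretrained, dim=50)   # copies the "apples" vector
```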