Tags: nlp, stanford-nlp, gensim, word2vec

How to convert a small dataset into word embeddings instead of one-hot encoding?


I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested to look into word2vec embeddings. I have looked into gensim and GloVe but am struggling to make them work.

How could I convert my data into embeddings, such that two words that are semantically closer have a smaller distance between their respective vectors? How can this be achieved, and is there any helpful material on the topic?

Such as this: [example embedding visualization]


Solution

  • Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.

    With gensim, it's as simple as:

    import gensim.downloader as api

    # Download (on first use) and load the pre-trained Google News vectors
    wv = api.load('word2vec-google-news-300')
    

    The 'word2vec-google-news-300' model has been pre-trained on part of the Google News dataset and generalizes well to most tasks (note that the first call downloads the model, which is over a gigabyte). Following this, you can create word embeddings/vectors like so:

    # 300-dimensional numpy vector for the word 'father'
    vec = wv['father']
    

    And, finally, for computing word similarity:

    # Cosine similarity between the two words' vectors, in [-1, 1]
    similarity_score = wv.similarity('father', 'sing')
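
    For a small dataset like yours, you can look up every word's vector and compare all pairs directly. A minimal sketch (the word list below is a made-up stand-in for your actual 33 words):

    import gensim.downloader as api

    wv = api.load('word2vec-google-news-300')

    # Hypothetical subset of the 33-word dataset (mix of nouns and verbs)
    words = ['father', 'mother', 'sing', 'dance']

    # Cosine similarity for every pair: semantically closer words score higher
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            print(f"{w1} ~ {w2}: {wv.similarity(w1, w2):.3f}")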
    

    Lastly, one major limitation of Word2Vec is its inability to deal with out-of-vocabulary (OOV) words. For such cases, it's best to train a custom model on your own corpus, as sketched below.
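
    A minimal training sketch with gensim (the toy corpus here is hypothetical; with only a handful of sentences the resulting vectors will be poor, so in practice you'd want as much in-domain text as you can gather):

    from gensim.models import Word2Vec

    # Hypothetical toy corpus: each sentence is a list of tokens
    sentences = [
        ['father', 'sings', 'a', 'song'],
        ['mother', 'dances', 'with', 'father'],
    ]

    # min_count=1 keeps every word since the corpus is tiny;
    # vector_size and window are common defaults, not tuned values
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

    vec = model.wv['father']  # embedding from the custom model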