Tags: nlp, stanford-nlp, gensim, word2vec

How to convert a small dataset into word embeddings instead of one-hot encoding?


I have a dataset of 33 words that are a mix of verbs and nouns, e.g. father, sing, etc. I have tried converting them to one-hot encodings, but for my use case it has been suggested to look into word2vec embeddings. I have looked into gensim and GloVe but am struggling to make them work.

How could I convert my data into embeddings, such that two words that are semantically closer have a smaller distance between their respective vectors? How can this be achieved, and is there any helpful material on the topic?

Such as this: [example embedding visualization]


Solution

  • Since your dataset is quite small, and I'm assuming it doesn't contain any jargon, it's best to use a pre-trained model in order to save on training time.

    With gensim, it's as simple as:

    import gensim.downloader as api

    # Download (on first use) and load the pre-trained Google News vectors
    wv = api.load('word2vec-google-news-300')
    

    The 'word2vec-google-news-300' model has been pre-trained on part of the Google News dataset and generalizes well to most tasks (note that the first call downloads the model, which is over a gigabyte). Following this, you can create word embeddings/vectors like so:

    # 300-dimensional numpy vector for the word 'father'
    vec = wv['father']
    

    And, finally, for computing word similarity:

    # Cosine similarity between the two words' vectors, in [-1, 1]
    similarity_score = wv.similarity('father', 'sing')
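
    For a small dataset like yours, you can look up every word's vector and compare all pairs directly. A minimal sketch (the word list below is a made-up stand-in for your actual 33 words):

    import gensim.downloader as api

    wv = api.load('word2vec-google-news-300')

    # Hypothetical subset of the 33-word dataset (mix of nouns and verbs)
    words = ['father', 'mother', 'sing', 'dance']

    # Cosine similarity for every pair: semantically closer words score higher
    for i, w1 in enumerate(words):
        for w2 in words[i + 1:]:
            print(f"{w1} ~ {w2}: {wv.similarity(w1, w2):.3f}")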
    

    Lastly, one major limitation of Word2Vec is its inability to deal with out-of-vocabulary (OOV) words. For such cases, it's best to train a custom model on your own corpus, as sketched below.
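
    A minimal training sketch with gensim (the toy corpus here is hypothetical; with only a handful of sentences the resulting vectors will be poor, so in practice you'd want as much in-domain text as you can gather):

    from gensim.models import Word2Vec

    # Hypothetical toy corpus: each sentence is a list of tokens
    sentences = [
        ['father', 'sings', 'a', 'song'],
        ['mother', 'dances', 'with', 'father'],
    ]

    # min_count=1 keeps every word since the corpus is tiny;
    # vector_size and window are common defaults, not tuned values
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

    vec = model.wv['father']  # embedding from the custom model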