python-3.x gensim word2vec word-embedding doc2vec

How to sentence embed from gensim Word2Vec embedding vectors?


I have a pandas dataframe containing descriptions. I would like to cluster the descriptions by meaning using CBOW. My challenge for now is to embed each row as a document vector of equal dimension. First, I train the word vectors using gensim like so:

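import pandas as pd
from gensim.models import Word2Vec

# Word2Vec expects token lists, not raw strings, so split each description
vocab = pd.concat((df['description'], df['more_description'])).dropna().str.split().tolist()
# Note: `size` is named `vector_size` in gensim >= 4.0
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)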

However, I am a bit confused about how to replace the full sentences in my df with document vectors of equal dimensions.

For now, my workaround is replacing each word in each row with a vector, then applying PCA dimensionality reduction to bring each row to the same dimensions. Is there a better way of doing this through gensim, so that I could say something like this:

df['description'].apply(model.vectorize)

Solution

  • I think you are looking for sentence embeddings. There are a lot of ways to generate sentence embeddings from word embeddings. You may find this useful: https://stats.stackexchange.com/questions/286579/how-to-train-sentence-paragraph-document-embeddings
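
One of the simplest of those approaches is to average the word vectors of each sentence, which always yields a vector with the same dimension as the word embeddings, so no PCA step is needed. The following is only a minimal sketch of that idea, not necessarily the best method for clustering. The helper name `sentence_vector` and the `toy_vectors` dictionary are made up for illustration; the dictionary stands in for a trained model's `model.wv`, which supports the same `in` and `[]` lookups.

```python
import numpy as np

def sentence_vector(text, word_vectors, dim):
    """Average the vectors of the in-vocabulary tokens of `text`.

    `word_vectors` is anything that supports `token in word_vectors` and
    `word_vectors[token]`, e.g. a trained gensim model's `model.wv`.
    """
    # Assumption: whitespace tokenization, matching how the model was trained
    tokens = str(text).split()
    # Skip tokens the model never saw (e.g. those dropped by min_count)
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        # No known tokens: fall back to a zero vector of the right size
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Toy stand-in for `model.wv`, used only to make the sketch runnable
toy_vectors = {
    "red": np.array([1.0, 0.0, 0.0]),
    "car": np.array([0.0, 1.0, 0.0]),
}
vec = sentence_vector("red car unknownword", toy_vectors, dim=3)
print(vec)
```

With the model from the question, this could be applied as `df['description'].apply(lambda s: sentence_vector(s, model.wv, model.wv.vector_size))`, and the resulting column stacked with `np.vstack` into a matrix for clustering. Plain averaging ignores word order and weights every word equally, so TF-IDF weighting or gensim's Doc2Vec may be worth trying if the clusters look poor.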