I have a pandas
dataframe containing descriptions. I would like to cluster descriptions based on meanings usign CBOW
. My challenge for now is to document embed each row into equal dimensions vectors. At first I am training the word vectors using gensim
as so:
from gensim.models import Word2Vec
vocab = pd.concat((df['description'], df['more_description']))
model = Word2Vec(sentences=vocab, size=100, window=10, min_count=3, workers=4, sg=0)
I am however a bit confused now on how to replace the full sentences from my df
with document vectors of equal dimensions.
For now, my workaround is repacing each word in each row with a vector then applying PCA dimentinality reduction to bring each vector to similar dimensions. Is there a better way of doing this though gensim
, so that I could say something like this:
df['description'].apply(model.vectorize)
I think you are looking for sentence embedding. There are a lot ways of generating sentence embedding from word embeddings. You may find this useful: https://stats.stackexchange.com/questions/286579/how-to-train-sentence-paragraph-document-embeddings