I loaded a word2vec model trained on the Google News dataset. Now I want to get the Word2Vec representations of a list of sentences that I wish to cluster. After going through the documentation I found gensim.models.word2vec.LineSentence, but I'm not sure that's what I'm looking for.
There should be a way to get word2vec representations of a list of sentences from a pretrained model, right? None of the links I searched had anything about it. Any leads would be appreciated.
Word2Vec only offers vector representations for words, not sentences.
One crude but somewhat effective (for some purposes) way to go from word-vectors to vectors for longer texts (like sentences) is to average all the word-vectors together. This isn't a function of the gensim Word2Vec
class; you have to code this yourself.
For example, with the word-vectors already loaded as word_model
, you'd roughly do:
import numpy as np

sentence_tokens = "I do not like green eggs and ham".split()

# Sum the word-vectors of all tokens, then divide by the token count to get the average.
sum_vector = np.zeros(word_model.vector_size)
for token in sentence_tokens:
    sum_vector += word_model[token]
sentence_vector = sum_vector / len(sentence_tokens)
Real code would also need to handle tokens that aren't known to the model, and might use other ways of tokenizing/filtering the text, and so forth.
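For instance, a minimal sketch that skips out-of-vocabulary tokens and averages a whole list of sentences might look like the following. It assumes the word-vectors are loaded as word_model (as above); the average_tokens helper and the sample sentences are just for illustration:

import numpy as np

def average_tokens(sentence, word_model):
    # Keep only tokens the model actually knows; skip everything else.
    known = [token for token in sentence.split() if token in word_model]
    if not known:
        return None  # no usable tokens in this sentence
    return np.mean([word_model[token] for token in known], axis=0)

sentences = ["I do not like green eggs and ham", "Sam I am"]
sentence_vectors = [average_tokens(s, word_model) for s in sentences]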
There are other, more sophisticated ways to get a vector for a longer text, such as the 'Paragraph Vectors' algorithm implemented by gensim's Doc2Vec
class. These don't necessarily start with pretrained word-vectors, but can be trained on your own corpus of texts.
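If you want to experiment with that, a minimal Doc2Vec training sketch might look like the following; the tiny inline corpus and the parameter values are purely illustrative, and Paragraph Vectors needs a much larger corpus to produce useful vectors:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# A tiny illustrative corpus; in practice this would be your own tokenized texts.
tokenized_texts = ["I do not like green eggs and ham".split(),
                   "I do not like them Sam I am".split()]

# Each training text needs a unique tag; here it's just the position in the list.
documents = [TaggedDocument(words=tokens, tags=[i])
             for i, tokens in enumerate(tokenized_texts)]

doc_model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Infer a vector for any tokenized text, whether or not it was in the training data.
vector = doc_model.infer_vector("green eggs and ham".split())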