I have trained a Word2Vec model using gensim, I have a dataset of tweets that I would like to convert to vectors. What is the best way to convert a sentence to a vector + how can this be done using a word2vec model.
Formally, the word2vec algorithm only gives you a vector per word, not per longer text (like a sentence or paragraph or tweet or article).
One quick & easy baseline approach for turning longer texts into vectors is to just average together the vectors of each word. Recent versions of Gensim have a helper method get_mean_vector()
to do this on KeyedVectors
model objects (sets-of-word-vectors):
text_vector = kv_model.get_mean_vector(list_of_words)
Of course, such a simpleminded average has no way to model the effects of word-order/grammar. Words may tend to cancel each other out rather than have the compositional effects of real language, and the space of possible multiword-text meanings is much larger than the space of single-word meanings – so just collapsing the text into the same coordinate system as words may lose a lot.
More sophisticated ways of vectorizing text rely on model far more more sophisticated than plain word2vec, such as deep/recurrent neural networks for modelling longer ranges of text.