gensim, word2vec

Word2Vec convert a sentence


I have trained a Word2Vec model using gensim, and I have a dataset of tweets that I would like to convert to vectors. What is the best way to convert a sentence to a vector, and how can this be done using a word2vec model?


Solution

  • Formally, the word2vec algorithm only gives you a vector per word, not per longer text (like a sentence or paragraph or tweet or article).

    One quick & easy baseline approach for turning longer texts into vectors is to just average together the vectors of each word. Recent versions of Gensim have a helper method get_mean_vector() to do this on KeyedVectors model objects (sets-of-word-vectors):

    text_vector = kv_model.get_mean_vector(list_of_words)
    

    Of course, such a simpleminded average has no way to model the effects of word-order/grammar. Words may tend to cancel each other out rather than have the compositional effects of real language, and the space of possible multiword-text meanings is much larger than the space of single-word meanings – so just collapsing the text into the same coordinate system as words may lose a lot.

    More sophisticated ways of vectorizing text rely on models far more sophisticated than plain word2vec, such as deep/recurrent neural networks for modelling longer ranges of text.
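
To make the averaging baseline and its blind spot concrete, here is a minimal standard-library sketch. It is not Gensim's implementation: the `TOY_VECTORS` table and the `mean_vector` helper are hypothetical names invented for illustration, and the 3-dimensional vectors are made up rather than trained. A real helper such as get_mean_vector() may differ in details like normalization and how unknown words are handled.

```python
from typing import Dict, List, Sequence

# Hypothetical 3-dimensional word vectors, made up for illustration.
# In practice these would be looked up in a trained model's word vectors.
TOY_VECTORS: Dict[str, List[float]] = {
    "good": [1.0, 0.0, 0.5],
    "bad": [-1.0, 0.0, 0.5],
    "not": [0.0, 1.0, 0.0],
    "movie": [0.0, 0.0, 1.0],
}


def mean_vector(words: Sequence[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the known words, skipping unknown ones."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("none of the words have a vector")
    dim = len(known[0])
    # Element-wise mean across all known word vectors.
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Two texts with the same words in a different order.
a = mean_vector("good not bad movie".split(), TOY_VECTORS)
b = mean_vector("bad not good movie".split(), TOY_VECTORS)

print(a)
print(a == b)  # word order does not affect the average
```

In this toy data the first coordinates of "good" and "bad" cancel out in the mean, and both word orders collapse to the same vector, which is the loss of word-order information described above.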