cluster-analysis, gensim, word2vec

After loading a pretrained Word2Vec model, how do I get word2vec representations of new sentences?


I loaded a pretrained word2vec model trained on the Google News dataset. Now I want to get the Word2Vec representations of a list of sentences that I wish to cluster. After going through the documentation I found gensim.models.word2vec.LineSentence, but I'm not sure this is what I am looking for.

There should be a way to get word2vec representations of a list of sentences from a pretrained model, right? None of the links I found covered this. Any leads would be appreciated.


Solution

  • Word2Vec only offers vector representations for words, not sentences.

    One crude but somewhat effective (for some purposes) way to go from word-vectors to vectors for longer texts (like sentences) is to average all the word-vectors together. This isn't a function of the gensim Word2Vec class; you have to code this yourself.

    For example, with the word-vectors already loaded as word_model, you'd roughly do:

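    import numpy as np

    # word_model holds the loaded word-vectors (e.g. a gensim KeyedVectors object)
    sentence_tokens = "I do not like green eggs and ham".split()
    sum_vector = np.zeros(word_model.vector_size)
    for token in sentence_tokens:
        sum_vector += word_model[token]  # raises KeyError if the token is unknown
    sentence_vector = sum_vector / len(sentence_tokens)  # average of the word-vectors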
    

    Real code might add handling for when the tokens aren't all known to the model, or other ways of tokenizing/filtering the text, and so forth.
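
    As a rough sketch of that unknown-token handling, the averaging can be wrapped in a helper that skips tokens the model doesn't know and falls back to a zero vector when nothing is known. The name average_vector and the toy_vectors dict are hypothetical, introduced only for illustration; the dict stands in for the loaded word-vectors, and the same `in` and `[]` lookups should also work on a gensim KeyedVectors object.

```python
import numpy as np

def average_vector(tokens, word_vectors, vector_size):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(vector_size)
    return np.mean(known, axis=0)

# Hypothetical toy stand-in for a loaded model: a dict of 3-dimensional vectors.
toy_vectors = {
    "green": np.array([1.0, 0.0, 0.0]),
    "eggs": np.array([0.0, 1.0, 0.0]),
    "ham": np.array([0.0, 0.0, 1.0]),
}

sentences = ["green eggs and ham", "I do not like ham"]

# One row per sentence; this 2-D array is the usual input shape for clustering code.
sentence_vectors = np.array(
    [average_vector(s.split(), toy_vectors, 3) for s in sentences]
)
```

    With real word-vectors you'd pass word_model and word_model.vector_size instead of the toy dict and 3. Skipping unknown tokens is only one possible choice; dropping sentences with no known tokens may suit clustering better than keeping zero vectors.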

    There are other, more sophisticated ways to get a vector for a longer text, such as the 'Paragraph Vectors' algorithm implemented by gensim's Doc2Vec class. These don't necessarily start with pretrained word-vectors, but can be trained on your own corpus of texts.