I am trying out Google's pre-trained word2vec model to get word embeddings. I can load the model in my code and get a 300-dimensional representation of a word. Here is the code:
import gensim

# Load the pre-trained GoogleNews vectors as read-only KeyedVectors
model = gensim.models.KeyedVectors.load_word2vec_format('/Downloads/GoogleNews-vectors-negative300.bin', binary=True)
dog = model['dog']   # vector for a single word
print(dog.shape)
which gives the following output:
>>> print(dog.shape)
(300,)
This works, but I am interested in obtaining a vector representation for an entire document, not just one word. How can I do that with the word2vec model? Passing a whole sentence directly fails with a KeyError:
dog_sentence = model['it is a cute little dog']
KeyError: "word 'it is a cute little dog' not in vocabulary"
I plan to compute these vectors for many documents and then train a clustering model on them for unsupervised learning and topic modeling.
That's a set of word-vectors. There's no single canonical way to turn word-vectors into vectors for longer runs of text, like sentences or documents.
You can try simply averaging the word-vectors for each word in the text. (To do this, you wouldn't pass the whole string text, but break it into words, look up each word-vector, then average all those vectors.)
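For example, here is a minimal sketch of that averaging, assuming the loaded KeyedVectors is in model and using plain whitespace tokenization (a real pipeline would handle punctuation and casing more carefully):

import numpy as np

def average_vector(kv, text):
    # Keep only tokens the model actually knows; the GoogleNews set omits some stopwords.
    words = [w for w in text.split() if w in kv]
    if not words:
        # Fallback for texts with no in-vocabulary words: a zero vector of the right size.
        return np.zeros(kv.vector_size, dtype=np.float32)
    # Element-wise mean of the word-vectors gives one fixed-size vector per text.
    return np.mean([kv[w] for w in words], axis=0)

doc_vector = average_vector(model, 'it is a cute little dog')
print(doc_vector.shape)   # (300,)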
That's quick and simple to calculate, and works OK as a baseline for some tasks, especially topical analyses of very short texts. But as it takes no account of grammar or word order, and dilutes every word with all the others, it's often outperformed by more sophisticated analyses.
Note also: that set of word-vectors was calculated by Google around 2013, from news articles. It will miss words and word-senses that have arisen since then, and its vectors will be flavored by the way news articles are written - very different from other domains of language. If you have enough data, training your own word-vectors on your own domain's texts may outperform them in both word coverage and vector relevance.
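If you go that route, a rough sketch of training your own vectors with gensim's Word2Vec class might look like the following (this assumes gensim 4.x parameter names and a hypothetical my_tokenized_docs, a list of token lists built from your own corpus):

from gensim.models import Word2Vec

# my_tokenized_docs is a placeholder: each document already split into tokens,
# e.g. [['it', 'is', 'a', 'cute', 'little', 'dog'], ...]
own_model = Word2Vec(
    sentences=my_tokenized_docs,
    vector_size=300,   # match the GoogleNews dimensionality (this parameter is 'size' in gensim 3.x)
    window=5,
    min_count=5,
    workers=4,
)
own_vectors = own_model.wv   # KeyedVectors, usable the same way as the loaded GoogleNews vectors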