I have seen many tutorials online on how to use Word2Vec (gensim). Most tutorials show how to find the `.most_similar` word or the similarity between two words. But what if I have text data `X` and I want to produce its word-embedding vector `X_vector`, so that `X_vector` can be used for classification algorithms?
If `X` is a word (string token), you can look up its vector with `word_model[X]`.
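For instance, a quick sketch of that lookup using a small pretrained vector set from `gensim.downloader` (the model name and the word "apple" are illustrative choices, not part of the original answer; any `KeyedVectors` behaves the same):

```python
import gensim.downloader as api

# Downloads (once) and loads a pretrained KeyedVectors set.
word_model = api.load("glove-wiki-gigaword-50")

vector = word_model["apple"]  # the 50-dimensional vector for one word
print(vector.shape)           # (50,)

# Note: for a full Word2Vec model you trained yourself, look words up on
# its .wv attribute (model.wv["apple"]) in Gensim 4.x.
```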
If `X` is a text – say, a list-of-words – well, a `Word2Vec` model only has vectors for words, not for texts.
If you have some desired way to use a list-of-words plus per-word-vectors to create a text-vector, you should apply that yourself. There are many potential approaches, some simple, some complicated, but no one 'official' or 'best' way.
One easy, popular baseline (a fair starting point, especially on very small texts like titles) is to average together all the word vectors. That can be as simple as (assuming `numpy` is imported as `np`):
```python
# element-wise mean over all the word vectors -> one fixed-width text vector
np.mean([word_model[word] for word in word_list], axis=0)
```
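If the goal is classification, the same averaging can turn a whole corpus into a fixed-width feature matrix. A minimal end-to-end sketch (the toy `texts`/`labels` data, the tiny training parameters, and the scikit-learn `LogisticRegression` are all illustrative assumptions, not part of the original answer):

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression

# Toy tokenized corpus and labels -- hypothetical stand-ins for real data.
texts = [
    ["good", "fun", "movie"],
    ["great", "fun", "film"],
    ["bad", "boring", "movie"],
    ["awful", "boring", "film"],
]
labels = [1, 1, 0, 0]

# Train a small Word2Vec model on the corpus itself (Gensim 4.x API).
w2v = Word2Vec(sentences=texts, vector_size=32, min_count=1, epochs=50, seed=1)
word_model = w2v.wv  # KeyedVectors: one vector per vocabulary word

def text_to_vector(word_list):
    """Average the vectors of in-vocabulary words; zeros if none are known."""
    vectors = [word_model[w] for w in word_list if w in word_model]
    if not vectors:
        return np.zeros(word_model.vector_size, dtype=np.float32)
    return np.mean(vectors, axis=0)

# One fixed-width row per text: the X_vector the question asks about.
X_vectors = np.vstack([text_to_vector(t) for t in texts])
classifier = LogisticRegression(max_iter=1000).fit(X_vectors, labels)
print(classifier.predict(X_vectors))
```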
But recent versions of Gensim also have a convenience `.get_mean_vector()` method for averaging together sets of vectors (specified as their word-keys, or raw vectors), with some other options.
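A short sketch of that call, reusing `word_model` and `word_list` from above (the keyword arguments reflect recent Gensim docs and may differ in older releases):

```python
# Averages the vectors for the given keys. Per recent Gensim docs, it
# unit-normalizes each word vector first (pre_normalize=True) and skips
# keys missing from the vocabulary (ignore_missing=True) by default.
text_vector = word_model.get_mean_vector(word_list)

# A plain (unnormalized) average, closer to the np.mean() baseline above.
plain_mean = word_model.get_mean_vector(word_list, pre_normalize=False)
```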