This is the version of gensim I am using:
Name: gensim
Version: 4.3.0
Summary: Python framework for fast Vector Space Modelling
Home-page: http://radimrehurek.com/gensim
Author: Radim Rehurek
Author-email: me@radimr
I want to convert sentences into vectors using Word2Vec. Is there any method other than infer_vector that converts a sentence into a vector? [Using Word2Vec is a requirement.]
Current code:
In:clean_data[:3]
Out:[['good'],
['nice'],
['its',
'ok',
'but',
'still',
'not',
'work',
'some',
'times',
'please',
'upgrade',
'a',
'valuable',
'process']]
In:from gensim.models import Word2Vec
In:model = Word2Vec(clean_data, vector_size=100, min_count=2, sg=1)
In:model.train(clean_data, total_examples=model.corpus_count, epochs=model.epochs)
In:model.infer_vector(['its','ok','but','still','not','work','some','times','please','upgrade','a','valuable','process'])
Error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11408/92733804.py in <module>
----> 1 model.infer_vector(['its','ok','but','still','not','work','some','times','please','upgrade','a','valuable','process'])
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
.infer_vector() is only available on the Doc2Vec model. Its underlying algorithm, "Paragraph Vectors", describes a standard way to learn fixed-length vectors associated with multi-word texts. The Doc2Vec class follows that algorithm, first during bulk training, then as an option on the frozen trained model via the .infer_vector() method.
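For contrast, a minimal Doc2Vec sketch where .infer_vector() does exist (the tag scheme and epochs=20 here are illustrative choices, not from your code):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Doc2Vec expects each training text to be wrapped with a unique tag.
tagged_data = [TaggedDocument(words=tokens, tags=[str(i)])
               for i, tokens in enumerate(clean_data)]

d2v_model = Doc2Vec(tagged_data, vector_size=100, min_count=2, epochs=20)

# Only Doc2Vec offers .infer_vector(): it infers a vector for a new
# list-of-tokens using the frozen trained model.
vector = d2v_model.infer_vector(['its', 'ok', 'but', 'still', 'not', 'work'])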
Word2Vec, on the other hand, is a model only for learning vectors for individual words. As an algorithm, word2vec says nothing about what a vector for a multi-word text should be.
Many people choose to use the average of all a multi-word text's individual word-vectors as a simple vector for the text as a whole. It's quick & easy to calculate, but fairly limited in its power. Still, for some applications, especially broad topical classifications that don't rely on any sort of grammatical/ordering understanding, such text-vectors work OK – especially as a starting baseline against which to compare additional techniques.
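For example, a bare-bones sketch of that averaging approach (assuming model is the Word2Vec model trained above):

import numpy as np

def average_vector(tokens, model):
    # Collect vectors only for words the model actually learned.
    word_vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not word_vectors:
        # No known words: fall back to a zero vector of the right size.
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(word_vectors, axis=0)

sentence_vector = average_vector(['its', 'ok', 'but', 'still', 'not', 'work'], model)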
Gensim's KeyedVectors class, which is how the Word2Vec model stores its learned word-vectors inside its .wv property, has a utility method, .get_mean_vector(), to help calculate the mean (aka average) of multiple word-vectors; see its entry in Gensim's KeyedVectors API documentation.
You could use it with a list-of-words like so:
multiword_average_vector = model.wv.get_mean_vector([
'its','ok','but','still','not','work','some',
'times','please','upgrade','a','valuable','process'
])
Note that it will by default ignore any words not present in the model, but if you'd prefer it to raise an error instead, you can pass the optional ignore_missing=False parameter.
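For example, this call should raise a KeyError (assuming the made-up token 'unseen_word' isn't in your model's vocabulary) instead of silently skipping the missing word:

model.wv.get_mean_vector(['its', 'ok', 'unseen_word'], ignore_missing=False)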
Separately: note that tiny toy-sized uses of Word2Vec generally won't show any useful properties & may mislead you about how the algorithm works on the larger datasets for which it is most valuable. You will generally want to train on corpora of at least hundreds of thousands (if not millions) of words, to create vocabularies with at least tens of thousands of known words (each with many contrasting, realistic usage examples in your training data), in order to see the real behavior/value of this algorithm.