This is the version of gensim I am using:
Name: gensim
Version: 4.3.0
Summary: Python framework for fast Vector Space Modelling
Home-page: http://radimrehurek.com/gensim
Author: Radim Rehurek
Author-email: me@radimr
I want to convert sentences into vectors using Word2Vec. Is there any method other than infer_vector that converts a sentence into a vector? [Using Word2Vec is a requirement.]
Current code:
In:clean_data[:3]
Out:[['good'],
['nice'],
['its',
'ok',
'but',
'still',
'not',
'work',
'some',
'times',
'please',
'upgrade',
'a',
'valuable',
'process']]
In:from gensim.models import Word2Vec
In:model = Word2Vec(clean_data, vector_size=100, min_count=2, sg=1)
In:model.train(clean_data, total_examples=model.corpus_count, epochs=model.epochs)
In:model.infer_vector(['its','ok','but','still','not','work','some','times','please','upgrade','a','valuable','process'])
Error:
AttributeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_11408/92733804.py in <module>
----> 1 model.infer_vector(['its','ok','but','still','not','work','some','times','please','upgrade','a','valuable','process'])
AttributeError: 'Word2Vec' object has no attribute 'infer_vector'
.infer_vector() is only available on the Doc2Vec model. Its underlying algorithm, "Paragraph Vectors", describes a standard way to learn fixed-length vectors associated with multi-word texts. The Doc2Vec class follows that algorithm, first during bulk training, then as an option on the frozen trained model via the .infer_vector() method.
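For contrast, a minimal Doc2Vec sketch where .infer_vector() does exist (the tag scheme and epochs=20 here are illustrative choices, not from your code):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Doc2Vec expects each training text to be wrapped with a unique tag.
tagged_data = [TaggedDocument(words=tokens, tags=[str(i)])
               for i, tokens in enumerate(clean_data)]

d2v_model = Doc2Vec(tagged_data, vector_size=100, min_count=2, epochs=20)

# Only Doc2Vec offers .infer_vector(): it infers a vector for a new
# list-of-tokens using the frozen trained model.
vector = d2v_model.infer_vector(['its', 'ok', 'but', 'still', 'not', 'work'])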
Word2Vec, on the other hand, is a model only for learning vectors for individual words. As an algorithm, word2vec says nothing about what a vector for a multi-word text should be.
Many people choose to use the average of all a multi-word text's individual word-vectors as a simple vector for the text as a whole. It's quick & easy to calculate, but fairly limited in its power. Still, for some applications, especially broad topical classifications that don't rely on any sort of grammatical/ordering understanding, such text-vectors work OK – especially as a starting baseline against which to compare additional techniques.
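For example, a bare-bones sketch of that averaging approach (assuming model is the Word2Vec model trained above):

import numpy as np

def average_vector(tokens, model):
    # Collect vectors only for words the model actually learned.
    word_vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not word_vectors:
        # No known words: fall back to a zero vector of the right size.
        return np.zeros(model.vector_size, dtype=np.float32)
    return np.mean(word_vectors, axis=0)

sentence_vector = average_vector(['its', 'ok', 'but', 'still', 'not', 'work'], model)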
Gensim's KeyedVectors class, which is how the Word2Vec model stores its learned word-vectors inside its .wv property, has a utility method, .get_mean_vector(), to help calculate the mean (aka average) of multiple word-vectors; see its entry in Gensim's KeyedVectors API documentation.
You could use it with a list-of-words like so:
multiword_average_vector = model.wv.get_mean_vector([
'its','ok','but','still','not','work','some',
'times','please','upgrade','a','valuable','process'
])
Note that it will by default ignore any words not present in the model, but if you'd prefer it to raise an error instead, you can pass the optional ignore_missing=False parameter.
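For example, this call should raise a KeyError (assuming the made-up token 'unseen_word' isn't in your model's vocabulary) instead of silently skipping the missing word:

model.wv.get_mean_vector(['its', 'ok', 'unseen_word'], ignore_missing=False)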
Separately: note that tiny toy-sized uses of Word2Vec generally won't show any useful properties & may mislead you about how the algorithm works on the larger datasets for which it is most valuable. You will generally want to train on corpora of at least hundreds of thousands (if not millions) of words, to create vocabularies with at least tens of thousands of known words (each with many contrasting, realistic usage examples in your training data), in order to see the real behavior/value of this algorithm.