Tags: python, text-classification, doc2vec

Doc2Vec infer_vector not working as expected


The program should be returning the second text in the list as the most similar, since it matches word for word. But that's not what happens here.

import gensim
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument


data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]


tagged_data=[TaggedDocument(word_tokenize(_d.lower()),tags=[str(i)]) for i,_d in enumerate(data)]

max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(size=vec_size,
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                negative=0,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    #print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")


loaded_model=Doc2Vec.load("d2v.model")
test_data=["I love coding in python".lower()]

v1=loaded_model.infer_vector(test_data)

similar_doc=loaded_model.docvecs.most_similar([v1])
print(similar_doc)

Output:

[('0', 0.17585766315460205), ('2', 0.055697083473205566), ('3', -0.02361609786748886), ('1', -0.2507985532283783)]

It's showing the first text in the list as most similar instead of the second text. Can you please help with this?


Solution

  • First, you won't get good results from Doc2Vec-style models with toy-sized datasets. Just four documents, and a vocabulary of about 20 unique words, can't create a meaningfully-contrasting "dense embedding" vector model full of 20-dimensional vectors.

    Second, if you set negative=0 in your model initialization, you're disabling the default model-training-correction mode (negative=5) – and you're not enabling the non-default, less-recommended alternative (hs=1). No training at all will be occurring. There may also be an error shown in the code output – but also, if you're running with at least INFO-level logging, you might notice other issues in the output.
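
    To surface those hints, INFO-level logging can be turned on with Python's standard logging module before building and training the model, for example:

    import logging

    # with INFO logging enabled, gensim reports vocabulary-building and training
    # progress, which makes silent problems (like negative=0 with hs=0 doing no
    # real training) much easier to spot
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                        level=logging.INFO)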

    Third, infer_vector() requires a list-of-word-tokens as its argument. You're providing a plain string. That will look like a list of one-character words to the code, so it's like you're asking it to infer on the 23-word sentence:

    ['i', ' ', 'l', 'o', 'v', 'e', ' ', 'c', ...]
    

    The argument to infer_vector() should be tokenized exactly the same as the training texts were tokenized. (If you used word_tokenize() during training, use it during inference, too.)

    infer_vector() will also use a number of repeated inference-passes over the text equal to the 'epochs' value inside the Doc2Vec model, unless you specify another value. Since you didn't specify an epochs value, the model will still have its default (inherited from Word2Vec) of epochs=5. Most Doc2Vec work uses 10-20 epochs during training, and using at least as many during inference seems a good practice.
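
    As a minimal sketch combining those two points (reusing the question's word_tokenize import and the saved "d2v.model" file; note that some older gensim 3.x releases spell the inference keyword steps rather than epochs):

    from nltk.tokenize import word_tokenize
    from gensim.models.doc2vec import Doc2Vec

    loaded_model = Doc2Vec.load("d2v.model")

    # tokenize the probe text exactly as the training texts were tokenized
    tokens = word_tokenize("I love coding in python".lower())

    # request 100 inference passes rather than the cached default
    # (older gensim releases use `steps=` instead of `epochs=`)
    v1 = loaded_model.infer_vector(tokens, epochs=100)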

    But also:

    Don't try to call train() more than once in a loop, or manage alpha in your own code, unless you are an expert.

    Whatever online example suggested a code block like your...

    for epoch in range(max_epochs):
        #print('iteration {0}'.format(epoch))
        model.train(tagged_data,
                    total_examples=model.corpus_count,
                    epochs=model.iter)
        # decrease the learning rate
        model.alpha -= 0.0002
        # fix the learning rate, no decay
        model.min_alpha = model.alpha
    

    ...is a bad example. It sends the effective alpha rate down-and-up incorrectly, it's very fragile if you ever want to change the number of epochs, it actually winds up running 500 epochs (100 * model.iter), and it's far more code than is necessary.

    Instead, leave the default alpha options alone, and specify your desired number of epochs when the model is created. That way, the model will have a meaningful epochs value cached to be used by a later infer_vector().

    Then, only call train() once. It will handle all epochs & alpha-management correctly. For example:

    model = Doc2Vec(size=vec_size,
                    min_count=1,  # not a good idea with real corpora, but OK here
                    dm=1,  # not necessary to specify since it's the default but OK  
                    epochs=max_epochs)
    model.build_vocab(tagged_data)
    model.train(tagged_data, 
                total_examples=model.corpus_count, 
                epochs=model.epochs)
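
    Continuing from that code, a lookup for the second text might then look like this (a sketch: model.docvecs is the gensim-3.x attribute, renamed model.dv in gensim 4.x, and with only four training documents the ranking can still be noisy, per the first point above):

    # tokenize the probe text the same way as training, infer with the cached
    # epochs value, then rank the trained document vectors by similarity
    tokens = word_tokenize("I love coding in python".lower())
    vector = model.infer_vector(tokens, epochs=model.epochs)
    print(model.docvecs.most_similar([vector]))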