
Why is the Doc2Vec result wrong for the same tokenized word list?


I'm using a Doc2Vec model. I pre-trained it on a dataset of more than 20K Wikipedia articles. After that, I tried to test the result by calculating the similarity between two sentences.

I have two sentences:

1. The process of searching for a job can be very stressful.
2. The job search process can be very stressful.

After preprocessing and tokenizing, the word list for sentence 1 is list_1 = ['process', 'search', 'job', 'stress'] and for sentence 2 is list_2 = ['job', 'search', 'process', 'stress']. I then compute vec_1 = doc2vec_model.infer_vector(list_1) and vec_2 = doc2vec_model.infer_vector(list_2), and use gensim.matutils.full2sparse and gensim.matutils.cossim to calculate the cosine similarity.

I got a result near 0, like 0.00709335870. That doesn't seem right; I think the result should be near 1.

What is my problem, and how do I fix this error?

This is a part of my code:

    import gensim

    # model.tokenize_word(data['document_1']) is ['process', 'search', 'job', 'stress']
    vec_1 = doc2vec_model.infer_vector(model.tokenize_word(data['document_1']))
    doc2vec_model.random.seed(0)

    # model.tokenize_word(data['document_2']) is ['job', 'search', 'process', 'stress']
    vec_2 = doc2vec_model.infer_vector(model.tokenize_word(data['document_2']))

    # Convert the dense vectors to gensim's sparse (index, value) format for cossim
    vec_1 = gensim.matutils.full2sparse(vec_1)
    vec_2 = gensim.matutils.full2sparse(vec_2)

    similarity = gensim.matutils.cossim(vec_1, vec_2)
    print(similarity)  # 0.00709335870

Solution

  • You haven't shown how you ran your Doc2Vec training; something may have gone wrong there. If the exact same set of 4 words gives very different infer_vector() results, as opposed to the slightly different results that are normal with this stochastic algorithm, some possible problems are:

    • none of the words are in the model
    • the model never underwent real training, perhaps due to some bug in the training code or supplied corpus
    • the model has atypical/inappropriate parameters
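
    The first possibility, and the "very different vs. slightly different" distinction, can be checked with a few lines of plain Python. This is a minimal sketch, not part of gensim: the helper names split_known_tokens, cosine and repeat_similarity are hypothetical, and it assumes gensim 4.x, where the word vocabulary is available as doc2vec_model.wv.key_to_index.

```python
import math


def split_known_tokens(tokens, vocabulary):
    """Split tokens into (known, unknown) according to the given vocabulary."""
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def repeat_similarity(infer, tokens, runs=3):
    """Infer a vector for the same tokens `runs` times and return the
    cosine similarity of each later run against the first run."""
    vectors = [list(infer(tokens)) for _ in range(runs)]
    return [cosine(vectors[0], v) for v in vectors[1:]]
```

    With a trained model, split_known_tokens(list_1, doc2vec_model.wv.key_to_index) would show which of the four words the model actually knows, and repeat_similarity(doc2vec_model.infer_vector, list_1) would show how much repeated inferences of the same list agree. Values far from 1 for identical input suggest one of the problems listed above rather than ordinary randomness.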

    I suggest:

    • set logging to the INFO level, and re-run your training, watching the output logs carefully. Verify that training takes time, & reports sensible values for the number of unique words in the model's vocabulary & total words in your corpus, & doesn't show errors/warnings
    • when training is done, check that values like len(doc2vec_model.dv) and len(doc2vec_model.wv) are sensible for the expected number of documents and known words
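
    The two suggestions above could be sketched as follows. The logging setup uses only the standard library, which gensim reports through; check_model_sizes and its min_words threshold are hypothetical names for illustration, and comparing len(model.dv) against the document count assumes each training document was given exactly one unique tag.

```python
import logging

# Gensim reports training progress through the standard logging module,
# so INFO-level output has to be enabled before training starts.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)


def check_model_sizes(model, expected_docs, min_words=1):
    """Return a list of human-readable problems; an empty list means the sizes look plausible."""
    problems = []
    n_docs = len(model.dv)   # number of trained document vectors
    n_words = len(model.wv)  # number of words in the vocabulary
    if n_docs != expected_docs:
        problems.append(f"expected {expected_docs} document vectors, found {n_docs}")
    if n_words < min_words:
        problems.append(f"vocabulary has only {n_words} words")
    return problems
```

    Run the logging setup before building and training the model, then call check_model_sizes(doc2vec_model, expected_docs=...) with your own document count once training has finished.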

    If you're then still having problems, expand your question text to show the code & parameters you used to initialize & train the Doc2Vec model, some meaningful excerpts from the logging that convinced you things were otherwise working, and some details of the size of your corpus (like total word count, total doc count, and average words per document).
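
    The corpus size details mentioned above can be gathered in one pass over the tokenized documents. This is a minimal sketch: corpus_stats is a hypothetical helper, and it assumes the corpus is an iterable of token lists like the ones passed to infer_vector().

```python
def corpus_stats(tokenized_docs):
    """Return total document count, total word count, and average words per document."""
    doc_count = 0
    word_count = 0
    for tokens in tokenized_docs:
        doc_count += 1
        word_count += len(tokens)
    average = word_count / doc_count if doc_count else 0.0
    return {"docs": doc_count, "words": word_count, "avg_words_per_doc": average}


# Tiny illustrative corpus; replace it with the real training documents.
sample_docs = [
    ["process", "search", "job", "stress"],
    ["job", "search", "process", "stress"],
]
print(corpus_stats(sample_docs))
```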

    Also note: