I created a Doc2Vec model from a MongoDB news database and tagged each document with its MongoDB collection ID:
from gensim.models.doc2vec import TaggedDocument
i=0
docs=[]
for artical in lstcontent:
    doct = TaggedDocument(clean_str(artical), [lstids[i]])
    docs.append(doct)
    i+=1
After that, I created the model with:
import logging
import gensim.models as g

pretrained_emb = 'tweet_cbow_300/tweets_cbow_300'
saved_path = "documentmodel/doc2vec_model.bin"
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = g.Doc2Vec(docs, size=vector_size, window=window_size, min_count=min_count,
                  sample=sampling_threshold, workers=worker_count, hs=0, dm=dm,
                  negative=negative_size, dbow_words=1, dm_concat=1,
                  pretrained_emb=pretrained_emb, iter=train_epoch)
model.save(saved_path)
When I use the model with this code:
import gensim.models as g
import codecs
model="documentmodel/doc2vec_model.bin"
start_alpha=0.01
infer_epoch=1000
m = g.Doc2Vec.load(model)
sims = m.docvecs.most_similar(['5aa94578094b4051695eeb10'])
sims
The output is:
[('5aa944c1094b4051695eeaef', 0.9255372881889343),
('5aa945c1094b4051695eeb1d', 0.9222575426101685),
('5aa94584094b4051695eeb12', 0.9210859537124634),
('5aa945d2094b4051695eeb20', 0.9083569049835205),
('5aa945c7094b4051695eeb1e', 0.905883252620697),
('5aa9458f094b4051695eeb14', 0.9054019451141357),
('5aa944c7094b4051695eeaf0', 0.9019848108291626),
('5aa94589094b4051695eeb13', 0.9012798070907593),
('5aa945b1094b4051695eeb1a', 0.9000773429870605),
('5aa945bc094b4051695eeb1c', 0.8999895453453064)]
The returned IDs are not related to 5aa94578094b4051695eeb10. Where is my problem?
It looks like you might be providing a string as the words of your TaggedDocument texts. It should be a list of word tokens. (If you supply a string, it will be seen as a list of single-character words, and the algorithm will try to run character-to-character predictions, which won't lead to very good vectors.)
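For example, a minimal sketch of the intended construction, assuming your existing lstcontent, lstids, and clean_str, and that clean_str returns a single cleaned string:

from gensim.models.doc2vec import TaggedDocument

# Hypothetical rewrite of the tagging loop: split each cleaned text into word tokens
docs = []
for artical, doc_id in zip(lstcontent, lstids):
    words = clean_str(artical).split()   # a list of words, not one string
    docs.append(TaggedDocument(words=words, tags=[doc_id]))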
If you enable INFO level logging, and watch the output, you may see hints that this is the problem, in the form of a very small count of vocabulary words, dozens rather than tens-of-thousands. Or if that's not the problem, you may see other discrepancies that hint at what's going wrong.
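As a rough sanity check, you can also inspect the trained (or re-loaded) model directly; the attribute names below assume the gensim 3.x API your code appears to use (gensim 4+ renamed wv.vocab to wv.key_to_index):

# Rough sanity checks on the model (gensim 3.x attribute names assumed)
print(len(m.wv.vocab))   # should be tens of thousands of words for a real news corpus
print(m.corpus_count)    # number of TaggedDocuments the model was trained on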
Separate observations & tips:
You're using a 'pretrained_emb' argument that's not part of standard gensim. If you're using an unofficial variant based on an older gensim, you might have other issues. Pretrained word embeddings are not necessary for Doc2Vec to work, and may not offer much benefit. (I would always try without any such extra complications first; only after you have a simple approach working as a baseline, try such added tweaks, and always evaluate whether they're really helping. See the baseline sketch after these tips.)
It's unclear how many iter passes you're using, but 10-20 are typical values, perhaps more if your corpus is small and/or typical texts are short.
dm=1, dm_concat=1 (PV-DM with a concatenative input layer) results in larger, slower models that may require much more data to become well-trained. It's not clear this dm_concat=1 mode is ever worth the trouble; at best it should be considered experimental. So I would get things working without it, before perhaps trying it as an advanced experiment.
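Putting those tips together, a plainer baseline configuration might look like the sketch below. The size/window/min_count/negative/iter values are only illustrative assumptions, and the parameter names follow the pre-4.0 gensim style used in the question (newer gensim uses vector_size and epochs instead of size and iter):

import gensim.models as g

# Hypothetical baseline: no pretrained_emb, no dm_concat, explicit iteration count
model = g.Doc2Vec(docs, size=300, window=5, min_count=5, workers=4,
                  dm=1, hs=0, negative=5, iter=20)
model.save("documentmodel/doc2vec_model.bin")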