I've created an artificial corpus of 52,624 documents. Each document is a list of objects drawn from a vocabulary of 461 distinct objects.
So one possibility could be: ['chair', 'chair', 'chair', 'chair', 'chair', 'table', 'table']
Here's a bar plot (log-scale) of the vocabulary.
And this is how I defined the model:
import gensim

model = gensim.models.doc2vec.Doc2Vec(vector_size=8, workers=4, min_count=1, epochs=40, dm=0)
Looking at:
model.wv.most_similar_cosmul(positive=["chair"])
I see unrelated words.
And it seems to me that the following works poorly as well:
inferred_vector = model.infer_vector(["chair"])
model.docvecs.most_similar([inferred_vector])
Where has my model failed?
UPDATE
Here is the data (JSON file):
Yes, Doc2Vec and Word2Vec are often tried, and useful, on synthetic data. But whether they work may require a lot more tinkering, and atypical parameters, when the data doesn't reflect the same sorts of correlations/distributions as the natural language on which these algorithms were first developed.
First and foremost with your setup, you're using the dm=0 mode. That's the PV-DBOW mode of the original "Paragraph Vector" paper, which specifically does not train word-vectors at all, only the doc-vectors. So if you're testing such a model by looking at word-vectors, your results will only reflect the random, untrained initialization values of those word-vectors.
Check the model.docvecs instead, for similarities between any doc-tags you specified in your data, where there may be more useful relationships.
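For example, assuming your documents were tagged with integer indices when the training corpus was built (the tag 0 below is just an illustrative choice), a minimal check would be:

# Hypothetical: assumes docs were trained as TaggedDocument(words=doc, tags=[i])
similar_docs = model.docvecs.most_similar(positive=[0], topn=5)
for tag, similarity in similar_docs:
    print(tag, similarity)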
(If you want your Doc2Vec model to learn words too – which isn't necessarily important, especially with a small dataset or where the doc-vectors are your main interest – you'd have to use either the dm=1 mode, or add dbow_words=1 to dm=0, so that the model adds interleaved skip-gram training. But note that word-vector training may be weak/meaningless/harmful with data that looks like sorted runs of repeating tokens, as in your ['chair', 'chair', 'chair', 'chair', 'chair', 'table', 'table'] example item.)
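A minimal sketch of the dbow_words=1 variant (the toy corpus below is just a stand-in for your real documents):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stand-in corpus: replace with your 52,624 real documents
docs = [['chair', 'chair', 'chair', 'chair', 'chair', 'table', 'table'],
        ['table', 'table', 'lamp']]
corpus = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(docs)]

# dbow_words=1 interleaves skip-gram word training with PV-DBOW doc training
model = Doc2Vec(corpus, vector_size=8, min_count=1, epochs=40,
                dm=0, dbow_words=1, workers=4)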
Separately, using a very low min_count=1 is often a bad idea in such models: tokens with arbitrary, idiosyncratic, non-representative appearances do more damage to the coherence of the surrounding more-common tokens than they help.
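For example, something along these lines, reusing the corpus from the sketch above (the exact threshold worth trying depends on your token-frequency distribution; 5 is just gensim's usual default):

# Drop tokens appearing fewer than 5 times instead of keeping everything
# (on the tiny toy corpus above this would discard every token; it's
#  meant for the real, much larger dataset)
model = Doc2Vec(corpus, vector_size=8, min_count=5, epochs=40,
                dm=0, workers=4)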