Tags: machine-learning, nlp, gensim, word2vec, doc2vec

Can doc2vec work on an artificial "text"?


I've created an artificial corpus of 52,624 documents. Each document is a list of objects; there are 461 distinct objects in total.

So one possibility could be: ['chair', 'chair', 'chair', 'chair', 'chair', 'table', 'table']

Here's a bar plot (log-scale) of the vocabulary.


And this is how I defined the model:

model = gensim.models.doc2vec.Doc2Vec(vector_size=8, workers=4, min_count=1, epochs=40, dm=0)

Looking at: model.wv.most_similar_cosmul(positive=["chair"])

I see unrelated words.

And it seems to me that the following works poorly as well:

inferred_vector = model.infer_vector(["chair"])
model.docvecs.most_similar([inferred_vector])

Where has my model failed?

UPDATE

Here is the data (JSON file):

https://gofile.io/d/bZDcPX


Solution

  • Yes, Doc2Vec & Word2Vec are often tried, and useful, on synthetic data. But making them work may require a lot more tinkering, and atypical parameters, when the data doesn't reflect the same sorts of correlations/distributions as the natural language on which these algorithms were first developed.

    First and foremost with your setup, you're using the dm=0 mode. That's the PV-DBOW mode of the original "Paragraph Vector" paper, which specifically does not train word-vectors at all, only the doc-vectors. So if you're testing such a model by looking at word-vectors, your results will only reflect the random, untrained initialization values of any word-vectors.

    Check model.docvecs instead, for similarities between any doc-tags you specified in your data; there may be more useful relationships.
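
    A minimal sketch of that check (assuming a gensim 3.x-style API to match the model.docvecs usage above; the corpus variable and the "doc_0"-style tags are hypothetical names, not taken from the question):

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        # `corpus` is assumed to be the list of token-lists; each document gets a string tag
        documents = [TaggedDocument(words=doc, tags=["doc_%d" % i])
                     for i, doc in enumerate(corpus)]

        model = Doc2Vec(documents, vector_size=8, dm=0, epochs=40, workers=4)

        # similarities among the trained doc-vectors, not the untrained word-vectors
        print(model.docvecs.most_similar("doc_0", topn=5))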

    (If you want your Doc2Vec model to learn words too – which isn't necessarily important, especially with a small dataset or where the doc-vectors are your main interest – you'd have to use either the dm=1 mode, or add dbow_words=1 to dm=0 so that the model adds interleaved skip-gram training. But note that word-vector training may be weak/meaningless/harmful with data that looks like just sorted runs of repeating tokens, as in your ['chair', 'chair', 'chair', 'chair', 'chair', 'table', 'table'] example item.)
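
    For illustration, the two word-training variants might be configured like this (standard Doc2Vec parameters; `documents` is the same tagged corpus assumed in the sketch above):

        # PV-DM mode: word-vectors and doc-vectors are trained together
        dm_model = Doc2Vec(documents, vector_size=8, dm=1, epochs=40, workers=4)

        # PV-DBOW plus interleaved skip-gram word training
        dbow_words_model = Doc2Vec(documents, vector_size=8, dm=0, dbow_words=1, epochs=40, workers=4)

        # only with one of these modes is a word-vector lookup meaningful
        print(dm_model.wv.most_similar("chair", topn=5))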

    Separately, using a very low min_count=1 is often a bad idea in such models, as tokens with arbitrary, idiosyncratic, non-representative appearances do more damage to the coherence of the surrounding, more-common tokens than they help.
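
    As a sketch (min_count=5 is an arbitrary illustrative threshold, not a recommendation), a higher min_count simply drops the rarest tokens before training:

        # tokens appearing fewer than min_count times are discarded from the vocabulary
        model = Doc2Vec(documents, vector_size=8, dm=0, min_count=5, epochs=40, workers=4)
        print(len(model.wv.vocab))  # surviving vocabulary size (gensim 3.x attribute)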