machine-learning, nlp, word2vec, doc2vec

Doc2vecC predicting vectors for unseen documents


I have trained a model on a set of documents using Doc2VecC.

https://github.com/mchen24/iclr2017

I am trying to generate embedding vectors for unseen documents. I trained on the documents as described in go.sh:

"""
time ./doc2vecc -train ./aclImdb/alldata-shuf.txt -word wordvectors.txt \
  -output docvectors.txt -cbow 1 -size 100 -window 10 -negative 5 -hs 0 \
  -sample 0 -threads 4 -binary 0 -iter 20 -min-count 10 \
  -test ./aclImdb/alldata.txt -sentence-sample 0.1 -save-vocab alldata.vocab
"""

I get docvectors.txt and wordvectors.txt for the training set. From here, how do I generate vectors for unseen test documents using the same model, without retraining?


Solution

  • As far as I can tell, the author (https://github.com/mchen24) of that doc2vecc.c code (and paper) just made minimal changes to some example 'paragraph vector' code that was itself a minimal change to the original Google/Mikolov word2vec.c (https://github.com/tmikolov/word2vec/blob/master/word2vec.c).

    Neither the 'paragraph vector' changes nor the subsequent doc2vecc changes appear to include any functionality for inferring vectors for new documents.

    Because these are unsupervised algorithms, for some purposes it may be appropriate to calculate the document-vectors for a downstream classification task, for both training and test texts, in the same combined bulk training. (Your real task may in fact include unlabeled examples that help learn the document-vectorization, even if your classifier should be trained and evaluated only on some subset of known-label texts.)
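
The combined-training idea above can be sketched in Python. This is only an illustration under stated assumptions, not part of the doc2vecc code: it assumes docvectors.txt holds one whitespace-separated vector per line, in the same order as the documents in the file passed to -test, and that you keep your own list of labels with None for texts whose label is unknown. The names load_doc_vectors and split_by_label are hypothetical helpers.

```python
def load_doc_vectors(lines):
    """Parse one whitespace-separated vector per line.

    Assumed layout of docvectors.txt; blank lines are skipped, so check
    that the row count still matches your document count afterwards.
    """
    vectors = []
    for line in lines:
        fields = line.split()
        if fields:
            vectors.append([float(x) for x in fields])
    return vectors


def split_by_label(vectors, labels):
    """Separate rows with a known label from rows whose label is None."""
    if len(vectors) != len(labels):
        raise ValueError("need one label entry (or None) per document vector")
    labeled = [(v, y) for v, y in zip(vectors, labels) if y is not None]
    unlabeled = [v for v, y in zip(vectors, labels) if y is None]
    return labeled, unlabeled


if __name__ == "__main__":
    # In practice: with open("docvectors.txt") as fh: vectors = load_doc_vectors(fh)
    demo_lines = ["0.1 0.2\n", "0.3 0.4\n", "0.5 0.6\n"]
    vectors = load_doc_vectors(demo_lines)
    # Hypothetical labels: documents 0 and 2 are labeled, document 1 is not.
    labeled, unlabeled = split_by_label(vectors, [1, None, 0])
    print(len(labeled), "labeled rows,", len(unlabeled), "unlabeled rows")
```

The labeled rows can then be divided into classifier training and evaluation sets, while the unlabeled rows still contributed to learning the vectors during the combined run.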