I am training a doc2vec model with multiple tags per document: the usual doc "ID" tag plus a label tag such as "Category 1". I'm trying to plot the results so that I get the doc distribution in 2-D (using LargeVis) and can color the points by tag. My problem is that the model returns 5 more vectors than there are training observations, which makes it difficult to align the original tags with the vectors:
In[1]: data.shape
Out[1]: (17717,5)
After training the model with 100-dimensional vectors:
In[2]: model.docvecs.doctag_syn0.shape
Out[2]: (17722,100)
I have no idea whether the 5 additional vectors shift the order of the others or whether they're just appended to the end. I want to avoid using string tags for the doc IDs because I am preparing this code for use on a much larger dataset. I found a Google Groups thread (https://groups.google.com/forum/#!topic/gensim/OdvQkwuADl0) which explains that using multiple tags per doc can produce this kind of output. However, I haven't been able to find a way to avoid or correct it in any forum or documentation.
The number of doc-vectors learned will be equal to the number of unique tags you've supplied. It looks like you've supplied 17,717 unique IDs and then 5 extra repeating category tags. Thus, there are 17,722 total known doc-tags (and corresponding learned doc-vectors). So, this is expected behavior.
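As a quick sanity check on that count, here is a minimal sketch. It assumes each document carries its integer ID plus one of five category strings; the category names and the `count_unique_tags` helper are hypothetical, made up for illustration.

```python
def count_unique_tags(tag_lists):
    """Return the number of distinct tags across all documents."""
    unique = set()
    for tags in tag_lists:
        unique.update(tags)
    return len(unique)


# Hypothetical tagging scheme: an int ID plus one of five category strings.
categories = ["cat_a", "cat_b", "cat_c", "cat_d", "cat_e"]
tag_lists = [[doc_id, categories[doc_id % 5]] for doc_id in range(17717)]

n_unique = count_unique_tags(tag_lists)
print(n_unique)  # 17722 = 17717 IDs + 5 category tags
```

Running the same count over your real tag lists should reproduce the 17,722 rows you see in the model.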
If you need to pass just the 17,717 per-doc vectors to some other process (like a dimensionality reduction to 2-D), you'll have to pull them out of the model. You could pull them out one by one, with model.docvecs[doc_id], and put them into whatever form the next step needs.
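A minimal sketch of that one-by-one approach, assuming the next step wants a 2-D NumPy array. The `collect_doc_vectors` helper is hypothetical, and a plain dict stands in for model.docvecs so the example runs without a trained model; with a real model you would pass model.docvecs instead.

```python
import numpy as np


def collect_doc_vectors(docvecs, doc_ids):
    """Look up each ID in turn and stack the results into one 2-D array.

    `docvecs` is anything that supports docvecs[doc_id] and returns a
    1-D vector, such as model.docvecs from the question.
    """
    return np.vstack([docvecs[doc_id] for doc_id in doc_ids])


# Stand-in for model.docvecs (assumption: a plain dict, illustration only).
# It holds three per-document vectors plus one category-tag vector.
fake_docvecs = {
    0: np.array([0.1, 0.2]),
    1: np.array([0.3, 0.4]),
    2: np.array([0.5, 0.6]),
    "Category 1": np.array([9.0, 9.0]),
}

doc_ids = [0, 1, 2]  # only the per-document IDs, not the category tag
doc_vectors = collect_doc_vectors(fake_docvecs, doc_ids)
print(doc_vectors.shape)  # (3, 2)
```

Because row i of the result corresponds to doc_ids[i], the alignment between vectors and your original rows is explicit, whatever order the model stores them in.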
If your doc-IDs happen to have been plain ints, from 0 to 17,716, then they will in fact be the first 17,717 entries in the model.docvecs.doctag_syn0 array, which might make things easier: you may just be able to use a view into that array. (The last five rows will be the vectors for the string tags.)
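A sketch of the view approach, under the assumption stated above that the int IDs run contiguously from 0 to 17,716. A zero-filled array with the shape from the question stands in for model.docvecs.doctag_syn0, and the `per_doc_view` helper is hypothetical.

```python
import numpy as np


def per_doc_view(doctag_array, n_docs):
    """Return a view of the first n_docs rows (no data is copied)."""
    if n_docs > doctag_array.shape[0]:
        raise ValueError("n_docs exceeds the number of stored vectors")
    return doctag_array[:n_docs]


# Stand-in for model.docvecs.doctag_syn0 (assumption: same shape as in the
# question, 17717 per-doc rows followed by 5 category-tag rows).
stand_in = np.zeros((17722, 100), dtype=np.float32)

doc_vectors = per_doc_view(stand_in, 17717)
print(doc_vectors.shape)  # (17717, 100)
```

Row i of the view is then the vector for doc ID i. Since it is a view rather than a copy, call .copy() on it if the next step might modify the array in place.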
I would suggest doing all your steps first without the extra complication of adding the secondary category string tags. Such extra tags may help or hurt vector-usefulness for downstream tasks in different situations, but definitely (as you've seen) make things a bit more complicated. So getting baseline results and outputs, without that complication, may be helpful.