python machine-learning nlp gensim doc2vec

Cannot align graph because multi-tag doc2vec returns more items in doctag_syn0 than there are in the training data


I am training a doc2vec model with multiple tags per document: each doc has the usual doc "ID" tag and also a label tag such as "Category 1." I'm trying to plot the results so that I get the doc distribution in 2-D (using LargeVis) and can color the points by tag. My problem is that the model returns 5 more vectors than there are training observations, which makes it difficult to align the original tags with the vectors:

In[1]: data.shape 
Out[1]: (17717,5)

Training the model with 100-dimensional vectors:

In[2]: model.docvecs.doctag_syn0.shape
Out[2]: (17722,100) 

I have no idea whether the 5 additional vectors shift the order of the others or whether they're just appended to the end. I want to avoid using string tags for the doc IDs because I am preparing this code for use on a much larger dataset. I found a Google Groups thread, https://groups.google.com/forum/#!topic/gensim/OdvQkwuADl0, which explains that using multiple tags per doc can produce this kind of output. However, I haven't been able to find a way to avoid or correct it in any forum or documentation.


Solution

  • The number of doc-vectors learned will be equal to the number of unique tags you've supplied. It looks like perhaps you've supplied 17,717 unique-IDs and then 5 extra repeating category-tags. Thus, there are 17,722 total known doc-tags (and thus corresponding learned doc-vectors). So, this is expected behavior.

    If you need to pass just the 17,717 per-doc vectors to some other process (like a dimensionality-reduction to 2-d), you'll have to pull them out of the model. You could pull them out 1-by-1 – model.docvecs[doc_id] – and put them into whatever form the next step needs.

    If your doc-IDs happen to be plain ints from 0 to 17,716, then they will in fact be the first 17,717 entries in the model.docvecs.doctag_syn0 array, which might make things easier: you may just be able to use a view into that array. (The last five rows will be the string tags.)

    I would suggest doing all your steps first without the extra complication of adding the secondary category string tags. Such extra tags may help or hurt vector-usefulness for downstream tasks in different situations, but definitely (as you've seen) make things a bit more complicated. So getting baseline results and outputs, without that complication, may be helpful.
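
To make the counting and the alignment concrete, here is a minimal sketch in plain Python. It does not call gensim: nested lists stand in for model.docvecs.doctag_syn0, and every name in it (tagged_docs, fake_vectors, row_tags, and so on) is hypothetical. It also assumes the row layout described above, with int IDs 0..N-1 in the first N rows and the string tags after them, so treat it as an illustration rather than a description of gensim internals.

```python
# Toy stand-in for the question's setup: every document carries an int ID
# tag plus one repeating string category tag. All names are hypothetical.
n_docs = 12
categories = ["cat_a", "cat_b", "cat_c"]
tagged_docs = [
    {"words": ["some", "text"], "tags": [i, categories[i % len(categories)]]}
    for i in range(n_docs)
]

# One doc-vector is learned per *unique* tag, so count the unique tags.
unique_tags = {tag for doc in tagged_docs for tag in doc["tags"]}
n_vectors = len(unique_tags)  # 12 int IDs + 3 category tags = 15

# Assumed layout: int IDs fill the first n_docs rows, string tags follow.
int_tags = sorted(t for t in unique_tags if isinstance(t, int))
str_tags = [t for t in unique_tags if isinstance(t, str)]
row_tags = int_tags + str_tags

# Fake "vector array": one 4-dimensional row per known tag.
fake_vectors = [[float(row)] * 4 for row in range(n_vectors)]

# Option 1: slice off the per-doc rows. With a real model this would
# correspond to model.docvecs.doctag_syn0[:n_docs].
per_doc_vectors = fake_vectors[:n_docs]

# Option 2: look each doc up by its ID, one at a time. With a real model
# this would correspond to model.docvecs[doc_id].
tag_to_row = {tag: row for row, tag in enumerate(row_tags)}
looked_up = [fake_vectors[tag_to_row[doc_id]] for doc_id in range(n_docs)]

# Row i belongs to doc ID i, so the category labels used for coloring
# line up with the vectors by position.
doc_categories = [doc["tags"][1] for doc in tagged_docs]
```

Under that layout both options give the same rows, and doc_categories[i] is the color label for per_doc_vectors[i]. If the IDs are not contiguous ints starting at 0, only the per-ID lookup is safe.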