machine-learning, gensim, word-embedding, doc2vec

Gensim Doc2Vec produces more vectors than given documents when I pass unique integer IDs as tags


I'm trying to build document vectors from the gensim example corpus using Doc2Vec. I passed a list of TaggedDocument objects containing 9 docs and 9 tags.

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
idx = [0, 1, 2, 3, 4, 5, 6, 7, 100]  # 9 tags, but not contiguous: 100 is far above the rest
documents = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, idx)]
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

and it produces 101 vectors, as shown in the attached screenshot ("gensim doc2vec produced 101 vectors").

What I want to know is:

  1. How can I be sure that the tag I passed is attached to the right vector?
  2. Where did the vectors for tags I didn't pass (8–99 in my case) come from? Were they just left blank?

Solution

  • If you use plain ints as your document-tags, then the Doc2Vec model will allocate enough doc-vectors for every int up to the highest int you provide - even if you don't use some of those ints.

    This assumption, that all ints up to the highest declared are used, allows the code to avoid creating a redundant {tag -> slot} dictionary, saving a little memory. That specific potential savings is the main reason for supporting plain ints (rather than unique strings) as tag names.

    Any such doc-vectors that are allocated but never subject to any training will be randomly initialized the same as the others, but never adjusted by training.

    If you want to use plain int tag names, you should either be comfortable with this over-allocation, or make sure you use contiguous int IDs from 0 to your max ID, with none unused. But unless your training data is very large, using unique string tags, and allowing the {tag -> slot} dictionary to be created, is straightforward and not too expensive in memory (see the sketch below).

    (Separately: min_count=1 is almost always a bad idea in these algorithms, as discarding rare tokens tends to give better results than letting their thin example usages interfere with other training.)
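
To make the over-allocation concrete, here is a minimal sketch (not part of the original answer) that reuses the question's toy corpus and assumes gensim 4.x, where the trained doc-vectors are exposed as model.dv. It contrasts plain-int tags with unique string tags and shows how to look up a vector by the exact tag you supplied:

from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Plain-int tags with a gap: the model allocates max(tag) + 1 doc-vector slots.
int_tags = [0, 1, 2, 3, 4, 5, 6, 7, 100]
int_docs = [TaggedDocument(doc, [i]) for doc, i in zip(common_texts, int_tags)]
int_model = Doc2Vec(int_docs, vector_size=5, window=2, min_count=1, workers=4)
print(len(int_model.dv))      # 101 -- slots 8..99 exist but are never trained
print(int_model.dv[100])      # the trained vector for the doc tagged 100

# Unique string tags: exactly one doc-vector per document, at the cost of a
# {tag -> slot} dictionary held in memory.
str_docs = [TaggedDocument(doc, ['doc_%d' % i]) for i, doc in enumerate(common_texts)]
str_model = Doc2Vec(str_docs, vector_size=5, window=2, min_count=1, workers=4)
print(len(str_model.dv))      # 9
print(str_model.dv['doc_0'])  # the vector for the document you tagged 'doc_0'

# (min_count=1 is kept here only to mirror the question; see the note above.)

Because there is no {tag -> slot} dictionary for int tags, int_model.dv[42] also returns a vector, but it is just the untouched random initialization for a slot that no document ever trained.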