Tags: python-3.x, gensim, doc2vec

Doc2Vec model splits document tags into characters


I'm using gensim 3.0.1.

I have a list of TaggedDocument objects with unique labels of the form "label_17", but when I train the Doc2Vec model, it somehow splits the labels into single characters, so the output of model.docvecs.doctags is the following:

{'0': Doctag(offset=5, word_count=378, doc_count=40),
 '1': Doctag(offset=6, word_count=1330, doc_count=141),
 '2': Doctag(offset=7, word_count=413, doc_count=50),
 '3': Doctag(offset=8, word_count=365, doc_count=41),
 '4': Doctag(offset=9, word_count=395, doc_count=41),
 '5': Doctag(offset=10, word_count=420, doc_count=41),
 '6': Doctag(offset=11, word_count=408, doc_count=41),
 '7': Doctag(offset=12, word_count=426, doc_count=41),
 '8': Doctag(offset=13, word_count=385, doc_count=41),
 '9': Doctag(offset=14, word_count=376, doc_count=40),
 '_': Doctag(offset=4, word_count=2009, doc_count=209),
 'a': Doctag(offset=1, word_count=2009, doc_count=209),
 'b': Doctag(offset=2, word_count=2009, doc_count=209),
 'e': Doctag(offset=3, word_count=2009, doc_count=209),
 'l': Doctag(offset=0, word_count=4018, doc_count=418)}

In the initial list of tagged documents, however, each document has its own unique label.

The code for model training is the following:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(size=300, sample=1e-4, workers=2)
print('Building Vocabulary')
model.build_vocab(data)
print('Training...')
model.train(data, total_words=total_words_count, epochs=20)

As a result, I can't index my documents like model.docvecs['label_17']; I get a KeyError instead.

The same thing happens if I pass data to the constructor instead of building the vocabulary and training separately.

Why is this happening? Thanks.


Solution

  • Doc2Vec expects the text examples, objects of the shape TaggedDocument, to have a tags property that's a list-of-tags.

    If you instead supply a string like 'label_17', it is actually treated as a *list-of-characters*, so it's essentially saying that the TaggedDocument has the tags:

    ['l', 'a', 'b', 'e', 'l', '_', '1', '7']
    

    Make sure tags is a list-of-one-tag, for example tags=['label_17'], and the trained tags will look like what you expect.
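    For instance, here's a minimal sketch of the difference (the placeholder words are illustrative, not from your data):

    from gensim.models.doc2vec import TaggedDocument

    # Wrong: a bare string tag gets iterated character-by-character
    doc_bad = TaggedDocument(words=['some', 'words'], tags='label_17')

    # Right: wrap the single tag in a one-element list
    doc_good = TaggedDocument(words=['some', 'words'], tags=['label_17'])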

    Separately: it appears you have about 200 documents of about 10 words each. Note that Word2Vec/Doc2Vec need large, varied datasets to get good results. In particular, with just 200 texts but 300 vector-dimensions, training can get quite good at its internal word-prediction task by doing little more than memorizing the idiosyncrasies of the training set. That's essentially 'overfitting', and it does not produce vectors whose distances/arrangement represent generalizable knowledge that would transfer to other examples.
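
    Putting both points together, here's a minimal sketch of a corrected pipeline (the tiny stand-in corpus and the smaller size=50 are assumptions for illustration, not values from the question):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Stand-in corpus: each document's single tag is wrapped in a one-element list
    raw_docs = [['some', 'tokenized', 'words'], ['more', 'example', 'tokens']]
    data = [TaggedDocument(words=words, tags=['label_%d' % i])
            for i, words in enumerate(raw_docs)]

    model = Doc2Vec(size=50, sample=1e-4, workers=2)  # fewer dimensions for a small corpus
    model.build_vocab(data)
    model.train(data, total_examples=model.corpus_count, epochs=20)

    print(model.docvecs.doctags)     # tags now appear whole: 'label_0', 'label_1', ...
    print(model.docvecs['label_0'])  # indexing by the full tag works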