I'm using gensim 3.0.1. I have a list of TaggedDocument objects with unique labels of the form "label_17", but when I train the Doc2Vec model it somehow splits the labels into individual characters, so the output of model.docvecs.doctags is the following:
{'0': Doctag(offset=5, word_count=378, doc_count=40),
'1': Doctag(offset=6, word_count=1330, doc_count=141),
'2': Doctag(offset=7, word_count=413, doc_count=50),
'3': Doctag(offset=8, word_count=365, doc_count=41),
'4': Doctag(offset=9, word_count=395, doc_count=41),
'5': Doctag(offset=10, word_count=420, doc_count=41),
'6': Doctag(offset=11, word_count=408, doc_count=41),
'7': Doctag(offset=12, word_count=426, doc_count=41),
'8': Doctag(offset=13, word_count=385, doc_count=41),
'9': Doctag(offset=14, word_count=376, doc_count=40),
'_': Doctag(offset=4, word_count=2009, doc_count=209),
'a': Doctag(offset=1, word_count=2009, doc_count=209),
'b': Doctag(offset=2, word_count=2009, doc_count=209),
'e': Doctag(offset=3, word_count=2009, doc_count=209),
'l': Doctag(offset=0, word_count=4018, doc_count=418)}
but in the initial list of tagged documents each document has its own unique label. The code for model training is the following:
from gensim.models.doc2vec import Doc2Vec

# `data` is my list of TaggedDocument objects; `total_words_count` is computed earlier
model = Doc2Vec(size=300, sample=1e-4, workers=2)
print('Building Vocabulary')
model.build_vocab(data)
print('Training...')
model.train(data, total_words=total_words_count, epochs=20)
Therefore I can't index my documents like model.docvecs['label_17']; doing so raises a KeyError. The same thing happens if I pass the data to the Doc2Vec constructor instead of building the vocabulary explicitly. Why is this happening? Thanks.
Doc2Vec expects the text examples, objects of the shape TaggedDocument, to have a tags property that's a list-of-tags. If you instead supply a string like 'label_17', it is actually a *list-of-characters*, so it's essentially saying the TaggedDocument has the tags:

['l', 'a', 'b', 'e', 'l', '_', '1', '7']
Make sure you make tags a list-of-one-tag, for example tags=['label_17'], and you should see trained tags more like what you expect.
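For example, here's a minimal sketch of preparing the corpus that way (texts is a stand-in for your own tokenized documents, and min_count=1 is only there so the toy corpus isn't filtered out):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stand-in corpus; substitute your own tokenized documents.
texts = [['first', 'document', 'words'],
         ['second', 'document', 'words']]

# Each tag is wrapped in a one-element list, not passed as a bare string.
data = [TaggedDocument(words=words, tags=['label_%d' % i])
        for i, words in enumerate(texts)]

model = Doc2Vec(size=300, sample=1e-4, workers=2, min_count=1)
model.build_vocab(data)
model.train(data, total_examples=model.corpus_count, epochs=20)

print(model.docvecs.doctags)     # keys are now 'label_0', 'label_1', ...
print(model.docvecs['label_0'])  # indexing by the full tag works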
Separately: it appears you have about 200 documents of about 10 words each. Note that Word2Vec/Doc2Vec need large, varied datasets to get good results. In particular, with just 200 texts but 300 vector dimensions, training can get quite good at its internal word-prediction task by doing little more than memorizing the idiosyncrasies of the training set. That's essentially overfitting, and it does not produce vectors whose distances/arrangement represent generalizable knowledge that would transfer to other examples.
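As a rough illustration only (50 is an assumed value, not a tuned recommendation), choosing far fewer dimensions than you have documents is a saner starting point for a corpus this size:

# Sketch: far fewer dimensions than documents, to reduce the risk of
# simply memorizing the training set.
model = Doc2Vec(size=50, sample=1e-4, workers=2)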