I'm trying create an algorithm that's capable of show the top n documents similar to a specific document. For that i used the gensim doc2vec. The code is bellow:
model = gensim.models.doc2vec.Doc2Vec(size=400, window=8, min_count=5, workers = 11,
dm=0,alpha = 0.025, min_alpha = 0.025, dbow_words = 1)
model.build_vocab(train_corpus)
for x in xrange(10):
model.train(train_corpus)
model.alpha -= 0.002
model.min_alpha = model.alpha
model.train(train_corpus)
model.save('model_EN_BigTrain')
sims = model.docvecs.most_similar([408], topn=10)
The sims var should give me 10 tuples, being the first element the id of the doc and the second the score. The problem is that some id's do not correspond to any document in my training data.
I've been trying for some time now to make sense out of the ids that aren't in my training data but i don't see any logic.
Ps: This is the code that i used to create my train_corpus
def readData(train_corpus, jData):
print("The response contains {0} properties".format(len(jData)))
print("\n")
for i in xrange(len(jData)):
print "> Reading offers from Aux array"
if i % 10 == 0:
print ">>", i, "offers processed..."
train_corpus.append(gensim.models.doc2vec.TaggedDocument(gensim.utils.simple_preprocess(jData[i][1]), tags=[jData[i][0]]))
print "> Finished processing offers"
Being each position of the aux array one array in witch the position 0 is an int (that i want to be the id) and the position 1 a description
Thanks in advance.
Are you using plain integer IDs as your tags
, but not using exactly all of the integers from 0 to whatever your MAX_DOC_ID
is?
If so, that could explain the appearance of tags within that range. When you use plain ints, gensim Doc2Vec avoids creating a dict mapping provided tags to index-positions in its internal vector-array – and just uses the ints themselves.
Thus that internal vector-array must be allocated to include MAX_DOC_ID + 1
rows. Any rows corresponding to unused IDs are still initialized as random vectors, like all the positions, but won't receive any of the training from actual text examples to push them into meaningful relative positions. It's thus possible these random-initialized-but-untrained vectors could appear in later most_similar()
results.
To avoid that, either use only contiguous ints from 0 to the last ID you need. Or, if you can afford the memory cost of the string-to-index mapping, use string tags instead of plain ints. Or, keep an extra record of the valid IDs and manually filter the unwanted IDs from results.
Separately: by not specifying iter=1
in your Doc2Vec model initialization, the default of iter=5
will be in effect, meaning each call to train()
does 5 iterations over your data. Oddly, also, your xrange(10)
for-loop includes two separate calls to train()
each iteration (and the 1st is just using whatever alpha/min_alpha was already in place). So you're actually doing 10 * 2 * 5 = 100 passes over the data, with an odd learning-rate schedule.
I suggest instead if you want 10 passes to just set iter=10
, leave default alpha
/min_alpha
untouched, and then call train()
only once. The model will do 10 passes, smoothly managing alpha from its starting to ending values.