I am building an NLP chat application in Python using the gensim library's doc2vec model. I have hard-coded the documents as a set of training examples, and as a first step I am testing the model by posing a user question and finding the most similar documents. In this case my test question is an exact copy of a document from the training examples.
import gensim
from gensim import models
sentence = models.doc2vec.LabeledSentence(words=[u'sampling',u'what',u'is',u'tell',u'me',u'about'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'what',u'is',u'my',u'limit',u'how',u'much',u'can',u'I',u'claim'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'I',u'am',u'retiring',u'how',u'much',u'can',u'claim',u'have', u'resigned'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'what',u'is',u'my',u'eligibility',u'post',u'my',u'promotion'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'what',u'is',u'my',u'eligibility',u'post',u'my',u'promotion'], tags=["SENT_4"])
sentences = [sentence, sentence1, sentence2, sentence3, sentence4]
class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for uid, line in enumerate(open(self.filename)):
            yield models.doc2vec.LabeledSentence(words=line.split(), tags=['SENT_%s' % uid])
model = models.Doc2Vec(alpha=0.03, min_alpha=.025, min_count=2)
model.build_vocab(sentences)
for epoch in range(30):
    model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
model.save("my_model.doc2vec")
model_loaded = models.Doc2Vec.load('my_model.doc2vec')
print (model_loaded.docvecs.most_similar(["SENT_4"]))
Result:
[('SENT_1', 0.043695494532585144), ('SENT_2', 0.0017897281795740128), ('SENT_0', -0.018954679369926453), ('SENT_3', -0.08253869414329529)]
The similarity of SENT_4 and SENT_3 is only -0.08253869414329529, when it should be 1 since they are exactly the same. How can I improve this accuracy? Is there a specific way of training documents that I am missing?
Word2Vec/Doc2Vec don't work well on toy-sized examples (such as few texts, short texts, and few total words). Many of the desirable properties are only reliably achieved with training sets of millions of words, or tens-of-thousands of documents.
In particular, with only 5 examples, and only a dozen or two words, but 100 dimensions of modeling vectors, the training isn't forced to do the main thing that makes word-vectors/doc-vectors useful: compress representations into dense embeddings, where similar items have to be incrementally nudged near each other in vector space because there's no way to retain all the original variation in a sort-of-giant-lookup-table. With more dimensions than corpus variation, your identical-token SENT_3 and SENT_4 can adopt wildly different doc-vectors, and the model is still large enough to do great on its training task (essentially, to 'overfit') without ever being forced into the desired end-state of similar texts having similar vectors.
You can sometimes squeeze a little more meaning out of small datasets with more training iterations and a much smaller model (in terms of vector size), but really: these vectors need large, varied datasets to become meaningful.
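For instance, a minimal sketch of that direction (assuming gensim 3.4 or later, where the relevant parameters are named vector_size and epochs; the exact values are illustrative, not tuned):

from gensim import models

# A much smaller model and many more training passes than usual,
# leaving less room for a tiny corpus to simply be memorized.
model = models.Doc2Vec(vector_size=10, min_count=1, epochs=200)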
That's the main issue. Some other inefficiencies or errors in your example code:
Your code doesn't use the class LabeledLineSentence, so there's no need to include it here – it's irrelevant boilerplate. (Also, TaggedDocument is the preferred name for the words+tags document class in recent gensim versions, rather than LabeledSentence.)
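For example, the first document above could be written with the newer class, which takes the same words/tags arguments (a sketch):

from gensim.models.doc2vec import TaggedDocument

# A TaggedDocument is just a list of word tokens plus a list of tags.
sentence = TaggedDocument(
    words=[u'sampling', u'what', u'is', u'tell', u'me', u'about'],
    tags=['SENT_0'])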
Your custom management of alpha and min_alpha is unlikely to do anything useful. These are best left at their defaults unless you already have something working, understand the algorithm well, and then want to try subtle optimizations.
train() will do its own iterations, so you don't need to call it many times in an outer loop. (As written, this code does 5 model.iter iterations in its first loop at alpha values gradually descending from 0.03 to 0.025, then 5 iterations at a fixed alpha of 0.028, then 5 more at 0.026, then 27 more sets of 5 iterations at decreasing alpha, ending on the 30th loop at a fixed alpha of -0.028. That's a nonsense ending value – the learning rate should never be negative – at the end of a nonsense progression. Even with a big dataset, these 150 iterations, about half of them happening at negative alpha values, would likely yield weird results.)
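Putting those points together, a minimal sketch of the simpler pattern: build the vocabulary once, call train() once, and let the model manage alpha internally. (Parameter names assume gensim 3.4+; in gensim 4.x, docvecs is accessed as dv.)

from gensim import models
from gensim.models.doc2vec import TaggedDocument

sentences = [
    TaggedDocument(words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'],
                   tags=['SENT_3']),
    TaggedDocument(words=[u'what', u'is', u'my', u'eligibility', u'post', u'my', u'promotion'],
                   tags=['SENT_4']),
]

# Small vectors for a tiny corpus; min_count=1 so no words are discarded.
model = models.Doc2Vec(vector_size=10, min_count=1, epochs=100)
model.build_vocab(sentences)

# A single train() call runs all `epochs` passes, decaying alpha internally.
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)

model.save("my_model.doc2vec")
model_loaded = models.Doc2Vec.load("my_model.doc2vec")
print(model_loaded.docvecs.most_similar(["SENT_4"]))

Even with this cleaner loop, the toy-sized corpus remains the dominant problem; the pattern only pays off on realistically large, varied data.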