python nlp gensim word2vec doc2vec

How can I improve the cosine similarity of two documents (sentences) in a doc2vec model?


I am building an NLP chat application in Python using the gensim library and its doc2vec model. I have hard-coded documents and, given a set of training examples, I test the model by submitting a user question and then finding the most similar documents as a first step. In this case my test question is an exact copy of a document from the training examples.

import gensim
from gensim import models
sentence = models.doc2vec.LabeledSentence(words=[u'sampling',u'what',u'is',u'tell',u'me',u'about'],tags=["SENT_0"])
sentence1 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'what',u'is',u'my',u'limit',u'how',u'much',u'can',u'I',u'claim'],tags=["SENT_1"])
sentence2 = models.doc2vec.LabeledSentence(words=[u'eligibility',u'I',u'am',u'retiring',u'how',u'much',u'can',u'claim',u'have', u'resigned'],tags=["SENT_2"])
sentence3 = models.doc2vec.LabeledSentence(words=[u'what',u'is',u'my',u'eligibility',u'post',u'my',u'promotion'],tags=["SENT_3"])
sentence4 = models.doc2vec.LabeledSentence(words=[u'what',u'is', u'my',u'eligibility',u'post',u'my',u'promotion'], tags=["SENT_4"])
sentences = [sentence, sentence1, sentence2, sentence3, sentence4]
class LabeledLineSentence(object):
    def __init__(self, filename):
        self.filename = filename
    def __iter__(self):
        for uid, line in enumerate(open(filename)):
            yield LabeledSentence(words=line.split(), labels=['SENT_%s' % uid])
model = models.Doc2Vec(alpha=0.03, min_alpha=.025, min_count=2)
model.build_vocab(sentences)
for epoch in range(30):
    model.train(sentences, total_examples=model.corpus_count, epochs = model.iter)
    model.alpha -= 0.002  # decrease the learning rate
    model.min_alpha = model.alpha  # fix the learning rate, no decay
model.save("my_model.doc2vec")
model_loaded = models.Doc2Vec.load('my_model.doc2vec')
print (model_loaded.docvecs.most_similar(["SENT_4"]))

Result:

[('SENT_1', 0.043695494532585144), ('SENT_2', 0.0017897281795740128), ('SENT_0', -0.018954679369926453), ('SENT_3', -0.08253869414329529)]

The similarity of SENT_4 and SENT_3 is only -0.08253869414329529, when it should be 1 since they are exactly the same. How can I improve this accuracy? Is there a specific way of training on documents that I am missing?


Solution

  • Word2Vec/Doc2Vec don't work well on toy-sized examples (such as few texts, short texts, and few total words). Many of the desirable properties are only reliably achieved with training sets of millions of words, or tens-of-thousands of documents.

    In particular, with only 5 examples and only a dozen or two distinct words, but 100-dimensional vectors (the default), the training isn't forced to do the main thing that makes word-vectors/doc-vectors useful: compress representations into dense embeddings. Similar items get incrementally nudged near each other in vector space only because the model has no room to retain all the original variation in a sort of giant lookup table. With more dimensions than corpus variation, your identical-token SENT_3 and SENT_4 can adopt wildly different doc-vectors, and the model is still large enough to do great on its training task (essentially, to 'overfit') without ever being forced into the desired end state, where similar texts have similar vectors.

    You can sometimes squeeze a little more meaning out of small datasets with more training iterations, and a much-smaller model (in terms of vector size), but really: these vectors need large, varied datasets to become meaningful.

    That's the main issue. Some other inefficiencies or errors in your example code:

    • Your code doesn't use the class LabeledLineSentence, so there's no need to include it here – it's irrelevant boilerplate. (Also, TaggedDocument is the preferred name for the words+tags document class in recent gensim versions, rather than LabeledSentence.)

    • Your custom-management of alpha and min_alpha is unlikely to do anything useful. These are best left at their defaults unless you already have something working, understand the algorithm well, and then want to try subtle optimizations.

    • train() will do its own iterations, so you don't need to call it many times in an outer loop. (As written, the first pass through your loop runs model.iter iterations (5) with alpha gradually descending from 0.03 to 0.025, the second pass runs 5 iterations at a fixed alpha of 0.028, the third runs 5 more at 0.026, and the remaining 27 passes each run 5 iterations at an alpha 0.002 lower than the pass before, ending on the 30th pass at a fixed alpha of -0.028. That's a nonsense ending value, since the learning rate should never be negative, at the end of a nonsense progression. Even with a big dataset, these 150 iterations, nearly half of them at negative alpha values, would likely yield weird results.)
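
The alpha progression described in the last point can be checked with plain arithmetic. The sketch below does not call gensim at all. It only replays the bookkeeping of the question's outer loop, assuming (as described above) that each train() call decays the learning rate linearly from alpha to min_alpha. The function name alpha_schedule and its parameters are hypothetical, introduced purely for illustration.

```python
def alpha_schedule(start_alpha=0.03, start_min_alpha=0.025, step=0.002, loops=30):
    """Replay the question's outer loop (hypothetical helper, no gensim involved).

    Returns one (alpha, min_alpha) pair per train() call, i.e. the learning
    rate at the start and at the end of that call.
    """
    alpha, min_alpha = start_alpha, start_min_alpha
    schedule = []
    for _ in range(loops):
        # Record the range this train() call would use; round to hide float noise.
        schedule.append((round(alpha, 3), round(min_alpha, 3)))
        alpha -= step       # the question's manual decrement
        min_alpha = alpha   # the question's "no decay" pin
    return schedule


schedule = alpha_schedule()
# 1-based loop numbers whose starting alpha is already below zero.
negative_loops = [i + 1 for i, (start, _) in enumerate(schedule) if start < 0]

print(schedule[0])          # (0.03, 0.025): the only pass with any decay
print(schedule[1])          # (0.028, 0.028): fixed from here on
print(schedule[-1])         # (-0.028, -0.028): the 30th pass
print(len(negative_loops))  # 14 passes train with a negative learning rate
```

Under those assumptions, alpha reaches zero on the 16th pass and is negative for passes 17 through 30, which is why leaving alpha and min_alpha at their defaults and calling train() once is the safer pattern.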