gensim doc2vec documents not found by id


Here is the code I use to train my doc2vec model:

from gensim.models.doc2vec import Doc2Vec
from FileDocIterator import FileDocIterator

doc_file_name = 'doc_6million.txt'
docs = FileDocIterator(doc_file_name)
print "Fitting started"
model = Doc2Vec(docs, size=100, window=5, min_count=5, negative=20, workers=6, iter=4)
print "Saving model"
model.save("doc2vec_model")
print "model saved"

Now let's take a look at FileDocIterator:

import json

from gensim.models.doc2vec import TaggedDocument
from gensim.models import Phrases

class FileDocIterator(object):
    def __init__(self, fileName):
        self.fileName = fileName
        self.phrase = Phrases.load("phrases")

    def __iter__(self):
        for line in open(self.fileName):
            jsData = json.loads(line)
            yield TaggedDocument(words=jsData["data"], tags=jsData["id"])

I do understand that phrases isn't being used in this implementation, but bear with me. Let's take a look at what the data looks like. Here is the first data point:

{"data":["strategic","and","analytical","technical","program","director","and","innovator","who","inspires","calculated","risk-taking","in","emerging","technologies",",","such","as","cyber","security",",","risk",",","analytics",",","big","data",",","cloud",",","mobility","and","3d","printing",".","known","for","growing","company","profit","through","innovative","thinking","aimed","at","improving","employee","productivity","and","providing","solutions","to","private","industry","and","government","customers",".","recognized","for","invigorating","creative","thinking","and","collaboration","within","large","companies","to","leverage","their","economies","of","scale","to","capture","market","share",".","successful","in","managing","the","risk","and","uncertainty","throughout","the","innovation","lifecycle","by","leveraging","an","innovation","management","framework","to","overcome","barriers",".","track","record","of","producing","results","in","competitive",",","rapidly","changing","environments","where","innovation","and","customer","satisfaction","is","the","business",".","competencies","include",":","innovation","management","cyber",",","risk",",","analytics",",","cloud","computing","and","mobility","technology","development","security","compliance",":","dod/ic","(","nispom",",","icd","503",",","fedramp",")","commercial","(","iso/iec","27002",",","pci","dss",")","relationship","management",":","dod",",","public","sector","and","intelligence","community","change","management","it","security","&","risk","management","(","cissp",")","program",",","product","&","portfolio","management","(","pmp",")","data","analytics","management","(","cchd",")","itil","service",
"management","(","itilv3-expert",")"],
"id":"55c37f730d03382935e12767"}

My understanding is that the id, 55c37f730d03382935e12767, should be the id of the document, so the following ought to give me back a doc vector:

model.docvecs["55c37f730d03382935e12767"]

Instead, this is the output:

>>> model.docvecs["55c37f730d03382935e12767"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return self.doctag_syn0[self._int_index(index)]
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 315, in _int_index
    return self.max_rawint + 1 + self.doctags[index].offset
KeyError: '55c37f730d03382935e12767'

Trying to get the most similar documents gives the following:

>>> model.docvecs.most_similar("55c37f730d03382935e12767")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 450, in most_similar
    raise KeyError("doc '%s' not in trained set" % doc)
KeyError: "doc '55c37f730d03382935e12767' not in trained set"

What I'm trying to understand is how doc vectors are saved and which ids are used. What part of my approach above isn't working?

Here's something interesting: if I do the following, I get back similar doc vectors, but they have no meaning to me.

>>> model.docvecs.most_similar(str(1))
[(u'8', 0.9000369906425476), (u'3', 0.8878246545791626), (u'7', 0.886141836643219), (u'2', 0.8834314942359924), (u'e', 0.8812381029129028), (u'a', 0.8648831248283386), (u'd', 0.8587037920951843), (u'0', 0.8413013219833374), (u'4', 0.8385311365127563), (u'c', 0.8290119767189026)]

Solution

  • TaggedDocument.tags should be a list of tags, not a string. If you provide a string, the library treats it as a list of characters, so each single character is interpreted as a document tag. That would explain why most_similar(str(1)) returned tags such as u'8' and u'e': they look like the individual hex characters of your ids. Change your line:

        yield TaggedDocument(words=jsData["data"], tags=jsData["id"])

    ...to...

        yield TaggedDocument(words=jsData["data"], tags=[jsData["id"]])

    ...then retrain the model, and you will likely see the expected results.
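
The difference between a string and a one-element list can be checked without training anything. The sketch below is only an illustration: it uses a namedtuple as a stand-in for gensim's TaggedDocument (assumed here to be a simple words/tags container), and collect_tags is a hypothetical helper that loops over tags the way a per-tag loop would, not gensim's actual code.

```python
from collections import namedtuple

# Stand-in for gensim's TaggedDocument so the example runs without gensim.
# Assumption: the real class is a plain container with `words` and `tags`.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def collect_tags(documents):
    """Hypothetical helper: gather every distinct tag by looping over doc.tags."""
    seen = set()
    for doc in documents:
        for tag in doc.tags:  # iterating over a str yields single characters
            seen.add(tag)
    return seen


doc_id = "55c37f730d03382935e12767"
words = ["strategic", "and", "analytical"]

# tags given as a bare string: every character becomes its own tag
wrong = collect_tags([TaggedDocument(words=words, tags=doc_id)])

# tags given as a one-element list: the whole id is the only tag
right = collect_tags([TaggedDocument(words=words, tags=[doc_id])])

print(sorted(wrong))
print(sorted(right))
```

With the bare string, the full id never appears among the tags, which matches the KeyError in the question; with the list, the id is the only tag.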