Here is the code I use to train my doc2vec model:
from gensim.models.doc2vec import Doc2Vec
from FileDocIterator import FileDocIterator
doc_file_name = 'doc_6million.txt'
docs = FileDocIterator(doc_file_name)
print "Fitting started"
model = Doc2Vec(docs, size=100, window=5, min_count=5, negative=20, workers=6, iter=4)
print "Saving model"
model.save("doc2vec_model")
print "model saved"
Now let's take a look at FileDocIterator:
import json
from gensim.models.doc2vec import TaggedDocument
from gensim.models import Phrases
class FileDocIterator(object):
    def __init__(self, fileName):
        self.fileName = fileName
        self.phrase = Phrases.load("phrases")
    def __iter__(self):
        for line in open(self.fileName):
            jsData = json.loads(line)
            yield TaggedDocument(words=jsData["data"], tags=jsData["id"])
I do understand that phrases isn't being used in this implementation, but bear with me here. Let's take a look at what the data looks like. Here is the first data point:
{"data":["strategic","and","analytical","technical","program","director","and","innovator","who","inspires","calculated","risk-taking","in","emerging","technologies",",","such","as","cyber","security",",","risk",",","analytics",",","big","data",",","cloud",",","mobility","and","3d","printing",".","known","for","growing","company","profit","through","innovative","thinking","aimed","at","improving","employee","productivity","and","providing","solutions","to","private","industry","and","government","customers",".","recognized","for","invigorating","creative","thinking","and","collaboration","within","large","companies","to","leverage","their","economies","of","scale","to","capture","market","share",".","successful","in","managing","the","risk","and","uncertainty","throughout","the","innovation","lifecycle","by","leveraging","an","innovation","management","framework","to","overcome","barriers",".","track","record","of","producing","results","in","competitive",",","rapidly","changing","environments","where","innovation","and","customer","satisfaction","is","the","business",".","competencies","include",":","innovation","management","cyber",",","risk",",","analytics",",","cloud","computing","and","mobility","technology","development","security","compliance",":","dod/ic","(","nispom",",","icd","503",",","fedramp",")","commercial","(","iso/iec","27002",",","pci","dss",")","relationship","management",":","dod",",","public","sector","and","intelligence","community","change","management","it","security","&","risk","management","(","cissp",")","program",",","product","&","portfolio","management","(","pmp",")","data","analytics","management","(","cchd",")","itil","service",
"management","(","itilv3-expert",")"],
"id":"55c37f730d03382935e12767"}
My understanding is that the id, 55c37f730d03382935e12767, should be the id of the document, so doing the following ought to give me back a doc vector:
model.docvecs["55c37f730d03382935e12767"]
Instead, this is the output:
>>> model.docvecs["55c37f730d03382935e12767"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 341, in __getitem__
    return self.doctag_syn0[self._int_index(index)]
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 315, in _int_index
    return self.max_rawint + 1 + self.doctags[index].offset
KeyError: '55c37f730d03382935e12767'
Trying to get the most similar documents fails as well:
>>> model.docvecs.most_similar("55c37f730d03382935e12767")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/doc2vec.py", line 450, in most_similar
    raise KeyError("doc '%s' not in trained set" % doc)
KeyError: "doc '55c37f730d03382935e12767' not in trained set"
What I'm trying to understand is how doc vectors are saved and which ids are used. What part of my approach above isn't working?
Now here's something interesting: if I do the following, I get back similar doc vectors, but they have no meaning to me.
>>> model.docvecs.most_similar(str(1))
[(u'8', 0.9000369906425476), (u'3', 0.8878246545791626), (u'7', 0.886141836643219), (u'2', 0.8834314942359924), (u'e', 0.8812381029129028), (u'a', 0.8648831248283386), (u'd', 0.8587037920951843), (u'0', 0.8413013219833374), (u'4', 0.8385311365127563), (u'c', 0.8290119767189026)]
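TaggedDocument.tags should be a list of tags, not a string. When you provide a string, the library iterates over it as a list of characters, so each single character is interpreted as a document tag. That is why most_similar(str(1)) returned single hexadecimal digits such as u'8' and u'e': they are the individual characters of your ids. Change your line: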
yield TaggedDocument(words=jsData["data"], tags=jsData["id"])
...to...
yield TaggedDocument(words=jsData["data"], tags=[jsData["id"]])
...and you will likely see the expected results.
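
To see why the bare string misbehaves without retraining anything, here is a minimal sketch written for Python 3. It does not import gensim: the `TaggedDocument` below is a namedtuple stand-in with the same two fields, `as_tag_list` is a hypothetical helper that is not part of gensim, and the temporary file holds a shortened, made-up version of the record from the question.

```python
import json
import os
import tempfile
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument, so the sketch
# runs without gensim installed (an assumption for illustration only).
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def as_tag_list(tags):
    """Hypothetical helper: wrap a lone string tag in a one-element list."""
    if isinstance(tags, str):
        return [tags]
    return list(tags)


class FileDocIterator(object):
    """Yield one TaggedDocument per JSON line shaped like {"data": [...], "id": "..."}."""

    def __init__(self, file_name):
        self.file_name = file_name

    def __iter__(self):
        with open(self.file_name) as handle:
            for line in handle:
                record = json.loads(line)
                yield TaggedDocument(words=record["data"],
                                     tags=as_tag_list(record["id"]))


doc_id = "55c37f730d03382935e12767"

# Iterating over a bare string yields its characters, so these are the
# "tags" a consumer sees when tags=doc_id is passed without a list.
tags_seen_from_string = sorted(set(doc_id))
print(tags_seen_from_string)

# Write a shortened sample record to a temporary file and read it back.
sample = {"data": ["strategic", "and", "analytical"], "id": doc_id}
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
try:
    tmp.write(json.dumps(sample) + "\n")
    tmp.close()
    docs = list(FileDocIterator(tmp.name))
finally:
    os.remove(tmp.name)

# With the list wrapper, the whole id survives as a single tag.
print(docs[0].tags)
```

The first print shows only single characters, which matches the odd keys returned by most_similar in the question; the second shows the full id as one tag. After retraining with the corrected iterator, the lookup by full id should work as the answer describes.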