I'm trying to use gensim's (v1.0.1) Doc2Vec to get cosine similarities between documents. This should be relatively simple, but I'm having trouble retrieving a document's vector so that I can compute cosine similarity. When I try to retrieve a document by the label I gave it during training, I get a KeyError.
For example, print(model.docvecs['4_99.txt']) will tell me that there is no such key as 4_99.txt.
However, if I print model.docvecs.doctags, I see entries like this:

'4_99.txt_3': Doctag(offset=1644, word_count=12, doc_count=1)
So it appears that for every document, Doc2Vec is saving each sentence under "document name, underscore, sentence number". So I'm either:

A) training incorrectly, or
B) not understanding how to retrieve the doc vector so that I can compute similarity(d1, d2).

Can anyone help me out here?
Here is how I train my doc2vec:
import os
from random import shuffle

from gensim import utils
from gensim.models import Doc2Vec
from gensim.models.doc2vec import LabeledSentence

# Obtain txt abstracts and txt patents
filedir = os.path.abspath(os.path.join(os.path.dirname(__file__)))
files = os.listdir(filedir)

# Doc2Vec takes [['a', 'sentence'], 'and label']
docLabels = [f for f in files if f.endswith('.txt')]
sources = {}  # e.g. {'2_139.txt': '2_139.txt'}
for label in docLabels:
    sources[label] = label

sentences = LabeledLineSentence(sources)
model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4, negative=5, workers=8)
model.build_vocab(sentences.to_array())
for epoch in range(10):
    model.train(sentences.sentences_perm())
model.save('./a2v.d2v')
This uses the following class:
class LabeledLineSentence(object):
    def __init__(self, sources):
        self.sources = sources
        flipped = {}
        # make sure that keys are unique
        for key, value in sources.items():
            if value not in flipped:
                flipped[value] = [key]
            else:
                raise Exception('Non-unique prefix encountered')

    def __iter__(self):
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    yield LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no])

    def to_array(self):
        self.sentences = []
        for source, prefix in self.sources.items():
            with utils.smart_open(source) as fin:
                for item_no, line in enumerate(fin):
                    self.sentences.append(LabeledSentence(utils.to_unicode(line).split(), [prefix + '_%s' % item_no]))
        return self.sentences

    def sentences_perm(self):
        shuffle(self.sentences)
        return self.sentences
I got this class from a web tutorial (https://medium.com/@klintcho/doc2vec-tutorial-using-gensim-ab3ac03d3a1) to help me get around Doc2Vec's weird data-formatting requirements, and to be honest I don't completely understand it. It does look like this class is adding the _n suffix to each sentence's label, but in the tutorial they still seem to retrieve the document vector by just giving it the filename... So what am I doing wrong here?
The gensim Doc2Vec class uses exactly the document 'tags' you've passed it during training as keys to the doc-vectors.
And yes, that LabeledLineSentence class is adding _n to the document tags. Specifically, those appear to be the line numbers from the associated files.

So you'll have to request vectors using the same keys that were provided during training, with the _n suffix, if what you really want is a vector per line.
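For example, using the tag format from your doctags output (the exact tags depend on your files, so these are illustrative), retrieval and similarity lookups would look like this:

vec = model.docvecs['4_99.txt_3']  # vector for line 3 of file 4_99.txt
sim = model.docvecs.similarity('4_99.txt_3', '4_99.txt_4')  # cosine similarity between two line-vectors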
If you instead want each file to be its own document, you'll need to change the corpus class to use the whole file as a single document. Looking at the tutorial you reference, it appears they have a second LabeledLineSentence class that isn't line-oriented (but is still named that way), and you're not using that variant.
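A minimal sketch of such a whole-file corpus class, assuming the same utils and LabeledSentence imports as your code (the name LabeledDocs is my own):

class LabeledDocs(object):
    def __init__(self, sources):
        self.sources = sources  # {filename: tag}

    def __iter__(self):
        for source, tag in self.sources.items():
            with utils.smart_open(source) as fin:
                # the whole file becomes one document with a single tag
                yield LabeledSentence(utils.to_unicode(fin.read()).split(), [tag])

After training on a corpus like this, model.docvecs['4_99.txt'] would work the way you expected.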
Separately, you don't need to loop and call train() multiple times, or manually adjust the alpha learning rate. That's almost certainly not doing what you intend in any recent version of gensim, where train() already iterates over the corpus multiple times. In the most recent versions of gensim you'll even get an error if you call it that way, since many outdated examples on the web encourage this mistake.

Just call train() once; it will iterate over your corpus the number of times specified when the model was constructed. (The default is 5, but that's controllable with the iter initialization parameter, and 10 or more passes are common with Doc2Vec corpuses.)
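Keeping your variable names, a sketch of that correction under the gensim 1.0.x API (newer versions also require explicit total_examples and epochs arguments):

model = Doc2Vec(min_count=1, window=10, size=100, sample=1e-4,
                negative=5, workers=8, iter=10)  # iter sets the number of training passes
model.build_vocab(sentences.to_array())
model.train(sentences.to_array())  # one call; train() makes `iter` passes internally
model.save('./a2v.d2v')

# In gensim 2.x and later, the equivalent call is:
# model.train(sentences.to_array(), total_examples=model.corpus_count, epochs=model.iter)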