I am new to word/paragraph embeddings and am trying to understand them via Doc2Vec in gensim. I would like to seek advice on whether my understanding is incorrect. My understanding is that Doc2Vec can potentially return documents with semantically similar content. As a test, I tried the following and have the following questions.
Question 1: I noted that every training run with the exact same parameters and examples results in a model that produces very different results from previous runs (e.g. different vectors and a different ranking of similar documents every time). Why is this so non-deterministic? Can it be reliably used for any practical work?
Question 2: Why am I not getting the tag IDs of the top similar documents instead? Results: [('day', 0.477), ('2016', 0.386), ...
Question 2 answer: The problem was due to using model.most_similar; I should use model.docvecs.most_similar instead.
Please advise if I have misunderstood anything.
Data prep
I created multiple documents, each containing a single sentence, and deliberately made them semantically distinct from one another.
A: It is a fine summer weather, with the birds singing and sun shining bright.
B: It is a lovely day indeed, if only i had a degree in appreciating.
C: 2016-2017 Degree in Earth Science Earthly University
D: 2009-2010 Dip in Life and Nature Life College
Query: Degree in Philosophy from Thinking University from 2009 to 2010
Training
I trained on the documents (tokens as words, a running index as the tag):
tdlist=[]
docstring=['It is a fine summer weather, with the birds singing and sun shining bright.',
           'It is a lovely day indeed, if only i had a degree in appreciating.',
           '2016-2017 Degree in Earth Science Earthly University',
           '2009-2010 Dip in Life and Nature Life College']
counter=1
for para in docstring:
    tokens=tokenize(para)  # This will also strip punctuation
    td=TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(counter))
    tdlist.append(td)
    counter=counter+1

model=gensim.models.Doc2Vec(tdlist, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
    model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)
Inference
I then attempted to infer the query. Although many words in the query are missing from the vocabulary, I would expect the closest document-similarity results to be C and D. But the results only gave me a list of 'words' followed by a similarity score. I am unsure if my understanding is wrong. Below is my code extract.
mydocvector=model.infer_vector(['Degree', 'in', 'Philosophy', 'from', 'Thinking', 'University', 'from', '2009', 'to', '2010'])
print(model.docvecs.most_similar(positive=[mydocvector]))
Doc2Vec doesn't work well on toy-sized datasets - few documents, few total words, few words per document. You'll absolutely want more documents than vector dimensions (size), and ideally tens-of-thousands of documents or more.
The second argument to TaggedDocument should be a list of tags. By supplying a single string-of-an-int, each of its elements (characters) will be seen as a tag. (With just documents 1 to 4 this won't yet hurt, but as soon as you have document 10, Doc2Vec will see it as tags 1 and 0, unless you supply it as ['10'], a single-element list.)
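As a rough sketch of that fix (reusing the docstring and tokenize() from your code, and the same pre-4.0 gensim API), the tagging could look like:

from gensim.models.doc2vec import TaggedDocument

tdlist = []
for i, para in enumerate(docstring):
    tokens = list(tokenize(para))  # same tokenizer as in your code
    # tags is a single-element list, so document 10 keeps the tag '10'
    # rather than being split into tags '1' and '0'
    tdlist.append(TaggedDocument(words=tokens, tags=[str(i + 1)]))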
Yes, to find most-similar documents you use model.docvecs.most_similar() rather than model.most_similar() (which only operates on learned words, if any).
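For example (the query tokens here are just illustrative, and the tags assume the list-style tagging sketched above):

query_vec = model.infer_vector(['degree', 'in', 'philosophy'])
# returns (tag, cosine-similarity) pairs over the trained document tags
print(model.docvecs.most_similar(positive=[query_vec], topn=4))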
You are using dm=0 mode, which is a pretty good starting idea – it's fast and often a top performer. But note that this mode doesn't also train word-vectors. So anything you ask of the model about words, like model['summer'] or model.most_similar('sun'), will give nonsense results based on randomly-initialized but never-trained word-vectors. (If you need words trained too, either add dbow_words=1 to the dm=0 mode, or use a dm=1 mode. But for pure doc-vectors, dm=0 is a pretty good choice.)
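As a sketch of the three options, using the same parameter names your code already uses (and assuming your gensim version supports dbow_words):

# pure DBOW: doc-vectors only, word-vectors left untrained
model = gensim.models.Doc2Vec(tdlist, dm=0, size=20, min_count=0)

# DBOW plus interleaved skip-gram word-vector training
model = gensim.models.Doc2Vec(tdlist, dm=0, dbow_words=1, size=20, min_count=0)

# PV-DM: word-vectors are trained as part of doc-vector training
model = gensim.models.Doc2Vec(tdlist, dm=1, size=20, min_count=0)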
There's no need to call train() in a loop - or indeed at all, given the line above it. The form you've used to instantiate Doc2Vec, with an actual corpus tdlist as the first argument, already triggers model setup and training, using the default number of iter passes (5) and the supplied alpha and min_alpha. Now, for Doc2Vec training you often want more passes (10 to 20 are common, though smaller datasets might benefit from even more). And for any training, for proper gradient descent, you want the effective learning rate alpha to gradually decline to a negligible value, such as the default 0.0001 (rather than being forced to stay at its starting value).
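So, as a minimal sketch (assuming the same gensim version, where the number of passes is the iter parameter and the default alpha/min_alpha already decay from 0.025 down to 0.0001):

# one call: builds the vocabulary and trains for 20 passes with a decaying alpha
model = gensim.models.Doc2Vec(tdlist, dm=0, size=20, min_count=0, iter=20)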
The only situation where you'd usually call train() explicitly is if you instantiate the model without a corpus. In that case, you'd need to first call model.build_vocab(tdlist) (to let the model initialize with a discovered vocabulary), and then some form of train() - but you'd still need only one call to train(), supplying the desired number of passes. (Allowing the default model.iter of 5 passes, inside an outer loop of 200 iterations, means a total of 1000 passes over the data... and all at the same fixed alpha, which is not proper gradient descent.)
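If you do want that explicit control, a sketch of the corpus-less pattern (same API assumptions as above) would be:

model = gensim.models.Doc2Vec(dm=0, size=20, min_count=0, iter=20)
model.build_vocab(tdlist)  # discover the vocabulary first
model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)  # one call, 20 passes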
When you have a beefier dataset, you may find results improve with a higher min_count. Usually words that appear only a few times can't contribute much meaning, and thus only serve as noise slowing training and interfering with other vectors becoming more expressive. (Don't assume "more words must equal better results".) Throwing out the singletons, or more, usually helps.
Regarding inference, almost none of the words in your inference text are in the training set. (I only see 'Degree', 'in', and 'University' repeated.) So in addition to all the issues above, inferring a good vector for the example text would be hard. With a richer training set, you'd likely get better results. It also often helps to increase the steps optional parameter to infer_vector() far above its default of 5.
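For example (steps is the parameter name in the gensim version your code appears to use; later releases renamed it to epochs):

query_tokens = ['Degree', 'in', 'Philosophy', 'from', 'Thinking', 'University', 'from', '2009', 'to', '2010']
mydocvector = model.infer_vector(query_tokens, steps=50)  # many more inference passes than the default 5
print(model.docvecs.most_similar(positive=[mydocvector]))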