I am working on text embedding in python. Where I found the similarity between two documents with the Doc2vec model. the code is as follows:
for doc_id in range(len(train_corpus)):
inferred_vector = model.infer_vector(train_corpus[doc_id].words) # it takes each document words as a input and produce vector of each document
sims = model.docvecs.most_similar([inferred_vector], topn=len(model.docvecs)) # it takes list of all document's vector as a input and compare those with the trained vectors and gives the most similarity of 1st document to other and then second to other and so on .
print('Document ({}): «{}»\n'.format(doc_id, ' '.join(train_corpus[doc_id].words)))
print(u'SIMILAR/DISSIMILAR DOCS PER MODEL %s:\n' % model)
for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(sims)//2), ('LEAST', len(sims) - 1)]:
print(u'%s %s: «%s»\n' % (label, sims[index], ' '.join(train_corpus[sims[index][0]].words)))
now, from these two embedded documents, how can I extract a set of semantically similar words of those particular documents.
please, help me out.
Only some Doc2Vec
modes also train word-vectors: dm=1
(the default), or dm=0, dbow_words=1
(DBOW doc-vectors but added skip-gram word-vectors. If you've used such a mode, then there will be word-vectors in your model.wv
property.
A call to model.wv.similarity(word1, word2)
method will give you the pairwise similarity for any 2 words.
So, you could iterate over all the words in doc1
, then collect the similarities to each word in doc2
, and report the single highest similarity for each word.