Tags: python, cluster-analysis, gensim, word2vec, doc2vec

How to find most similar terms/words of a document in doc2vec?


I have applied Doc2Vec to convert documents into vectors. After that, I used the vectors in clustering and identified the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to figure out the most dominant or similar terms/words of a document in Doc2Vec? I am using Python's gensim package for the Doc2Vec implementation.


Solution

  • To find the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.

    • Latent Dirichlet Allocation (LDA): a topic modelling algorithm that will give you a set of topics given a collection of documents. You can treat the set of similar documents in a cluster as one document, apply LDA to generate the topics, and look at the topic distributions across documents (see the gensim sketch after this list).

    • TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:

      • calculate IDF for every single word that appears in the documents, based on the number of documents that contain that word
      • concatenate the text of the similar documents (I'd call it a super-document) and then calculate TF for each word that appears in this super-document
      • calculate TF*IDF for every word... and then TA DAAA... you have your keywords associated with each cluster (see the second sketch after this list).

      Take a look at Section 5.1 here for more details on the use of TF-IDF.
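Here is a minimal sketch of the LDA approach using gensim (not part of the original answer; `clusters` is a hypothetical dict mapping a cluster id to the tokenized documents nearest that cluster's centroid, and the hyperparameters are just placeholders):

```python
# Sketch: per-cluster topics via LDA, assuming `clusters` maps
# cluster_id -> list of tokenized documents (lists of word strings).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def cluster_topics(clusters, num_topics=3, top_n=10):
    """Treat each cluster's nearest documents as a small corpus and report its top topic words."""
    results = {}
    for cluster_id, docs in clusters.items():
        dictionary = Dictionary(docs)                        # vocabulary for this cluster
        corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words representation
        lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=10)
        # show_topics returns (topic_id, [(word, probability), ...]) pairs
        results[cluster_id] = lda.show_topics(num_topics=num_topics,
                                              num_words=top_n, formatted=False)
    return results
```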
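And a minimal sketch of the TF-IDF recipe above (again an illustration, not the original code; `all_docs` is a hypothetical list of all tokenized documents in your collection, `similar_docs` the tokenized documents nearest one cluster's centroid):

```python
# Sketch: score words of a cluster's "super-document" by
# TF (in the super-document) * IDF (over the whole collection).
import math
from collections import Counter

def cluster_keywords(all_docs, similar_docs, top_n=10):
    n_docs = len(all_docs)
    # document frequency: in how many documents does each word appear?
    df = Counter(word for doc in all_docs for word in set(doc))
    idf = {word: math.log(n_docs / count) for word, count in df.items()}

    # concatenate the similar documents into one super-document and count term frequencies
    super_doc = [word for doc in similar_docs for word in doc]
    tf = Counter(super_doc)

    # TF * IDF; words unseen in the collection get a score of 0
    scores = {word: tf[word] * idf.get(word, 0.0) for word in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The words this returns for each cluster are the keywords you can use to characterize it.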