I have applied Doc2vec to convert documents into vectors.After that, I used the vectors in clustering and figured out the 5 nearest/most similar document to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is is there any way to figure out the most dominat or simlar terms/word of a document in Doc2vec . I am using python's gensim package for the Doc2vec implementaton
To find out the most dominant words of your clusters, you can use any of these two classic approaches. I personally found the second one very efficient and effective for this purpose.
Latent Drichlet Allocation (LDA): A topic modelling algorithm that will give you a set of topic given a collection of documents. You can treat the set of similar documents in the clusters as one document and apply LDA to generate the topics and see topic distributions across documents.
TF-IDF: TF-IDF calculate the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/ngrams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF then are you keywords. So:
Take a look at Section 5.1 here for more details on the use of TF-IDF.