Tags: python, cluster-analysis, gensim, word2vec, doc2vec

How to find most similar terms/words of a document in doc2vec?


I have applied Doc2Vec to convert documents into vectors. After that, I used the vectors in clustering and identified the 5 nearest/most similar documents to the centroid of each cluster. Now I need to find the most dominant or important terms of these documents so that I can figure out the characteristics of each cluster. My question is: is there any way to figure out the most dominant or similar terms/words of a document in Doc2Vec? I am using Python's gensim package for the Doc2Vec implementation.


Solution

  • To find the most dominant words of your clusters, you can use either of these two classic approaches. I personally found the second one very efficient and effective for this purpose.

    • Latent Dirichlet Allocation (LDA): a topic modelling algorithm that will give you a set of topics given a collection of documents. You can treat the set of similar documents in a cluster as one document, apply LDA to generate the topics, and look at the topic distributions across documents (see the gensim sketch after this list).

    • TF-IDF: TF-IDF calculates the importance of a word to a document given a collection of documents. Therefore, to find the most important keywords/n-grams, you can calculate TF-IDF for every word that appears in the documents. The words with the highest TF-IDF are then your keywords. So:

      • calculate IDF for every single word that appears in the documents, based on the number of documents that contain that word
      • concatenate the text of the similar documents (I'd call it a super-document) and then calculate TF for each word that appears in this super-document
      • calculate TF*IDF for every word... and then TA DAAA... you have your keywords associated with each cluster (see the second sketch after this list).

      Take a look at Section 5.1 here for more details on the use of TF-IDF.
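Here is a minimal sketch of the LDA approach using gensim (not part of the original answer; `clusters` is a hypothetical dict mapping a cluster id to the tokenized documents nearest that cluster's centroid, and the hyperparameters are just placeholders):

```python
# Sketch: per-cluster topics via LDA, assuming `clusters` maps
# cluster_id -> list of tokenized documents (lists of word strings).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def cluster_topics(clusters, num_topics=3, top_n=10):
    """Treat each cluster's nearest documents as a small corpus and report its top topic words."""
    results = {}
    for cluster_id, docs in clusters.items():
        dictionary = Dictionary(docs)                        # vocabulary for this cluster
        corpus = [dictionary.doc2bow(doc) for doc in docs]   # bag-of-words representation
        lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, passes=10)
        # show_topics returns (topic_id, [(word, probability), ...]) pairs
        results[cluster_id] = lda.show_topics(num_topics=num_topics,
                                              num_words=top_n, formatted=False)
    return results
```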
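And a minimal sketch of the TF-IDF recipe above (again an illustration, not the original code; `all_docs` is a hypothetical list of all tokenized documents in your collection, `similar_docs` the tokenized documents nearest one cluster's centroid):

```python
# Sketch: score words of a cluster's "super-document" by
# TF (in the super-document) * IDF (over the whole collection).
import math
from collections import Counter

def cluster_keywords(all_docs, similar_docs, top_n=10):
    n_docs = len(all_docs)
    # document frequency: in how many documents does each word appear?
    df = Counter(word for doc in all_docs for word in set(doc))
    idf = {word: math.log(n_docs / count) for word, count in df.items()}

    # concatenate the similar documents into one super-document and count term frequencies
    super_doc = [word for doc in similar_docs for word in doc]
    tf = Counter(super_doc)

    # TF * IDF; words unseen in the collection get a score of 0
    scores = {word: tf[word] * idf.get(word, 0.0) for word in tf}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The words this returns for each cluster are the keywords you can use to characterize it.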