Topic Modelling and finding similarity in topics

Problem statement: I have about 20k documents. I need to apply topic modelling to find similar documents, and then analyze those similar documents to see how they differ from each other. Q: Could anyone suggest a topic modelling package that would let me achieve this? I am exploring MALLET and Gensim (Python), but I am not sure which would best fit my requirements.

Any help would be highly appreciated.

Solution

I don't know Gensim (Python), but MALLET could be a solution. Assuming you have Java expertise, it shouldn't be too difficult.

Create a cc.mallet.types.InstanceList with your data and fit a cc.mallet.topics.SimpleLDA model. Then, for each cc.mallet.types.Instance (each Instance is one of your documents), compute a divergence metric to every other Instance. For this, you will need the probability of each topic within each Instance, which is slightly tricky. In SimpleLDA, there is an ArrayList<TopicAssignment> object named data that holds the Instances and their topic assignments (cc.mallet.topics.TopicAssignment). A TopicAssignment contains a cc.mallet.types.LabelSequence called topicSequence, which holds the topic assignment for each word. You will need to loop through this to get counts for each topic. The probability of topic i in document j is then simply (# words assigned to topic i in doc j) / (total words in doc j). Store these probabilities and use them to compute the divergence metric of your choice (e.g., KL divergence).
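To make the counting and divergence steps concrete, here is a minimal, library-independent sketch in Python. It is not MALLET or Gensim code. It assumes you have already extracted, for each document, a plain list of per-word topic ids (in MALLET these would come from topicSequence). The function names, the `eps` guard, and the toy documents are all hypothetical. Plain KL divergence is asymmetric and blows up when a topic has zero probability in the second document, so the sketch also includes Jensen-Shannon divergence as one possible symmetric alternative, which the answer above does not mention.

```python
import math
from collections import Counter


def topic_proportions(topic_assignments, num_topics):
    """Estimate P(topic | doc) as (# words assigned to topic) / (total words)."""
    if not topic_assignments:
        raise ValueError("document has no words")
    counts = Counter(topic_assignments)
    total = len(topic_assignments)
    return [counts.get(k, 0) / total for k in range(num_topics)]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats. eps is an assumed guard against zero entries in q."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and finite even when q has zeros."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy example: two 4-word documents, 3 topics (made-up assignments).
doc_a = topic_proportions([0, 0, 1, 2], num_topics=3)
doc_b = topic_proportions([0, 0, 1, 1], num_topics=3)
print(doc_a, doc_b, js_divergence(doc_a, doc_b))
```

A smaller divergence means the two documents have more similar topic mixtures. Once you have grouped similar documents this way, comparing their per-topic proportions side by side is one way to see where they differ.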