I'm trying to calculate a between-topic cosine similarity score from a Gensim
LDA topic model, but this proves more complicated than I first expected.
Gensim has a method to calculate distances between topics, model.diff(model), but unfortunately cosine distance is not implemented. It does offer Jaccard distance, but that is too dependent on vector length: the distance between the same two topics is higher when comparing their top 100 most important words than their top 500, and it is 0 when full-length vectors are compared, since every topic includes all terms, just with different probabilities.
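That length dependence is easy to demonstrate with a toy example (a hypothetical vocabulary and two shuffled rankings, not output from a real model):

```python
import random

# Toy demonstration (hypothetical vocabulary, not from a real model) of how
# Jaccard distance depends on how many top words per topic are compared.
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

vocab = [f"word{i}" for i in range(1000)]

# Two "topics" over the same vocabulary, ranked differently.
rng = random.Random(42)
topic_a = vocab[:]
topic_b = vocab[:]
rng.shuffle(topic_b)

# The distance shrinks as more top words are compared, reaching 0 for the
# full vocabulary (identical word sets; only the probabilities differ).
for n in (100, 500, 1000):
    print(n, round(jaccard_distance(topic_a[:n], topic_b[:n]), 3))
```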
My problem is that the output from the model looks like this (only the top 4 words shown):
(30, '0.008*"tax" + 0.004*"cut" + 0.004*"bill" + 0.004*"spending"')
(18, '0.009*"candidate" + 0.009*"voter" + 0.009*"vote" + 0.009*"election"')
(42, '0.047*"shuttle" + 0.034*"astronaut" + 0.026*"launch" + 0.025*"orbit"')
(22, '0.023*"boat" + 0.020*"ship" + 0.015*"migrant" + 0.013*"vessel"')
So, in order to calculate the cosine sim/distance, I would have to parse the second element of each tuple (i.e., the '0.008*"tax" + ...' part), which encodes the term probabilities.
I was wondering whether there is an easier way to get cosine similarity out of the model, or whether parsing each individual term/probability string is really the only way to go?
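If parsing really is the only option, I assume something like this sketch would do (it assumes the '0.008*"tax" + ...' format shown above; the example tuples are copied from my output):

```python
import re

# Sketch of the fallback approach: parse gensim's string representation
# (as printed above) back into {word: probability} dicts.
topic_strings = [
    (30, '0.008*"tax" + 0.004*"cut" + 0.004*"bill" + 0.004*"spending"'),
    (18, '0.009*"candidate" + 0.009*"voter" + 0.009*"vote" + 0.009*"election"'),
]

def parse_topic(topic_str):
    # Each term looks like 0.008*"tax"; capture the probability and the word.
    return {word: float(prob)
            for prob, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)}

topics = {tid: parse_topic(s) for tid, s in topic_strings}
print(topics[30])  # {'tax': 0.008, 'cut': 0.004, 'bill': 0.004, 'spending': 0.004}
```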
Thanks for the help.
The get_topics() method gives you a full dense array where each row is a topic and each column is a vocabulary word. So you may be able to calculate topic-to-topic cosine similarities roughly like:
from sklearn.metrics.pairwise import cosine_similarity

topics = lda_model.get_topics()  # shape: (num_topics, vocab_size)

# sklearn expects 2-D inputs, so wrap single rows in a list
sim_18_to_30 = cosine_similarity([topics[18]], [topics[30]])  # topic 18 vs topic 30

all_sims = cosine_similarity(topics)  # (num_topics, num_topics) pairwise similarities
(I haven't checked this code on a live model; exact required shapes/etc may be off.)
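If you'd rather avoid the sklearn dependency, the same numbers fall out of plain NumPy: normalize each topic row to unit length and take dot products. (The toy 2-topic matrix below just stands in for get_topics() output.)

```python
import numpy as np

# Plain-NumPy equivalent of sklearn's cosine_similarity: normalize each topic
# row to unit length, then pairwise cosine similarities are row dot products.
# The toy matrix stands in for lda_model.get_topics().
topics = np.array([
    [0.8, 0.1, 0.1],   # topic 0
    [0.1, 0.8, 0.1],   # topic 1
])

norms = np.linalg.norm(topics, axis=1, keepdims=True)
unit = topics / norms
all_sims = unit @ unit.T   # (num_topics, num_topics); diagonal is 1.0

print(all_sims)
```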