I'm trying to calculate a between-topic cosine similarity score from a Gensim
LDA topic model, but this proves more complicated than I first expected.
Gensim has a method to calculate distances between topics, model.diff(model), but unfortunately cosine distance is not implemented. It does offer Jaccard distance, but that is too dependent on vector length: the distance between the same two topics is higher when comparing their top 100 most important words than their top 500, and it is 0 when full-length vectors are compared, since every topic includes all terms, just with different probabilities.
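That length dependence is easy to demonstrate with a toy example (a hypothetical vocabulary and two shuffled rankings, not output from a real model):

```python
import random

# Toy demonstration (hypothetical vocabulary, not from a real model) of how
# Jaccard distance depends on how many top words per topic are compared.
def jaccard_distance(a, b):
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

vocab = [f"word{i}" for i in range(1000)]

# Two "topics" over the same vocabulary, ranked differently.
rng = random.Random(42)
topic_a = vocab[:]
topic_b = vocab[:]
rng.shuffle(topic_b)

# The distance shrinks as more top words are compared, reaching 0 for the
# full vocabulary (identical word sets; only the probabilities differ).
for n in (100, 500, 1000):
    print(n, round(jaccard_distance(topic_a[:n], topic_b[:n]), 3))
```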
My problem is that the output from the model looks like this (only the top 4 words shown):
(30, '0.008*"tax" + 0.004*"cut" + 0.004*"bill" + 0.004*"spending"')
(18, '0.009*"candidate" + 0.009*"voter" + 0.009*"vote" + 0.009*"election"')
(42, '0.047*"shuttle" + 0.034*"astronaut" + 0.026*"launch" + 0.025*"orbit"')
(22, '0.023*"boat" + 0.020*"ship" + 0.015*"migrant" + 0.013*"vessel"')
So, in order to calculate the cosine sim/distance, I would have to parse the second element of each tuple (i.e., the '0.008*"tax" + ...' part), which encodes the term probabilities.
I was wondering whether there is an easier way to get cosine similarity out of the model, or whether parsing each individual term/probability string is really the only way to go?
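If parsing really is the only option, I assume something like this sketch would do (it assumes the '0.008*"tax" + ...' format shown above; the example tuples are copied from my output):

```python
import re

# Sketch of the fallback approach: parse gensim's string representation
# (as printed above) back into {word: probability} dicts.
topic_strings = [
    (30, '0.008*"tax" + 0.004*"cut" + 0.004*"bill" + 0.004*"spending"'),
    (18, '0.009*"candidate" + 0.009*"voter" + 0.009*"vote" + 0.009*"election"'),
]

def parse_topic(topic_str):
    # Each term looks like 0.008*"tax"; capture the probability and the word.
    return {word: float(prob)
            for prob, word in re.findall(r'([\d.]+)\*"([^"]+)"', topic_str)}

topics = {tid: parse_topic(s) for tid, s in topic_strings}
print(topics[30])  # {'tax': 0.008, 'cut': 0.004, 'bill': 0.004, 'spending': 0.004}
```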
Thanks for the help.
The get_topics() method gives you a full dense array where each row is a topic and each column is a vocabulary word. So you may be able to calculate topic-to-topic cosine similarities roughly like:
from sklearn.metrics.pairwise import cosine_similarity

topics = lda_model.get_topics()  # shape: (num_topics, vocab_size)

# sklearn expects 2-D inputs, so wrap single rows in a list
sim_18_to_30 = cosine_similarity([topics[18]], [topics[30]])  # topic 18 vs topic 30

all_sims = cosine_similarity(topics)  # (num_topics, num_topics) pairwise similarities
(I haven't checked this code on a live model; exact required shapes/etc may be off.)
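If you'd rather avoid the sklearn dependency, the same numbers fall out of plain NumPy: normalize each topic row to unit length and take dot products. (The toy 2-topic matrix below just stands in for get_topics() output.)

```python
import numpy as np

# Plain-NumPy equivalent of sklearn's cosine_similarity: normalize each topic
# row to unit length, then pairwise cosine similarities are row dot products.
# The toy matrix stands in for lda_model.get_topics().
topics = np.array([
    [0.8, 0.1, 0.1],   # topic 0
    [0.1, 0.8, 0.1],   # topic 1
])

norms = np.linalg.norm(topics, axis=1, keepdims=True)
unit = topics / norms
all_sims = unit @ unit.T   # (num_topics, num_topics); diagonal is 1.0

print(all_sims)
```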