I use gensim LDA topic modelling to find topics for each document and to check the similarity between documents by comparing the resulting topic vectors with cosine similarity. Each document gets a different number of matching topics, so the comparison fails: cosine similarity requires vectors of the same length.
This is the related code:
from gensim import models

# dictionary, bow_corpus and filtered_texts are built earlier from the preprocessed documents
lda_model_bow = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3, passes=1, random_state=47)

# --------------- Calculating and viewing the topics ----------------------------
vec_bows = [dictionary.doc2bow(filtered_text.split()) for filtered_text in filtered_texts]
vec_lda_topics = [lda_model_bow[vec_bow] for vec_bow in vec_bows]
for doc_id, vec_lda_topic in enumerate(vec_lda_topics):
    print('document', doc_id, 'topics:', vec_lda_topic)
The output vectors are:
document 0 topics: [(1, 0.25697246), (2, 0.08026043), (3, 0.65391296)]
document 1 topics: [(2, 0.93666667)]
document 2 topics: [(2, 0.07910537), (3, 0.20132676)]
.....
As you can see, each vector has a different length, so it is not possible to compute cosine similarity between them.
I would like the output to be:
document 0 topics: [(1, 0.25697246), (2, 0.08026043), (3, 0.65391296)]
document 1 topics: [(1, 0.0), (2, 0.93666667), (3, 0.0)]
document 2 topics: [(1, 0.0), (2, 0.07910537), (3, 0.20132676)]
.....
Any ideas how to do it? Thanks.
So, as panktijk says in the comment and as this topic also suggests, the solution is to change minimum_probability from its default value of 0.01 to 0.0.
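For illustration, here is a minimal sketch of that fix, assuming the same dictionary, bow_corpus and filtered_texts as in the question; the small cosine helper at the end is not part of gensim, it is just one way to compare the resulting equal-length vectors:

import numpy as np
from gensim import models

# minimum_probability=0.0 keeps (practically) every topic in the output,
# so each document gets an entry for all num_topics topics.
lda_model_bow = models.LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=3,
                                passes=1, random_state=47, minimum_probability=0.0)

vec_bows = [dictionary.doc2bow(filtered_text.split()) for filtered_text in filtered_texts]
vec_lda_topics = [lda_model_bow[vec_bow] for vec_bow in vec_bows]

# Drop the topic ids and keep only the probabilities; the topics are
# returned in topic-id order, so the vectors line up.
dense_vectors = [np.array([prob for _, prob in topics]) for topics in vec_lda_topics]

def cosine_similarity(a, b):
    # Illustrative helper, not a gensim function.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(dense_vectors[0], dense_vectors[1]))

If you prefer not to retrain, you can get the same full-length output per document from the existing model with lda_model_bow.get_document_topics(vec_bow, minimum_probability=0.0).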