I am trying to understand how LDA can be used for text retrieval. I am currently using gensim's LdaModel to implement LDA: https://radimrehurek.com/gensim/models/ldamodel.html.
I have managed to identify the k topics and their most-used words, and I understand that LDA models probabilistic distributions of topics across documents and of words within those topics, so that much makes sense.
That said, I do not understand how to use the LdaModel to retrieve the documents that are relevant to a search-query string, e.g. "negative effects of birth control". I have tried inferring the topic distribution of the search query and comparing it against the topic distributions of the corpus documents, using gensim's similarities.MatrixSimilarity to compute cosine similarity, like so:
from gensim import similarities
from gensim.models import LdaModel

lda = LdaModel(corpus, num_topics=10)               # corpus: bag-of-words corpus
index = similarities.MatrixSimilarity(lda[corpus])  # topic vectors of all documents
query = lda[query_bow]                              # topic distribution of the query
sims = index[query]                                 # cosine similarity to each document
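In case it matters, corpus and query_bow above come from roughly this preprocessing (a simplified sketch; documents stands in for my list of raw document strings, and the tokenisation is deliberately naive):

from gensim.corpora import Dictionary

texts = [doc.lower().split() for doc in documents]         # very naive tokenisation
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(tokens) for tokens in texts]
query_bow = dictionary.doc2bow("negative effects of birth control".lower().split())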
But the performance isn't really good. My guess is that the topic distribution of the search query is not very meaningful, because a query usually covers only one topic. But I don't know how else I could implement this with gensim's LdaModel. Any advice would be really appreciated; I am new to topic modeling and maybe I am missing something glaringly obvious? Thanks!
I believe your queries are too short and/or your model has too few topics relative to the query length for what you want to achieve.
If you want to use LDA to find the topics matching a given query, in most cases you will indeed need more than one topic per query to be able to pin down specific documents rather than a whole swath of documents.
Your LDA model above has only 10 topics, so the chances of a given sentence touching more than one topic are very low. As a first step, I would test whether training the model with 100 or 200 topics improves things; with that many topics, a single sentence is much more likely to fall into several of them.
Here is an (oversimplified) example of why this could work:
With num_topics=10 you might have topics like:

topic_1: "pizza", "pie", "fork", "dinner", "farm", ...
topic_2: "pilot", "navy", "ocean", "air", "USA", ...
...

Now if you query the sentence "Tonight at dinner I will eat pizza with a fork", you will only get topic_1 as a response.

However, with num_topics=200, your topics might look more like this:

topic_1: "pizza", "margherita", "funghi", ...
topic_2: "fork", "knife", "spoon", ...
topic_3: "dinner", "date", "lunch", ...

So the same sentence now covers topic_1, topic_2 and topic_3.
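A quick way to check this on your own data is to retrain with more topics and look at how many topics the query actually activates. Something along these lines (a sketch reusing your corpus, dictionary and query_bow; num_topics=200 and the 0.05 cutoff are just starting values to experiment with):

from gensim import similarities
from gensim.models import LdaModel

lda = LdaModel(corpus, id2word=dictionary, num_topics=200)

# how many topics does the short query land on now?
query_topics = lda.get_document_topics(query_bow, minimum_probability=0.05)
print(query_topics)   # with more topics, a short query tends to hit several of them

# same retrieval step as before, just with the bigger model
index = similarities.MatrixSimilarity(lda[corpus], num_features=lda.num_topics)
sims = index[lda[query_bow]]
top_docs = sorted(enumerate(sims), key=lambda pair: -pair[1])[:10]   # best-matching documents

If the printed list still shows a single dominant topic, that supports the "query is just too short" explanation below.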
Whether increasing the number of topics that much will actually make the output good depends a lot on your corpus. For something as large as the English Wikipedia, 200 topics works; for a smaller corpus it is less clear.
And even with more topics, I believe it could still be the case that your query text is just too short.