I have a single document containing 438 sentences in total (so it is not very big). I am wondering whether I can use a topic modeling system to tell me which sentences are more closely related. Is this possible?
From what I have seen in papers and discussions about topic modeling, these systems usually work on very large corpora, so I would like to know how accurate they would be on such a small dataset.
Also, my main aim is not to do topic modeling for the text itself; I want to use it only as a feature (whether two sentences belong to the same topic or not) for another task.
I would also like to know how the topics are determined. Is there a pre-defined set of topics in each topic modelling tool, or are they user-defined?
Best Regards,
Yes, it's possible. Treat every sentence as a document in a standard topic modelling technique such as Latent Dirichlet Allocation (LDA).
The topics are not determined a priori. In LDA, a topic is essentially a distribution over terms; you only need to pre-specify the number of topics. Words that co-occur frequently tend to end up in the same topic.
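As a minimal sketch of this idea, here is how you might fit LDA with scikit-learn, treating each sentence as its own "document". The sentences and the topic count (2) are illustrative placeholders, not part of your data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for your 438 sentences
sentences = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stocks fell on the market today",
    "investors sold shares amid market fears",
]

# Bag-of-words counts, one row per sentence
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(sentences)

# K (n_components) must be chosen up front;
# the topics themselves are learned from the data
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # N x K document-topic matrix

# Each row of theta is that sentence's distribution over the K topics
print(theta.shape)  # (4, 2)
```

With only 438 short documents the fitted topics will be noisy, so it is worth experimenting with the number of topics and with preprocessing (stop-word removal, lemmatization).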
To answer your second question: "Also, my main aim is not to do topic modeling for the text itself; I want to use it only as a feature (whether two sentences belong to the same topic or not) for another task."
After computing the theta matrix (N x K, where N is the number of documents and K the number of topics), you can compute divergence measures such as KL divergence on these N distributions (one per document) to determine which documents are topically related to one another.
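A small sketch of that last step, using a hypothetical theta matrix (in practice it comes from the fitted model). It uses Jensen-Shannon distance, a symmetric, bounded relative of KL divergence, which is often more convenient for pairwise comparison:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical document-topic matrix (N=3 sentences, K=2 topics)
theta = np.array([
    [0.90, 0.10],
    [0.85, 0.15],
    [0.10, 0.90],
])

# Pairwise Jensen-Shannon distance between topic distributions
n = theta.shape[0]
dist = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        dist[i, j] = jensenshannon(theta[i], theta[j])

# Low distance means the two sentences are topically similar;
# here sentences 0 and 1 are close, sentence 2 is not
print(dist)
```

You could then threshold this distance (or simply compare argmax topics) to produce the binary "same topic or not" feature for your downstream task.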