nlp, topic-modeling

Can topic models be used on a small text?


I have a single document containing 438 sentences in total (so it is not very big). I am wondering whether I can use a topic modeling system to tell me which sentences are most closely related. Is that possible?

In the papers and discussions I have seen, topic modeling systems are usually applied to very large corpora. I would like to know how accurate they would be on such a small dataset.

My main aim is not topic modeling itself. I want to use it only as a feature (whether two sentences belong to the same topic or not) for another task.

I would also like to know how the topics are determined. Does each topic modeling tool come with a predefined set of topics, or are the topics defined by the user?

Best regards,


Solution

  • Yes, it's possible. Treat every sentence as a separate document and apply a standard topic modelling technique such as Latent Dirichlet Allocation (LDA).

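    The topics are not determined a priori, neither by the tool nor by the user. In LDA, a topic is essentially a probability distribution over terms, learned from the data. The only thing you need to specify in advance is the number of topics. Words that frequently co-occur in the same documents tend to belong to the same topic.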

    To answer your second question: "Meanwhile, my main aim is not to do topic modeling for the text, but I want to use it just as a feature (whether the two sentences belong to the same topic or not) to do another task."...

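    After fitting the model you obtain the theta matrix of size N x K (N := number of documents, here sentences; K := number of topics). Each row of theta is one document's distribution over the K topics. You can then compute a measure such as KL-divergence between pairs of these N distributions to find out which documents are topically related to one another.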
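
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions, not a reference implementation. The function names (`fit_lda_gibbs`, `kl_divergence`, `js_divergence`), the hyperparameter values, the whitespace tokenisation and the four toy sentences are all made up for the example. The sampler is a bare-bones collapsed Gibbs sampler for LDA. The Jensen-Shannon divergence is added as an assumption, because plain KL-divergence is asymmetric and a symmetric score is often more convenient as a pairwise feature. In practice you would more likely use an existing LDA implementation from a library.

```python
import math
import random


def fit_lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. Returns theta as an N x K list."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]        # tokens in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word w is assigned to topic k
    topic_total = [0] * K                      # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(K)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, z = word_id[w], assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta) / (topic_total[k] + V * beta)
                    for k in range(K)
                ]
                z = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions: one row of theta per document.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + K * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(K)])
    return theta


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded variant of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical toy input: each sentence is treated as one document.
sentences = [
    "the cat sat on the mat",
    "a cat and a dog played on the mat",
    "stocks fell as markets closed lower",
    "investors sold stocks when markets fell",
]
docs = [s.split() for s in sentences]
theta = fit_lda_gibbs(docs, num_topics=2)

# Lower divergence means the two sentences have more similar topic mixtures.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(js_divergence(theta[i], theta[j]), 3))
```

Each printed value could serve directly as the "same topic or not" feature, either as a continuous score or after thresholding. The threshold, like the number of topics, would have to be chosen for your data.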