lda topic-modeling topicmodels

Is it possible to use topic modeling for a single document?


Is it reasonable to use topic modelling on a single document? More precisely, is it mathematically sound to use LDA with Gibbs sampling on a single document? If so, what should the values of k and seed be? Also, what role do k and seed play, both for a single document and for a large set of documents?

k and seed are arguments of the LDA function (in RStudio). Please also let me know if I am wrong anywhere in this question.

About my project: I am trying to find the main topics that can be used to represent the content of a single document.

I have already tried k = 4, 7 and 10. Part of my question is also which value of k would be better.


Solution

  • It really depends on the document. A document could be a 700-page book or a single sentence. Your k (I think you mean the number of topics?) is also going to depend on the document. If your document is the entire Wikipedia corpus, 1500 topics might be appropriate; if your document is a list of comments about movies, then 20 topics might be appropriate. Optimizing that number can be done using the elbow method; check out 17.

    The seed can be pretty much arbitrary; it's just a lever so your results can be replicated, and the model still runs if you leave it blank. I would say try it, check your coherence and eyeball your topics; if they look right, then sure, you can train an LDA on one document. A single document should process pretty fast.

    Here is an example in Python of using the seed parameter. My data set is 1,048,575 rows; note that the seed is much higher:

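    import gensim  # LdaMallet is in gensim.models.wrappers in gensim 3.x (removed in 4.0)

    # mallet_path, bow_corpus and dictionary are assumed to be defined already
    ldamallet = gensim.models.wrappers.LdaMallet(
        mallet_path, corpus=bow_corpus, num_topics=20, alpha=0.1,
        id2word=dictionary, iterations=1000, random_seed=569356958)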
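
To make the roles of k and seed a bit more concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in plain Python, run on a single document. It is not the topicmodels or gensim implementation. The names `lda_gibbs` and `top_words`, the sample text, and the defaults for `alpha`, `beta` and `iterations` are all assumptions made up for illustration. It only shows that k fixes how many topics are fitted and that the seed fixes the random draws, so the same seed gives the same result.

```python
import random
from collections import Counter


def lda_gibbs(docs, k, seed, iterations=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns a list of k Counters holding the word counts assigned to each topic.
    alpha, beta and iterations are illustrative defaults, not tuned values.
    """
    rng = random.Random(seed)  # the seed only controls the random draws
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]         # tokens per topic in each document
    topic_word = [Counter() for _ in range(k)]  # word counts per topic
    topic_total = [0] * k                       # total tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(k)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(k)
                ]
                z = rng.choices(range(k), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return topic_word


def top_words(topic_word, n=3):
    """Most frequent words per topic, skipping words with a zero count."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


# A single made-up document, already lowercased and tokenized.
document = (
    "cats chase mice and cats sleep while dogs chase cats "
    "stocks rise and bonds fall while markets price stocks and bonds"
).split()

for k in (2, 3):  # k is the number of topics the model is asked to find
    print(k, top_words(lda_gibbs([document], k=k, seed=42)))
```

Running it twice with the same seed prints the same topics; changing the seed may shuffle them, which is why eyeballing the topics across a few values of k is still worthwhile.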