Tags: nlp, topic-modeling, mallet

Make MALLET topic modeling stable


I'm using the MALLET topic-modeling tool and am having difficulty making it stable (the topics I get don't seem very coherent).

I worked through your tutorial and this one: https://programminghistorian.org/en/lessons/topic-modeling-and-mallet#getting-your-own-texts-into-mallet, and I have some questions:

  1. Are there best practices for getting this model to work well, besides the optimize option (and what is a good value for it)? What is a good value for the iterations option?
  2. I import my data with the import-dir command; the directory contains my files. Does it matter whether those files contain text with line breaks or just one very long line?
  3. I read about the hLDA model. When I tried to run it, the only output was state.txt, which is not very clear. I expected output like the topic-modeling model's (topic_keys.txt, doc_topics.txt); how can I get that?
  4. When should I use hLDA rather than standard topic modeling?

Thanks a lot for your help!


Solution

  • Some references for good practices in topic modeling are The Care and Feeding of Topic Models with Jordan Boyd-Graber and Dave Newman and Applied Topic Modeling with Jordan Boyd-Graber and Yuening Hu.

    For hyperparameter optimization, --optimize-interval 20 --optimize-burn-in 50 should be fine; the results don't seem to be very sensitive to the specific values. Convergence for Gibbs sampling is hard to assess; the default of 1000 iterations should be interpreted as "a number large enough that it's probably ok" rather than a specific target.
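    As a concrete sketch of those settings (corpus.mallet, the topic count, and the output file names are placeholders, not values from the question), a train-topics invocation might look like:

    ```shell
    # Hypothetical invocation; input/output names and --num-topics are illustrative.
    mallet train-topics \
      --input corpus.mallet \
      --num-topics 50 \
      --num-iterations 1000 \
      --optimize-interval 20 \
      --optimize-burn-in 50 \
      --output-topic-keys topic_keys.txt \
      --output-doc-topics doc_topics.txt
    ```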

    If you are reading individual documents from files in a directory, lines don't matter. If documents are longer than about 1000 tokens before stopword removal, consider breaking them into smaller segments.
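    One way to do that segmentation before importing (a sketch; the docs/ and segments/ directory names are made up, and the toy one-word corpus is only there to make the example self-contained) is to split each file at roughly 1000-token boundaries. Since line breaks don't matter to MALLET, writing one token per line in the segments is harmless:

    ```shell
    # Demo setup: a toy 2500-token document (replace docs/ with your own corpus)
    mkdir -p docs segments
    yes lorem | head -2500 | tr '\n' ' ' > docs/sample.txt

    # Break each document into ~1000-token segment files before import
    for f in docs/*.txt; do
      base=$(basename "$f" .txt)
      # one token per line, then split every 1000 lines
      tr -s '[:space:]' '\n' < "$f" | split -l 1000 - "segments/${base}_"
    done
    ls segments   # → sample_aa sample_ab sample_ac
    ```

    The segments/ directory can then be passed to import-dir in place of the original one.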

    hLDA is only included because people seem to want it; I don't recommend it for any purpose.