Question about Latent Dirichlet Allocation (MALLET)


Honestly, I'm not familiar with LDA, but am required to use MALLET's topic modeling for one of my projects.

My question is: given a set of documents from a specific time window as the training data for the topic model, how appropriate is it to use the model (via the inferencer) to track topic trends for documents dated before or after the training window? In other words, are the topic distributions produced by MALLET a suitable metric for tracking the popularity of topics over time if, during the model-building stage, we only provide a subset of the dataset I am required to analyze?

thanks.


Solution

  • Are you familiar with Latent Semantic Indexing (LSI)? Latent Dirichlet Allocation is just a different way of doing the same kind of thing, so LSI or pLSI may be an easier starting point for understanding the goals of LDA.

    All three techniques lock on to topics in an unsupervised fashion (you tell it how many topics to look for), and then assume that each document covers each topic in varying proportions. Depending on how many topics you allocate, they may behave more like subfields of whatever your corpus is about, and may not be as specific as the "topics" that people think about when they think about trending topics in the news.

    Somehow I suspect that you want to assume that each document represents a particular topic. LSI/pLSI/LDA don't do this -- they model each document as a mixture of topics. That doesn't mean you won't get good results, or that this isn't worth trying, but I suspect (though I don't have a comprehensive knowledge of LSI literature) that you'd be tackling a brand new research problem.

    (FWIW, I suspect that clustering methods like k-means more readily model the assumption that each document has exactly one topic.)
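
To make the mixture-versus-single-topic distinction concrete, here is a minimal sketch of two ways one might turn per-document topic proportions into a trend over time. It does not call MALLET. The `DOCS` list is made-up data standing in for whatever (time period, topic proportions) pairs you extract from the inferencer's output, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: (time period, topic proportions for one document).
# Real values would come from your own parsing of the inferencer output.
DOCS = [
    ("2010-01", [0.70, 0.20, 0.10]),
    ("2010-01", [0.50, 0.30, 0.20]),
    ("2010-02", [0.20, 0.60, 0.20]),
    ("2010-02", [0.35, 0.45, 0.20]),
]


def mean_topic_share(doc_topics):
    """Soft view: average each topic's proportion over a period's documents."""
    totals = {}
    counts = defaultdict(int)
    for period, proportions in doc_topics:
        if period not in totals:
            totals[period] = [0.0] * len(proportions)
        for k, p in enumerate(proportions):
            totals[period][k] += p
        counts[period] += 1
    return {
        period: [t / counts[period] for t in totals[period]]
        for period in sorted(totals)
    }


def dominant_topic_share(doc_topics):
    """Hard view: fraction of a period's documents whose top topic is k."""
    totals = {}
    counts = defaultdict(int)
    for period, proportions in doc_topics:
        if period not in totals:
            totals[period] = [0.0] * len(proportions)
        # Treat the document as if it had exactly one topic (its largest).
        top = max(range(len(proportions)), key=lambda k: proportions[k])
        totals[period][top] += 1.0
        counts[period] += 1
    return {
        period: [t / counts[period] for t in totals[period]]
        for period in sorted(totals)
    }


if __name__ == "__main__":
    for name, trend in (("mean share", mean_topic_share(DOCS)),
                        ("dominant share", dominant_topic_share(DOCS))):
        for period, shares in trend.items():
            print(name, period, [round(s, 3) for s in shares])
```

The soft view keeps the mixture assumption that LDA actually makes, while the hard view forces the one-topic-per-document reading and throws away the secondary topics. Either way, a model trained on one time window can only describe later or earlier documents as mixtures of the topics it already learned, so a topic that first appears outside the training window would not show up as its own trend line.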