java machine-learning topic-modeling mallet

Mallet DMR negative probability for feature-based topic-distribution?


I've created a DMR topic model (via the Java API) which calculates the topic distribution based on the publication year of the documents.

The resulting distribution is a bit confusing, because there are a lot of negative probabilities. Sometimes all probabilities for a whole topic are negative. See:

[image: screenshot of the DMR output]

Q1: Why are there negative values? The lowest possible probability in a topic distribution for a given feature should be 0.0, I guess?

Additionally, I built an LDA model whose ModelLogLikelihood seems unrealistic. I trained the model on nearly 4 million documents with 20 topics, alpha = 1.0, beta = 0.01, and 1000 iterations.

This results in a model log-likelihood of -8.895651309362761E8.

Q2: Can this value be correct? Or am I doing something wrong?


Solution

  • Thanks for using DMR! LDA assumes that the prior for the topic distribution for each document is a Dirichlet distribution. The parameters for a K-dimensional Dirichlet are K non-negative real numbers. DMR-LDA generates a document-specific prior based on the properties of a document.

    Q1: These are not probabilities, they are regression coefficients. If you have a document with feature 2014, the value of the Dirichlet parameter for topic 1 is given by the expression exp(-4.5 + -0.25). This is the default parameter plus the offset for 2014, exponentiated to make it non-negative. These values are equivalent to about 0.011 for the default value with no additional features, and about 0.0087 (78% of the default) for 2014.

    Q2: This is a common confusion! The key is that this is a log probability. The log function crosses 0 at 1, since anything raised to the power 0 is 1, so the log of any value less than 1 is negative. Since all probabilities are less than or equal to one, all log probabilities are zero or negative. The other thing that often surprises people is how large in magnitude the log probabilities are. Let's say you have a language model where each word token is independent, and the probability of a given word is usually around 1/1000. The (natural) log probability of one word is therefore around -7.0. The joint probability of a whole collection is the product of the token probabilities, so the log of that joint probability is the sum of the per-token log probabilities, roughly -7 times the number of tokens. I'm guessing your collection has about 100M tokens?
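
The arithmetic in both answers can be checked with a few lines of plain Java. This is only a sketch: the class and method names (DmrSanityCheck, dirichletParameter, impliedTokenCount) are made up for illustration and are not part of the Mallet API, and the 1/1000 per-token probability is the rough assumption from the Q2 answer, not something measured on the actual model.

```java
// Hypothetical helper, not part of Mallet: reproduces the numbers from Q1 and Q2.
class DmrSanityCheck {

    // Q1: the Dirichlet parameter for one topic is exp(default coefficient
    // plus the offsets of all features present in the document).
    static double dirichletParameter(double defaultCoefficient, double... featureOffsets) {
        double sum = defaultCoefficient;
        for (double offset : featureOffsets) {
            sum += offset;
        }
        return Math.exp(sum);
    }

    // Q2: if every token had the same probability, the total log likelihood
    // would be tokenCount * ln(probability); solve that for tokenCount.
    static double impliedTokenCount(double totalLogLikelihood, double perTokenProbability) {
        return totalLogLikelihood / Math.log(perTokenProbability);
    }

    public static void main(String[] args) {
        // Coefficients taken from the answer: default -4.5, offset for 2014 -0.25.
        double base = dirichletParameter(-4.5);
        double with2014 = dirichletParameter(-4.5, -0.25);
        System.out.printf("default parameter:   %.4f%n", base);
        System.out.printf("parameter for 2014:  %.4f%n", with2014);
        System.out.printf("ratio to default:    %.2f%n", with2014 / base);

        // Log likelihood reported in the question, assumed per-token probability 1/1000.
        double tokens = impliedTokenCount(-8.895651309362761E8, 1.0 / 1000.0);
        System.out.printf("implied token count: %.3e%n", tokens);
    }
}
```

Under the 1/1000 assumption the reported log likelihood corresponds to a collection on the order of 10^8 tokens, which is plausible for nearly 4 million documents, so the value is not a sign of an error.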