Search code examples
rldatopic-modeling

Different dimensions of distributions of topics


I would like to divide all documents in 10 topics, and it goes well with a converged result except for the dimensions of distributions and covariance matrix of topic.
Why the topics distribution is a 9 dimension vector instead of 10 and their covariance matrix is 9*9 matrix instead of 10*10?

I have use library(topicmodels) and function CTM() to implement the topic model in Chinese.

my code is below:

library(rJava);
library(Rwordseg); 
library(NLP);
library(tm); 
library(tmcn)
library(tm)
library(Rwordseg)
library(topicmodels)

installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Law.scel","Law");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\NationalInstitution.scel","NationalInstitution");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Place.scel","Place");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Psychology.scel","Psychology");
installDict("C:\\Users\\Jeffy\\OneDrive\\Workplace\\R\\Politics.scel","Politics");
listDict();

#read file
d.vec <- segmentCN("samgovWithoutID.csv", returnType = "tm")
samgov.segment <- read.table("samgovWithoutID.segment.csv", header = TRUE, fill = TRUE, stringsAsFactors = F, sep = ",",fileEncoding='utf-8')
fix(samgov.segment)

# create DTM(document term matrix)
d.corpus <- Corpus(VectorSource(samgov.segment$content))
inspect(d.corpus[1:10])
d.corpus <- tm_map(d.corpus, removeWords, stopwordsCN())
ctrl <- list(removePunctuation = TRUE, removeNumbers= TRUE, wordLengths = c(1, Inf), stopwords = stopwordsCN(), wordLengths = c(2, Inf))
d.dtm <- DocumentTermMatrix(d.corpus, control = ctrl)
inspect(d.dtm[1:10, 110:112])

# impletment topic models
ctm10<-CTM(d.dtm,k=10, control=list(seed=2014012692))
Terms10 <- terms(ctm10, 10)
Terms10[,1:10]

ctm20<-CTM(d.dtm,k=20, control=list(seed=2014012692))
Terms20 <- terms(ctm20, 20)
Terms20[,1:20]

The result in R Studio (see Highlighted part):

enter image description here

Help document:

enter image description here


Solution

  • A probability distribution over 10 values has 9 free parameters: once I tell you the probability of the first 9, the probability of the last value has to be one minus the sum of those probabilities.

    A 10-dimensional logistic normal distribution is equivalent to sampling a 10-dimensional vector from a Gaussian distribution and then "squashing" that vector by exponentiating it and normalizing it to sum to 1.0. There are an infinite number of 10-dimensional vectors that will exponentiate and normalize to the same 10-dimensional probability distribution -- you just have to add an arbitrary constant c to each value. That's because the mean of the Gaussian has 10 free parameters, one more than the more constrained distribution.

    There are several ways to make the Gaussian "identifiable". One is to fix one of the elements of the mean vector to be 0.0. That's why you see a 9-dimensional mean and covariance matrix: the 10th value is always 0 with no variance.