This is a question about LDA and the LDAvis package in R. This is my first time using the package, so I would appreciate any help with my research.
I want to view the documents most strongly associated with each topic, based on probability. I am using survey data and treating each response in the comment section as a document.
I am going to use the example "A topic model for movie reviews" by cpsievert, as it is very similar to my code. The full code can be found at the following link:
http://cpsievert.github.io/LDAvis/reviews/reviews.html
I have got as far as fitting the LDA model with the following code:
set.seed(123)
fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab,
                                   num.iterations = G, alpha = alpha,
                                   eta = eta, initial = NULL, burnin = 0,
                                   compute.log.likelihood = TRUE)
I then used LDAvis to create the interactive HTML with the following code:
json <- createJSON(phi = MovieReviews$phi,
                   theta = MovieReviews$theta,
                   doc.length = MovieReviews$doc.length,
                   vocab = MovieReviews$vocab,
                   term.frequency = MovieReviews$term.frequency)
Based on the interactive HTML, I have labelled each topic according to its most frequent terms. There is an example of this for the movie reviews at the following link:
http://cpsievert.github.io/LDAvis/reviews/vis/#topic=7&lambda=0.6&term=
In the movie reviews example, this topic could be labelled as comedies.
So if "topic7" in this example represents comedies, how can I view the reviews that are most likely to belong to this topic?
I would like to know how to assign documents to topic7 and then view them, say using:
View(MovieReviews$Topic7)
I apologise if this question is broad and long, but if someone could answer it by using the example I have given in the link, this would help greatly. Thanks in advance.
I think you may not fully understand what LDA models do and how they work. The LDA model generates a list of k topics, and then tells you which words were assigned to which topics and their respective probabilities of being assigned to each topic. It sounds like what you're really trying to do is document/topic classification rather than word/topic classification, and if that's the case then the lda package isn't going to suit your needs.
If you wanted a really dirty method of document classification based on the lda object, I guess you could, for each document, just return the name of the topic with the greatest number of words assigned to it, though I'd imagine you'd run into issues if there were ties (the probability of ties increases as k and the number of documents increase).
EDIT: The quick and dirty way as requested:
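sums <- fit$document_sums          # K x D matrix: words per topic (rows) in each document (columns)
sums <- t(sums)                    # one row per document, one column per topic
sums <- as.data.frame(sums)        # columns are named V1, V2, ..., VK
topics <- colnames(sums)[max.col(sums, ties.method = "first")]  # busiest topic per document; ties go to the first
sums$topics <- topics              # plain character vector, one entry per document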
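If you want the documents ranked by how strongly they load on one topic, rather than a single label per document, something along these lines might work. It is only a sketch under assumptions: document_sums below is a simulated stand-in for fit$document_sums, the alpha value and the reviews vector are placeholders, and the normalisation follows what I believe the linked movie reviews example does to build theta.

```r
# Toy stand-in for fit$document_sums: K topics (rows) x D documents (columns),
# each cell counting the words in a document assigned to a topic.
set.seed(123)
K <- 8
D <- 20
document_sums <- matrix(rpois(K * D, lambda = 5), nrow = K, ncol = D)
alpha <- 0.02  # hypothetical smoothing value; use the alpha from your own fit

# Per-document topic proportions: one row per document, each row sums to 1.
theta <- t(apply(document_sums + alpha, 2, function(x) x / sum(x)))
colnames(theta) <- paste0("topic", seq_len(K))

# Rank documents by their share of topic 7 and keep the top five.
ranked <- order(theta[, "topic7"], decreasing = TRUE)
top_docs <- head(ranked, 5)

# `reviews` is a placeholder for the original character vector of documents.
reviews <- paste("document", seq_len(D))
topic7 <- data.frame(doc_id = top_docs,
                     prob = theta[top_docs, "topic7"],
                     text = reviews[top_docs],
                     stringsAsFactors = FALSE)
print(topic7)
```

View(topic7) would then open the result in the data viewer. One caveat: as far as I know, createJSON reorders topics by default (reorder.topics = TRUE), so topic 7 in the visualisation is not necessarily row 7 of fit$document_sums; it is worth checking that before trusting the column index.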