Question: How can I compute the frequency of words in each topic in code? My goal is to create a word cloud for each topic.
P.S. I have no problem with wordcloud itself.
From the code,
burnin <- 4000 # We do not collect this.
iter <- 4000
thin <- 500
seed <- list(2017, 5, 63, 100001, 765)
nstart <- 5
best <- TRUE
# Number of topics:
k <- 4
LDA_results <- LDA(DTM, k, method = "Gibbs",
                   control = list(nstart = nstart, seed = seed, best = best,
                                  burnin = burnin, iter = iter, thin = thin))
Thank you. (I have tried to keep the question as concise as possible, so if you need further details, I can add more.)
If you want to create a wordcloud for each topic, what you want are the top terms for each topic, i.e., the most probable words to be generated from each topic. This probability is called beta; it's the per-topic-per-word probability. The higher beta is, the more likely that word is to be generated from that topic.
You can get the beta probabilities out of your LDA topic model as a tidy data frame using tidy() from tidytext. Let's look at an example dataset and fit a model using just two topics.
library(tidyverse)
library(tidytext)
library(topicmodels)
data("AssociatedPress")
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
The model is now fit, so we can extract the probabilities.
ap_topics <- tidy(ap_lda, matrix = "beta")
ap_topics
#> # A tibble: 20,946 x 3
#>    topic term               beta
#>    <int> <chr>             <dbl>
#>  1     1 aaron      1.686917e-12
#>  2     2 aaron      3.895941e-05
#>  3     1 abandon    2.654910e-05
#>  4     2 abandon    3.990786e-05
#>  5     1 abandoned  1.390663e-04
#>  6     2 abandoned  5.876946e-05
#>  7     1 abandoning 2.454843e-33
#>  8     2 abandoning 2.337565e-05
#>  9     1 abbott     2.130484e-06
#> 10     2 abbott     2.968045e-05
#> # ... with 20,936 more rows
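As a quick sanity check on what beta means: within a single topic, the beta values over all terms form a probability distribution, so they should sum to 1. Here is a minimal sketch using a small hand-made table (toy_topics is hypothetical and only mimics the topic, term and beta columns of the tidied output); the same pipeline should apply unchanged to ap_topics.

```r
library(dplyr)

# Hypothetical toy table with the same columns as the tidied LDA output
toy_topics <- tibble(
  topic = rep(1:2, each = 3),
  term  = rep(c("bank", "money", "river"), times = 2),
  beta  = c(0.5, 0.4, 0.1,
            0.2, 0.1, 0.7)
)

# Sum beta over all terms within each topic
beta_totals <- toy_topics %>%
  group_by(topic) %>%
  summarise(total_beta = sum(beta))

beta_totals
```

If a total is far from 1 on a real model, the table has probably been filtered or joined before the check.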
In this output the two topics are interleaved term by term. Let's use dplyr to get the most probable terms for each topic.
ap_top_terms <- ap_topics %>%
  group_by(topic) %>%
  top_n(200, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
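In dplyr 1.0.0 and later, top_n() is superseded by slice_max(), so the same selection can be written as below. This is a sketch on a hypothetical toy table (toy_topics, with the same columns as ap_topics) that keeps only 2 terms per topic so the result is easy to inspect; for the real model you would presumably use n = 200 on ap_topics.

```r
library(dplyr)

# Hypothetical toy table with the same columns as ap_topics
toy_topics <- tibble(
  topic = rep(1:2, each = 4),
  term  = rep(c("bank", "money", "river", "water"), times = 2),
  beta  = c(0.40, 0.35, 0.15, 0.10,
            0.05, 0.10, 0.45, 0.40)
)

# Keep the 2 most probable terms within each topic
toy_top_terms <- toy_topics %>%
  group_by(topic) %>%
  slice_max(beta, n = 2) %>%
  ungroup() %>%
  arrange(topic, desc(beta))

toy_top_terms
```

Like top_n(), slice_max() keeps ties by default, so a topic can return more than n rows when several terms share the cutoff value.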
You can now use this to make a wordcloud (with some reshaping). The beta probability is what should determine how big the words are.
library(wordcloud)
library(reshape2)
ap_top_terms %>%
  mutate(topic = paste("topic", topic)) %>%
  acast(term ~ topic, value.var = "beta", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)
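comparison.cloud() draws all topics in a single plot. If you would rather have one separate cloud per topic, as the question asks, one option is to split the table by topic and call wordcloud() once per piece. The sketch below assumes a table shaped like ap_top_terms; toy_top_terms and terms_by_topic are hypothetical names introduced only for illustration.

```r
library(dplyr)
library(wordcloud)

# Hypothetical stand-in for ap_top_terms (columns: topic, term, beta)
toy_top_terms <- tibble(
  topic = rep(1:2, each = 4),
  term  = rep(c("bank", "money", "river", "water"), times = 2),
  beta  = c(0.40, 0.35, 0.15, 0.10,
            0.05, 0.10, 0.45, 0.40)
)

# One data frame per topic, named by topic number
terms_by_topic <- split(toy_top_terms, toy_top_terms$topic)

# Draw one cloud per topic; beta plays the role of word frequency
for (topic_id in names(terms_by_topic)) {
  topic_terms <- terms_by_topic[[topic_id]]
  wordcloud(
    words = topic_terms$term,
    freq = topic_terms$beta,
    min.freq = 0,        # betas are below 1, so do not apply a count threshold
    max.words = 100,
    random.order = FALSE # place the most probable terms first, near the centre
  )
  title(main = paste("Topic", topic_id))
}
```

With a model fit with k = 4, as in the question, the same loop would produce four clouds, one per topic.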