Tags: r, text, tidytext, latent-semantic-indexing, latent-semantic-analysis

Topic Modelling: LDA, word frequency in each topic, and wordcloud


Question: How can I compute (in code) the frequency of words in each topic? My goal is to create a word cloud from each topic.

P.S. I have no problem with wordcloud itself.

From the code,

  burnin <- 4000   # initial Gibbs iterations to discard (not collected)
  iter <- 4000     # Gibbs iterations to run
  thin <- 500      # iterations skipped between saved draws
  seed <- list(2017, 5, 63, 100001, 765)  # one seed per start
  nstart <- 5      # number of independent runs
  best <- TRUE     # keep only the run with the highest posterior likelihood
  # Number of topics:
  k <- 4
  LDA_results <- LDA(DTM, k, method = "Gibbs",
                     control = list(nstart = nstart, seed = seed, best = best,
                                    burnin = burnin, iter = iter, thin = thin))

Thank you. (I have tried to keep the question as concise as possible; if you need further details, I can add more.)


Solution

  • If you want to create a wordcloud for each topic, what you need are the top terms for each topic, i.e., the words most likely to be generated from each topic. That probability is called beta; it is the per-topic-per-word probability. The higher a word's beta within a topic, the more likely that topic is to generate that word.

    You can extract the beta probabilities from your LDA topic model as a tidy data frame using tidy() from tidytext. Let's look at an example dataset and fit a model using just two topics.

    library(tidyverse)
    library(tidytext)
    library(topicmodels)  # provides LDA() and the AssociatedPress document-term matrix

    data("AssociatedPress")
    # fit a two-topic model; the seed makes the result reproducible
    ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
    

    You've fit the model! Now we can extract the probabilities.

    ap_topics <- tidy(ap_lda, matrix = "beta")
    
    ap_topics
    #> # A tibble: 20,946 x 3
    #>    topic       term         beta
    #>    <int>      <chr>        <dbl>
    #>  1     1      aaron 1.686917e-12
    #>  2     2      aaron 3.895941e-05
    #>  3     1    abandon 2.654910e-05
    #>  4     2    abandon 3.990786e-05
    #>  5     1  abandoned 1.390663e-04
    #>  6     2  abandoned 5.876946e-05
    #>  7     1 abandoning 2.454843e-33
    #>  8     2 abandoning 2.337565e-05
    #>  9     1     abbott 2.130484e-06
    #> 10     2     abbott 2.968045e-05
    #> # ... with 20,936 more rows
    
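    The same call should work on your own model from the question, since tidy() accepts LDA objects fit with topicmodels, including Gibbs-fitted ones. As a rough sketch (assuming LDA_results is the object from your code above), and noting that topicmodels itself can also return the same per-topic word probabilities via posterior():

    # tidy data frame of beta probabilities for the model in the question
    my_topics <- tidy(LDA_results, matrix = "beta")

    # equivalently, a k-by-vocabulary matrix of per-topic word probabilities
    beta_matrix <- posterior(LDA_results)$terms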

    In ap_topics, the terms for both topics are mixed together. Let's use dplyr to get the most probable terms for each topic.

    ap_top_terms <- ap_topics %>%
      group_by(topic) %>%
      top_n(200, beta) %>%   # keep the 200 highest-beta terms within each topic
      ungroup() %>%
      arrange(topic, -beta)
    

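    As a quick sanity check (a sketch using the objects above): within each topic, beta is a probability distribution over the whole vocabulary, so it sums to 1 on the unfiltered ap_topics; the top_n() step above just keeps the largest slice of that distribution.

    ap_topics %>%
      group_by(topic) %>%
      summarise(total_beta = sum(beta))   # each topic's total should be ~1
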
    You can now use ap_top_terms to make a wordcloud (after a bit of reshaping). The beta probability is what should determine how big each word is drawn.

    library(wordcloud)
    library(reshape2)
    
    ap_top_terms %>%
      mutate(topic = paste("topic", topic)) %>%
      # reshape into a term-by-topic matrix of beta values
      acast(term ~ topic, value.var = "beta", fill = 0) %>%
      comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                       max.words = 100)
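
    If you would rather have one wordcloud per topic than a single comparison cloud, a minimal sketch (using the objects built above and the plain wordcloud() function) is to subset the tidy data frame by topic and plot each subset, again sizing words by beta:

    for (topic_id in unique(ap_top_terms$topic)) {
      topic_terms <- ap_top_terms %>% filter(topic == topic_id)
      wordcloud(words = topic_terms$term,
                freq = topic_terms$beta,
                min.freq = 0,       # betas are all < 1, so don't drop "low-frequency" words
                max.words = 100,
                scale = c(3, 0.5))
    }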