Tags: r, text, tidytext, latent-semantic-indexing, latent-semantic-analysis

Topic Modelling: LDA, word frequency in each topic, and wordcloud


Question: How can I compute (in code) the frequency of words in each topic? My goal is to create a word cloud from each topic.

P.S. I have no problem with wordcloud itself.

From the code,

  burnin <- 4000   # initial Gibbs iterations to discard (not collected)
  iter <- 4000     # Gibbs iterations to run
  thin <- 500      # iterations skipped between saved draws
  seed <- list(2017, 5, 63, 100001, 765)  # one seed per start
  nstart <- 5      # number of independent runs
  best <- TRUE     # keep only the run with the highest posterior likelihood
  # Number of topics:
  k <- 4
  LDA_results <- LDA(DTM, k, method = "Gibbs",
                     control = list(nstart = nstart, seed = seed, best = best,
                                    burnin = burnin, iter = iter, thin = thin))

Thank you. (I have tried to keep the question as concise as possible; if you need further details, I can add more.)


Solution

  • If you want to create a wordcloud for each topic, what you need are the top terms for each topic, i.e., the words most likely to be generated from each topic. That probability is called beta; it is the per-topic-per-word probability. The higher a word's beta within a topic, the more likely that topic is to generate that word.

    You can extract the beta probabilities from your LDA topic model as a tidy data frame using tidy() from tidytext. Let's look at an example dataset and fit a model using just two topics.

    library(tidyverse)
    library(tidytext)
    library(topicmodels)  # provides LDA() and the AssociatedPress document-term matrix

    data("AssociatedPress")
    # fit a two-topic model; the seed makes the result reproducible
    ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 1234))
    

    You've fit the model! Now we can extract the probabilities.

    ap_topics <- tidy(ap_lda, matrix = "beta")
    
    ap_topics
    #> # A tibble: 20,946 x 3
    #>    topic       term         beta
    #>    <int>      <chr>        <dbl>
    #>  1     1      aaron 1.686917e-12
    #>  2     2      aaron 3.895941e-05
    #>  3     1    abandon 2.654910e-05
    #>  4     2    abandon 3.990786e-05
    #>  5     1  abandoned 1.390663e-04
    #>  6     2  abandoned 5.876946e-05
    #>  7     1 abandoning 2.454843e-33
    #>  8     2 abandoning 2.337565e-05
    #>  9     1     abbott 2.130484e-06
    #> 10     2     abbott 2.968045e-05
    #> # ... with 20,936 more rows
    
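    The same call should work on your own model from the question, since tidy() accepts LDA objects fit with topicmodels, including Gibbs-fitted ones. As a rough sketch (assuming LDA_results is the object from your code above), and noting that topicmodels itself can also return the same per-topic word probabilities via posterior():

    # tidy data frame of beta probabilities for the model in the question
    my_topics <- tidy(LDA_results, matrix = "beta")

    # equivalently, a k-by-vocabulary matrix of per-topic word probabilities
    beta_matrix <- posterior(LDA_results)$terms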

    In ap_topics, the terms for both topics are mixed together. Let's use dplyr to get the most probable terms for each topic.

    ap_top_terms <- ap_topics %>%
      group_by(topic) %>%
      top_n(200, beta) %>%   # keep the 200 highest-beta terms within each topic
      ungroup() %>%
      arrange(topic, -beta)
    

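    As a quick sanity check (a sketch using the objects above): within each topic, beta is a probability distribution over the whole vocabulary, so it sums to 1 on the unfiltered ap_topics; the top_n() step above just keeps the largest slice of that distribution.

    ap_topics %>%
      group_by(topic) %>%
      summarise(total_beta = sum(beta))   # each topic's total should be ~1
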
    You can now use ap_top_terms to make a wordcloud (after a bit of reshaping). The beta probability is what should determine how big each word is drawn.

    library(wordcloud)
    library(reshape2)
    
    ap_top_terms %>%
      mutate(topic = paste("topic", topic)) %>%
      # reshape into a term-by-topic matrix of beta values
      acast(term ~ topic, value.var = "beta", fill = 0) %>%
      comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                       max.words = 100)
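
    If you would rather have one wordcloud per topic than a single comparison cloud, a minimal sketch (using the objects built above and the plain wordcloud() function) is to subset the tidy data frame by topic and plot each subset, again sizing words by beta:

    for (topic_id in unique(ap_top_terms$topic)) {
      topic_terms <- ap_top_terms %>% filter(topic == topic_id)
      wordcloud(words = topic_terms$term,
                freq = topic_terms$beta,
                min.freq = 0,       # betas are all < 1, so don't drop "low-frequency" words
                max.words = 100,
                scale = c(3, 0.5))
    }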