Tags: r, ggplot2, lda, tibble, topic-modeling

Limit the output per topic to a specific number of words


After running LDA topic modeling in R, some words have the same beta value. They are therefore all included when plotting the top terms, which leads to overlapping and sometimes unreadable plots.

Is there a way to limit the number of words displayed per topic? In my dummy data set, some words have the same beta value. I would like to tell R to display only 3 words per topic, or any other number I specify.

Currently the code I am using to plot the results looks like this:

top_terms %>% # take the top terms
      group_by(topic) %>%
      mutate(top_term = term[which.max(beta)]) %>% 
      mutate(term = reorder(term, beta)) %>% 
      head(3) %>% # I tried this but that only works for the first topic
      ggplot(aes(term, beta, fill = factor(topic))) + 
      geom_col(show.legend = FALSE) + 
      facet_wrap(~ top_term, scales = "free") + 
      labs(x = NULL, y = "Beta") + # no x label, change y label
      coord_flip() # turn bars sideways

I tried to solve the issue with head(3), which worked, but only for the first topic. I need something similar that does not drop all the other topics.

Best regards. Stay safe, stay healthy.

Note: top_terms is a tibble.

Sample data:

topic   term        beta
(int)   (chr)       (dbl)
1       book        0.9876
1       page        0.9765
1       chapter     0.9654
1       author      0.9654
2       sports      0.8765
2       soccer      0.8654
2       champions   0.8543
2       victory     0.8543
3       music       0.9543
3       song        0.8678
3       artist      0.7231
3       concert     0.7231
4       movie       0.9846
4       cinema      0.9647
4       cast        0.8878
4       story       0.8878

dput of sample data

top_terms <- structure(list(topic = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 
  3L, 3L, 3L, 4L, 4L, 4L, 4L), term = c("book", "page", "chapter", 
    "author", "sports", "soccer", "champions", "victory", "music", 
    "song", "artist", "concert", "movie", "cinema", "cast", "story"
  ), beta = c(0.9876, 0.9765, 0.9654, 0.9654, 0.8765, 0.8654, 0.8543, 
    0.8543, 0.9543, 0.8678, 0.7231, 0.7231, 0.9846, 0.9647, 0.8878, 
    0.8878)), row.names = c(NA, -16L), class = "data.frame")

Solution

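  • Use slice_head() after a group_by() on the grouping field instead of head(). head() ignores the grouping and returns the first rows of the whole table, whereas slice_head() returns the first n rows of each group.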

    top_terms %>% # take the top terms
      group_by(topic) %>%
      mutate(top_term = term[which.max(beta)]) %>% 
      mutate(term = reorder(term, beta)) %>% 
      group_by(top_term) %>%
      slice_head(n=3) %>% # keep the first 3 rows per group (rows are assumed to be sorted by beta already, as in the sample data)
      ggplot(aes(term, beta, fill = factor(topic))) + 
      geom_col(show.legend = FALSE) + 
      facet_wrap(~ top_term, scales = "free") + 
      labs(x = NULL, y = "Beta") + # no x label, change y label
      coord_flip()
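
    If the rows are not already sorted by beta within each topic, slice_head() will not necessarily pick the highest values. A minimal sketch of an alternative, assuming dplyr 1.0.0 or later: slice_max() ranks by beta within each group, and with_ties = FALSE caps the result at exactly n rows even when beta values tie. The object names first_three, per_topic_head, per_topic_ties and per_topic_max are only illustrative; the data is the sample data from the question.

```r
library(dplyr)

# Sample data from the question, rebuilt here so the block runs on its own.
top_terms <- data.frame(
  topic = rep(1:4, each = 4),
  term = c("book", "page", "chapter", "author",
           "sports", "soccer", "champions", "victory",
           "music", "song", "artist", "concert",
           "movie", "cinema", "cast", "story"),
  beta = c(0.9876, 0.9765, 0.9654, 0.9654,
           0.8765, 0.8654, 0.8543, 0.8543,
           0.9543, 0.8678, 0.7231, 0.7231,
           0.9846, 0.9647, 0.8878, 0.8878)
)

# head() is not group-aware: it returns the first 3 rows of the whole table.
first_three <- top_terms %>% group_by(topic) %>% head(3)

# slice_head() works per group, but takes rows in their current order.
per_topic_head <- top_terms %>% group_by(topic) %>% slice_head(n = 3)

# slice_max() with the default with_ties = TRUE keeps tied rows,
# so a group can return more than n rows.
per_topic_ties <- top_terms %>% group_by(topic) %>% slice_max(beta, n = 3)

# with_ties = FALSE returns exactly n rows per group, regardless of ties.
per_topic_max <- top_terms %>%
  group_by(topic) %>%
  slice_max(beta, n = 3, with_ties = FALSE) %>%
  ungroup()

# Number of terms kept per topic.
count(per_topic_max, topic)
```

    per_topic_max can be piped into the same mutate() and ggplot() steps as above. Which of the tied terms is kept depends on row order, so if that matters, sort the data first or break ties explicitly.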
    
