Tags: r, nlp, quanteda, frequency-analysis

How to find and plot frequencies of multiple phrases totalled?


I have a corpus and am trying to find the frequencies of multiple phrases, totalled by year, and to plot the result. For example, if the phrases "American economy" and "Canadian economy" are each mentioned 2 times in 2004, I would want this to give a frequency of 4 for 2004.

I have managed to do this for single tokens, but I am having trouble doing it for phrases. This is the code I used for single tokens.

a_corpus <- corpus(df, text_field = "text")

my_dict <- dictionary(list(america = c("America", "President")))
                      
freq_grouped_creators <- textstat_frequency(dfm(tokens(a_corpus)), 
                               groups = a_corpus$Year)

freq_word_creators <- subset(freq_grouped_creators, freq_grouped_creators$feature %in% my_dict$america)

# collapsing rows by year to total frequencies for tokens
freq_word_creators_2 <- freq_word_creators %>% 
                           group_by(group) %>%
                           summarize(Sum_frequency = sum(frequency))

# plotting
ggplot(freq_word_creators_2, aes(x = group, y = 
    Sum_frequency)) +
    geom_point() +
    scale_y_continuous(limits = c(0, 300), breaks = c(seq(0, 300, 30))) +
    xlab(NULL) +
    ylab("Frequency") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

Solution

  • No need to manipulate the frequencies in dplyr: a simpler approach is to select the phrases, then create a dfm that you convert to a data.frame for use directly with ggplot2.

    library("quanteda")
    ## Package version: 3.0.9000
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    library("quanteda.textstats")
    
    a_corpus <- tail(data_corpus_inaugural, 10)
    
    economic_phrases <- c("middle class", "social security", "strong economy")
    toks <- tokens(a_corpus)
    toks <- tokens_compound(toks, phrase(economic_phrases), concatenator = " ") %>%
      tokens_keep(economic_phrases)
    dfmat <- dfm(toks)
    dfmat
    ## Document-feature matrix of: 10 documents, 2 features (65.00% sparse) and 4 docvars.
    ##               features
    ## docs           middle class social security
    ##   1985-Reagan             0               0
    ##   1989-Bush               0               0
    ##   1993-Clinton            0               0
    ##   1997-Clinton            2               0
    ##   2001-Bush               0               1
    ##   2005-Bush               0               1
    ## [ reached max_ndoc ... 4 more documents ]
    
    freq_word_creators_2 <- data.frame(convert(dfmat, to = "data.frame"), Year = dfmat$Year)
    
    # plotting
    library("ggplot2")
    ggplot(freq_word_creators_2, aes(x = Year, y = middle.class)) +
      geom_point() +
      # scale_y_continuous(limits = c(0, 300), breaks = c(seq(0, 300, 30))) +
      xlab(NULL) +
      ylab("Frequency") +
      theme(axis.text.x = element_text(angle = 90, hjust = 1))
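To get the totalled frequency the question actually asks for (all phrases summed per document), you can add up the phrase columns of the converted data.frame before plotting. A minimal base-R sketch, using made-up numbers that mirror the dfm output above; the `freq_df` object and its column names are illustrative assumptions, not quanteda output:

```r
# Hypothetical data.frame mimicking data.frame(convert(dfmat, to = "data.frame"), Year = dfmat$Year)
freq_df <- data.frame(
  doc_id = c("1997-Clinton", "2001-Bush", "2005-Bush"),
  middle.class = c(2, 0, 0),
  social.security = c(0, 1, 1),
  Year = c(1997, 2001, 2005)
)

# Sum every phrase column (everything except doc_id and Year) per document
phrase_cols <- setdiff(names(freq_df), c("doc_id", "Year"))
freq_df$total <- unname(rowSums(freq_df[, phrase_cols, drop = FALSE]))

freq_df$total
## [1] 2 1 1
```

You could then map `y = total` in the `aes()` call instead of a single phrase column such as `middle.class`.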