r quanteda

Understanding how dfm_group works with no group added


Building off of this question: Interpretation of dfm_weight(scheme='prop') with groups (quanteda)

If I have the function:

plot_topterms = function(data,text_field,n,...){

  corp=corpus(data,text_field = text_field) %>% 
    dfm(remove_numbers=T,remove_punct=T,remove=c(stopwords('english')),ngrams=1:2) %>%
    dfm_weight(scheme ='prop') %>% 
    dfm_group(groups=...) %>% 
    dfm_replace(pattern=as.character(lemma$first),replacement = as.character(lemma$X1)) %>% 
    dfm_remove(pattern = c(paste0("^", stopwords("english"), "_"), paste0("_", stopwords("english"), "$")), valuetype = "regex") %>% 
    dfm_remove(toRemove)
  freq_weight <- textstat_frequency(corp, n = n)

  ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
    geom_bar(stat='identity')+
    facet_wrap(~ group, scales = "free") +
    coord_flip() +
    scale_x_continuous(breaks = nrow(freq_weight):1,
                       labels = freq_weight$feature) +
    #scale_y_continuous(labels = scales::percent)+
    theme(text = element_text(size=20))+
    labs(x = NULL, y = "Relative frequency")
}

and I don't pass in a grouping variable, so I call it like this:

plot_topterms(df,textField,n=10)

I get an output with the group variable equal to "all". This should be equivalent to not even having the dfm_group line, correct? And if that's the case, if I have a relative frequency of 60 for the word fun, does this mean that 60% of all documents contain that word?


Solution

  • Your interpretation of the "all" group is correct. When groups is not specified in textstat_frequency(), the group defaults to "all". Your function never passes a groups argument to textstat_frequency(), so the group will always be "all", even if you have already grouped the dfm through the dfm_group() call inside plot_topterms().

    A value of 60 for a feature in this plot means that the within-document relative term frequencies for this feature sum to 60 across documents. The question you reference above shows how this works for a simple example: a's relative frequency was 0.20 in text1 and 0.67 in text2, so textstat_frequency() sums those two into 0.87. Your 60 is analogous to this 0.87.
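
    The arithmetic can be reproduced in base R. This is a minimal sketch with made-up counts, chosen only so that a has the proportions 0.20 and 0.67 quoted above; they are not the data from the referenced question. Plain matrix operations stand in here for dfm_weight(scheme = "prop") and for the frequency column of textstat_frequency().

    ```r
    # Hypothetical counts, picked to reproduce the 0.20 and 0.67 quoted above
    counts <- matrix(
      c(1, 2, 2,   # text1: 5 tokens in total, one of them "a"
        2, 1, 0),  # text2: 3 tokens in total, two of them "a"
      nrow = 2, byrow = TRUE,
      dimnames = list(c("text1", "text2"), c("a", "b", "c"))
    )

    # Within-document proportions (each row sums to 1)
    prop <- counts / rowSums(counts)

    # Sum each feature's proportions over all documents
    rel_freq_sum <- colSums(prop)

    # "a" comes out as 1/5 + 2/3, which rounds to 0.87
    print(round(rel_freq_sum, 2))
    ```

    Because every document contributes at most 1 to this sum, the result is bounded by the number of documents, not by 100, which is why it cannot be read as a percentage.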

    This is not the same as document frequency, which is the number of documents in which a feature occurred at least once. If you want the features' document frequencies (which was your interpretation), then you should plot docfreq from the textstat_frequency() return value, not frequency. To express it as a share of documents, divide docfreq by the number of documents.
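
    To make the distinction concrete, here is a small base R sketch, again with hypothetical counts and not real data. It computes, for the same feature, the summed within-document proportions (analogous to frequency on a proportion-weighted dfm), the document frequency (analogous to docfreq), and the share of documents containing the feature, which is the quantity the question asks about.

    ```r
    # Hypothetical counts for four documents and two features
    counts <- matrix(
      c(3, 1,
        0, 4,
        1, 1,
        0, 2),
      nrow = 4, byrow = TRUE,
      dimnames = list(paste0("text", 1:4), c("fun", "other"))
    )

    # Within-document proportions, then their sum per feature
    prop <- counts / rowSums(counts)
    summed_prop <- colSums(prop)

    # Number of documents in which each feature occurs at least once
    doc_freq <- colSums(counts > 0)

    # Share of documents containing each feature
    doc_share <- doc_freq / nrow(counts)

    print(rbind(summed_prop, doc_freq, doc_share))
    ```

    For "fun" the summed proportion is 1.25, while the document frequency is 2 and the document share is 0.5, so the three numbers answer different questions.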

    I would note, however, that plot_topterms() is not a well-designed function.

    • It relies on several variables that are not local to the function, namely toRemove and lemma.

    • It will not correctly pass the ... in the dfm_group() call. You should explicitly specify a groups argument in the function signature instead.

    If we were designing a new function for the package, it would be textplot_frequency(), which would take the return value of textstat_frequency() and implement essentially just the ggplot() call, after the user has built the textstat_frequency object. This could make smarter use of the group variable built into every textstat_frequency object, so that objects whose only group is "all" are plotted as a single facet.
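
    One possible shape for such a helper is sketched below. It is only an illustration of the idea described above, not a function that exists in quanteda: the name textplot_frequency(), its y_label argument, and the choice to skip faceting when there is a single group are all assumptions. It expects a data frame with the feature, frequency and group columns used in the question's plot, and the input here is a hand-built stand-in with made-up values.

    ```r
    library(ggplot2)

    # Hypothetical helper (not part of quanteda): plots a data frame that has
    # the feature, frequency and group columns of a textstat_frequency() result
    textplot_frequency <- function(freq, y_label = "Frequency") {
      stopifnot(all(c("feature", "frequency", "group") %in% names(freq)))

      # One bar position per row, so a feature may appear in several groups
      freq$position <- rev(seq_len(nrow(freq)))

      p <- ggplot(freq, aes(x = position, y = frequency)) +
        geom_col() +
        coord_flip() +
        scale_x_continuous(breaks = freq$position, labels = freq$feature) +
        labs(x = NULL, y = y_label)

      # Add facets only when there is more than one group
      if (length(unique(freq$group)) > 1) {
        p <- p + facet_wrap(~ group, scales = "free")
      }
      p
    }

    # Stand-in for a textstat_frequency() result (made-up values)
    freq <- data.frame(
      feature = c("fun", "great", "slow"),
      frequency = c(60, 42.5, 17),
      group = "all",
      stringsAsFactors = FALSE
    )

    p <- textplot_frequency(freq, y_label = "Summed relative frequency")
    ```

    Keeping the plotting step separate means the grouping decision is made once, when the textstat_frequency object is built, and the plot simply reflects whatever groups that object contains.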