Interpretation of dfm_weight(scheme='prop') with groups (quanteda)


I'm looking at the different weighting options in dfm_weight(). If I select scheme = 'prop' and then group textstat_frequency() by location, what is the proper interpretation of a word's value in each group?

Say in New York the term career is 0.6 and in Boston the word team is 4.0. How should I interpret these numbers?

    corp <- corpus(df, text_field = "What are the areas that need the most improvement at our company?") %>%
        dfm(remove_numbers = TRUE, remove_punct = TRUE,
            remove = c(toRemove, stopwords("english")), ngrams = 1:2) %>%
        dfm_weight("prop") %>%
        dfm_replace(pattern = as.character(lemma$first), replacement = as.character(lemma$X1)) %>%
        dfm_remove(pattern = c(paste0("^", stopwords("english"), "_"),
                               paste0("_", stopwords("english"), "$")),
                   valuetype = "regex")
    freq_weight <- textstat_frequency(corp, n = 10, groups = c("location"))

    ggplot(data = freq_weight, aes(x = nrow(freq_weight):1, y = frequency)) +
        geom_bar(stat = "identity") +
        facet_wrap(~ group, scales = "free") +
        coord_flip() +
        scale_x_continuous(breaks = nrow(freq_weight):1,
                           labels = freq_weight$feature) +
        labs(x = NULL, y = "Relative frequency")

Solution

  • The proper interpretation is that each value is the sum, within a group, of the term's within-document proportions. This is not a very natural quantity: it adds up proportions without regard to how many tokens (in absolute frequency) each proportion was based on, and it grows with the number of documents in the group, which is how a value such as 4.0 can arise.

    Versions of quanteda before 1.4 disallowed this, but following a discussion we enabled it (user beware).

    library("quanteda")
    #> Package version: 1.4.3
    corp <- corpus(c("a b b c c", 
                     "a a b", 
                     "b b c",
                     "c c c d"),
                   docvars = data.frame(grp = c(1, 1, 2, 2)))
    dfmat <- dfm(corp) %>%
        dfm_weight(scheme = "prop")
    dfmat
    #> Document-feature matrix of: 4 documents, 4 features (43.8% sparse).
    #> 4 x 4 sparse Matrix of class "dfm"
    #>        features
    #> docs            a         b         c    d
    #>   text1 0.2000000 0.4000000 0.4000000 0   
    #>   text2 0.6666667 0.3333333 0         0   
    #>   text3 0         0.6666667 0.3333333 0   
    #>   text4 0         0         0.7500000 0.25
    

    Now we can compare the output of textstat_frequency() with and without groups. (Neither makes much sense.)

    # sum across the corpus
    textstat_frequency(dfmat, groups = NULL)
    #>   feature frequency rank docfreq group
    #> 1       c 1.4833333    1       3   all
    #> 2       b 1.4000000    2       3   all
    #> 3       a 0.8666667    3       2   all
    #> 4       d 0.2500000    4       1   all
    
    # sum within each group
    textstat_frequency(dfmat, groups = "grp")
    #>   feature frequency rank docfreq group
    #> 1       a 0.8666667    1       2     1
    #> 2       b 0.7333333    2       2     1
    #> 3       c 0.4000000    3       1     1
    #> 4       c 1.0833333    1       2     2
    #> 5       b 0.6666667    2       1     2
    #> 6       d 0.2500000    3       1     2
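
To see where these grouped numbers come from, the following base R sketch rebuilds them by hand, without quanteda. The `counts` matrix is typed in manually from the four toy documents above, and the names `counts`, `props`, and `summed` are illustrative only.

```r
# Term counts for the four toy documents (rows) and features a-d (columns),
# entered by hand from the example texts.
counts <- rbind(
  text1 = c(a = 1, b = 2, c = 2, d = 0),
  text2 = c(a = 2, b = 1, c = 0, d = 0),
  text3 = c(a = 0, b = 2, c = 1, d = 0),
  text4 = c(a = 0, b = 0, c = 3, d = 1)
)
grp <- c(1, 1, 2, 2)

# Within-document proportions: each row is divided by its own total.
props <- counts / rowSums(counts)

# Add the per-document proportions together within each group.
summed <- rowsum(props, group = grp)
round(summed, 4)

# Each row sums to 2, the number of documents in the group, not to 1.
rowSums(summed)
```

Because every document contributes proportions that total 1, a group's values total the number of documents in it, so a single term in a large group can easily exceed 1.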
    

    If what you wanted was the relative term frequencies after grouping, then you can first group the dfm and then weight it, like this:

    dfmat2 <- dfm(corp) %>%
        dfm_group(groups = "grp") %>%
        dfm_weight(scheme = "prop")
    
    textstat_frequency(dfmat2, groups = "grp")
    #>   feature frequency rank docfreq group
    #> 1       a 0.3750000    1       1     1
    #> 2       b 0.3750000    1       1     1
    #> 3       c 0.2500000    3       1     1
    #> 4       c 0.5714286    1       1     2
    #> 5       b 0.2857143    2       1     2
    #> 6       d 0.1428571    3       1     2
    

    Now the term frequencies sum to 1.0 within each group, making their interpretation more natural because they were computed on grouped counts, not grouped proportions.
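
The same check can be done by hand for the group-then-weight order. This is a minimal base R sketch, again with a manually entered `counts` matrix and illustrative variable names, not quanteda's own implementation.

```r
# Term counts for the four toy documents, entered by hand.
counts <- rbind(
  text1 = c(a = 1, b = 2, c = 2, d = 0),
  text2 = c(a = 2, b = 1, c = 0, d = 0),
  text3 = c(a = 0, b = 2, c = 1, d = 0),
  text4 = c(a = 0, b = 0, c = 3, d = 1)
)
grp <- c(1, 1, 2, 2)

# Pool the raw counts by group first ...
pooled_counts <- rowsum(counts, group = grp)

# ... then convert each group's pooled counts to proportions.
pooled_props <- pooled_counts / rowSums(pooled_counts)
round(pooled_props, 4)

# Each group's proportions now sum to 1.
rowSums(pooled_props)
```

For group 1 the pooled counts are a = 3, b = 3, c = 2 out of 8 tokens, giving 0.375, 0.375, and 0.25, which matches the grouped output above.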