
quanteda: producing an output for several targets using textstat_keyness similarly to textstat_frequency


I have a corpus with 2 document variables: group and interest.

I want to identify the key features for a given combination of interest and group (the target) versus the rest of the corpus using textstat_keyness, and then repeat this for every possible combination of interest and group.

I can easily do this once in the following way:

data_trim@docvars$focus <- 0 
data_trim@docvars$focus[data_trim@docvars$group=="One of the Groups" 
& data_trim@docvars$interest=="One of the interests"] <- 1 #I created the


keyness <- dfm(data_trim, groups = "focus")
k_sustainability <- textstat_keyness(keyness, target ="1")

However, I'd like to find an easy way to produce results for every possible combination of group and interest without doing this manually each time.

I know that the textstat_frequency function lets me set groups = c("group", "interest") and produces an output with the most frequent words for every combination of "group" & "interest". Is there any way to do the same with textstat_keyness?

(Here is an example of what the textstat_frequency output looks like.)

textstat_frequency(dt_tfidf, n = 20, groups = c("group", "interest"), force=TRUE)

feature     frequency   rank      docfreq      group
 ...          ...        1          ..         group1 & interest1
                         2                     group1 & interest1
                         3                     group1 & interest1
                         .                     ....
                         .
                         1                     group2 & interest1
                         2                     group2 & interest1
                         .                      .
                         .
                         18                     group100 & interest100
                         19                     group100 & interest100
                         20                     group100 & interest100      

So I want something similar from textstat_keyness, to obtain output like the following: the top 20 scoring features for each combination, with the corresponding combination identifiable through rank and group columns, as in textstat_frequency:

feature    chi2         p           n_target        n_reference rank    group
 ...         ..             .. ...      ..           ...          1.    group1 & interest1
  ....

Solution

  • textstat_keyness() identifies the keywords most associated with a target "document" from a dfm, compared to all other documents. So for your comparison of combinations, you will need first to create a grouped dfm according to the combination of your groups. Then you can loop through the dfm and create a comparison for each group against all others.

    Here is how to do it:

    library("quanteda")
    ## Package version: 1.5.2
    
    docvars(data_corpus_irishbudget2010, "govopp") <-
      ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
        "Government", "Opposition"
      )
    dfmatgr <- dfm(data_corpus_irishbudget2010, groups = c("govopp", "party"))
    head(dfmatgr, nf = 5)
    ## Document-feature matrix of: 5 documents, 5 features (16.0% sparse) and 4 docvars.
    ##                   features
    ## docs               when  i presented the supplementary
    ##   Government.FF       9 90         1 933             7
    ##   Opposition.FG      19 42         1 802             1
    ##   Government.Green    9 33         0 224             0
    ##   Opposition.LAB     25 76         0 856             0
    ##   Opposition.SF      28 31         1 783             2
    

    That creates the groups, which in this case are by government and opposition and party. (Your example may have overlapping groups, but since I cannot reproduce it, I used an example from one of the built-in corpora.)

    Now we create a data.frame and loop through the targets. For presentation I've only recorded the top 2 keywords, but you can change this to 20 or whatever you prefer.

    df <- data.frame()
    for (g in docnames(dfmatgr)) {
      df_temp <- head(textstat_keyness(dfmatgr, target = g), 2)
      df_temp[["target"]] <- g
      df <- rbind(df, df_temp)
    }
    
    df
    ##         feature      chi2            p n_target n_reference           target
    ## 1          2010  59.14626 1.465494e-14       41          14    Government.FF
    ## 2        scheme  51.30431 7.910339e-13       31           8    Government.FF
    ## 11    taoiseach 116.89154 0.000000e+00       49          18    Opposition.FG
    ## 21          not  39.15371 3.917181e-10      127         259    Opposition.FG
    ## 12 enterprising 102.69108 0.000000e+00        9           0 Government.Green
    ## 22           we  67.76028 2.220446e-16       97         521 Government.Green
    ## 13       fianna  66.59467 3.330669e-16       47          25   Opposition.LAB
    ## 23         fáil  65.23011 6.661338e-16       44          22   Opposition.LAB
    ## 14        state  59.81864 1.043610e-14       42          32    Opposition.SF
    ## 24         care  32.97959 9.313121e-09       18          10    Opposition.SF
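
    To get closer to the layout asked for in the question (a rank and a group column next to each feature), the same loop can be written with lapply(). This is only a sketch under the assumptions of the example above: quanteda 1.5.x, the built-in Irish budget corpus, and the same grouped dfm. The names top_n, keyness_list and keyness_all are made up for illustration. With your own corpus you would presumably build the grouped dfm with groups = c("group", "interest") instead.

    ```r
    library("quanteda")  # assumes the 1.5.x API used in the example above

    # Rebuild the grouped dfm from the example so this block runs on its own
    docvars(data_corpus_irishbudget2010, "govopp") <-
      ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
        "Government", "Opposition"
      )
    dfmatgr <- dfm(data_corpus_irishbudget2010, groups = c("govopp", "party"))

    top_n <- 20  # illustrative: number of keywords to keep per target

    # One keyness table per target, each compared against all other groups
    keyness_list <- lapply(docnames(dfmatgr), function(g) {
      res <- as.data.frame(head(textstat_keyness(dfmatgr, target = g), top_n))
      res[["rank"]] <- seq_len(nrow(res))  # rank within this target
      res[["group"]] <- g                  # which combination was the target
      res
    })

    # Stack the per-target tables into one data.frame
    keyness_all <- do.call(rbind, keyness_list)
    rownames(keyness_all) <- NULL
    head(keyness_all, 3)
    ```

    Collecting the pieces in a list and binding them once avoids growing a data.frame inside the loop, and resetting the row names removes the 11, 21, ... labels seen in the output above. In later quanteda releases some of these objects live in companion packages (the corpus in quanteda.textmodels, textstat_keyness() in quanteda.textstats), so the library() calls may need adjusting there.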
    

    Note: I strongly discourage accessing docvars through the object's slots, e.g. data_trim@docvars$focus. Use docvars() instead, since this will always work, whereas code that reaches into the slots will break if we change the object structure. (And in the forthcoming v2, we make it possible to access docvars using $.)
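
    As a rough illustration of that note, the focus flag from the question could be set through docvars() along these lines. The corpus here is a made-up stand-in (texts, document names and the values "A"/"B" and "x"/"y" are invented), since the original data_trim object is not available.

    ```r
    library("quanteda")

    # Stand-in corpus with the two document variables from the question;
    # all texts and values are invented for illustration
    corp <- corpus(
      c(d1 = "alpha beta beta", d2 = "beta gamma", d3 = "gamma gamma delta"),
      docvars = data.frame(
        group = c("A", "A", "B"),
        interest = c("x", "y", "x"),
        stringsAsFactors = FALSE
      )
    )

    # Flag the target combination via the docvars() accessor, not the @docvars slot
    docvars(corp, "focus") <- as.integer(
      docvars(corp, "group") == "A" & docvars(corp, "interest") == "x"
    )
    docvars(corp, "focus")
    ```

    Only the first document matches both conditions, so it is the only one flagged with 1.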