I have a corpus with 2 document variables: group and interest.
I want to identify the key features for a given combination of interest and group (the target) versus the rest of the corpus using textstat_keyness, and to repeat this for every possible combination of interest and group.
I can easily do this once in the following way:
data_trim@docvars$focus <- 0
data_trim@docvars$focus[data_trim@docvars$group=="One of the Groups"
& data_trim@docvars$interest=="One of the interests"] <- 1 #I created the
keyness <- dfm(data_trim, groups = "focus")
k_sustainability <- textstat_keyness(keyness, target ="1")
However, I'd like an easy way to produce results for every possible combination of group and interest without doing this manually each time.
I know that the textstat_frequency function lets me set groups = c("group", "interest") and produces an output with the most frequent words for each combination of "group" and "interest".
Is there any way to do the same with textstat_keyness?
(Here is an example of what the textstat_frequency output looks like.)
textstat_frequency(dt_tfidf, n = 20, groups = c("group", "interest"), force=TRUE)
feature frequency rank docfreq group
... ... 1 .. group1 & interest1
2 group1 & interest1
3 group1 & interest1
. ....
.
1 group2 & interest1
2 group2 & interest1
. .
.
18 group100 & interest100
19 group100 & interest100
20 group100 & interest100
I want something similar from textstat_keyness: the top 20 scoring features for each combination, identifiable by rank and group columns as in the textstat_frequency output, like this:
feature chi2 p n_target n_reference rank group
... .. .. ... .. ... 1. group1 & interest1
....
textstat_keyness() identifies the keywords most associated with a target "document" from a dfm, compared to all other documents. So for your comparison of combinations, you will need first to create a grouped dfm according to the combination of your groups. Then you can loop through the dfm and create a comparison for each group against all others.
Here is how to do it:
library("quanteda")
## Package version: 1.5.2
docvars(data_corpus_irishbudget2010, "govopp") <-
ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
"Government", "Opposition"
)
dfmatgr <- dfm(data_corpus_irishbudget2010, groups = c("govopp", "party"))
head(dfmatgr, nf = 5)
## Document-feature matrix of: 5 documents, 5 features (16.0% sparse) and 4 docvars.
## features
## docs when i presented the supplementary
## Government.FF 9 90 1 933 7
## Opposition.FG 19 42 1 802 1
## Government.Green 9 33 0 224 0
## Opposition.LAB 25 76 0 856 0
## Opposition.SF 28 31 1 783 2
That creates the groups, which in this case are by government and opposition and party. (Your example may have overlapping groups, but since I cannot reproduce it, I used an example from one of the built-in corpora.)
Now we create a data.frame and loop through the targets. For presentation I've only recorded the top 2 keywords, but you can change this to 20 or whatever you prefer.
df <- data.frame()
for (g in docnames(dfmatgr)) {
df_temp <- head(textstat_keyness(dfmatgr, target = g), 2)
df_temp[["target"]] <- g
df <- rbind(df, df_temp)
}
df
## feature chi2 p n_target n_reference target
## 1 2010 59.14626 1.465494e-14 41 14 Government.FF
## 2 scheme 51.30431 7.910339e-13 31 8 Government.FF
## 11 taoiseach 116.89154 0.000000e+00 49 18 Opposition.FG
## 21 not 39.15371 3.917181e-10 127 259 Opposition.FG
## 12 enterprising 102.69108 0.000000e+00 9 0 Government.Green
## 22 we 67.76028 2.220446e-16 97 521 Government.Green
## 13 fianna 66.59467 3.330669e-16 47 25 Opposition.LAB
## 23 fáil 65.23011 6.661338e-16 44 22 Opposition.LAB
## 14 state 59.81864 1.043610e-14 42 32 Opposition.SF
## 24 care 32.97959 9.313121e-09 18 10 Opposition.SF
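If you also want the rank and group columns from your desired output, the loop above can be wrapped in a small helper. This is only a sketch, written against the same quanteda 1.5.x API used in this answer. The function name keyness_by_group and the toy corpus are made up for illustration and are not part of quanteda. On newer quanteda releases the grouping and keyness steps may need to be written differently, so treat the calls below as assumptions tied to this version.

```r
library("quanteda")  # assumes the 1.5.x API used in this answer

# Hypothetical helper (not part of quanteda): for each group in a grouped dfm,
# keep the top `n` keyness features and record their rank and group.
keyness_by_group <- function(x, n = 20) {
  results <- lapply(docnames(x), function(g) {
    # textstat_keyness() sorts by descending keyness by default
    res <- as.data.frame(head(textstat_keyness(x, target = g), n))
    res[["rank"]] <- seq_len(nrow(res))
    res[["group"]] <- g
    res
  })
  do.call(rbind, results)
}

# Toy corpus standing in for your data (made-up texts and docvars)
toy <- corpus(
  c(d1 = "tax cuts help business growth",
    d2 = "budget cuts hurt public services",
    d3 = "schools need more public funding",
    d4 = "farmers want fair market prices"),
  docvars = data.frame(
    group = c("g1", "g1", "g2", "g2"),
    interest = c("i1", "i2", "i1", "i2"),
    stringsAsFactors = FALSE
  )
)
toy_dfm <- dfm(toy, groups = c("group", "interest"))

keyness_all <- keyness_by_group(toy_dfm, n = 3)
keyness_all
```

With your own data, you would pass your grouped dfm instead of toy_dfm and set n = 20. Collecting the pieces in a list and binding them once also avoids growing the data.frame on every iteration.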
Note: I strongly discourage accessing docvars using e.g. data_trim@docvars$focus; use docvars() instead since this will always work. If we change the object structure, your code will break. (And in forthcoming v2, we make it possible to access docvars using $.)