Tags: r, dictionary, text-mining, corpus, quanteda

Quanteda: how to create identically-featured dfms from a list of words


I ran a random forest on an n-gram matrix of articles because I want to classify them into 2 categories. As a result of the RF I received a list of important variables.

Now I would like to run the random forest on only the top n selected features and then use those same features to predict the classification of new documents. For that I need to create a dfm containing only the most important variables (from the RF). How can I create a dictionary from a list of those important variables?

Here is the relevant part of the code. After creating the dictionary I have only one entry in it. How do I create it properly?

forestModel <-
  randomForest(x = as.matrix(myStemMat), y = as.factor(classVect),
               ntree = 1000)

impVariables <-
  data.frame(important = as.matrix(importance(forestModel)))

impVariables <-
  impVariables %>% mutate(impVar = row.names(impVariables)) %>% 
  arrange(desc(MeanDecreaseGini)) %>% 
  top_n(1000, wt = MeanDecreaseGini) %>% 
  select(impVar) %>% as.list() %>% dictionary()

myStemMat <-
  dfm(
    mycorpus,
    dictionary=impVariables,
    #   remove = stopwordsPL,
    stem = TRUE,
    remove_punct = TRUE,
    ngrams=c(1,2)
  )

In brief, when I have a list of strings (words, n-grams), how can I create a dictionary so that I can use it in the dfm() function to generate a term matrix?

Here is a link to the complete code (a reproducible example) and the data it uses: https://www.dropbox.com/s/3oe1tcfcauer0wf/text_data.zip?dl=0


Solution

  • You should read ?dictionary carefully, since a dictionary is not designed for feature selection (although it can be used that way), but rather to create equivalence classes among the values assigned to dictionary keys.
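
    To make the distinction concrete, here is a minimal sketch (the dictionary and text are made up purely for illustration) of what dictionary() is designed for: several values collapsed into one key.

    # a toy sentiment dictionary: each key is an equivalence class of values
    sentDict <- dictionary(list(positive = c("good", "great"),
                                negative = c("bad", "awful")))
    # counts of "good" and "great" are summed under the single feature "positive"
    dfm("a good, even great, but ultimately awful day", dictionary = sentDict)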

    If your impVariables is a character vector of features, then you should be able to use these commands to perform the selection you want:

    toks <- 
        tokens(mycorpus, remove_punct = TRUE) %>%        # tokenise, dropping punctuation
        tokens_select(impVariables, padding = TRUE) %>%  # keep only the important features
        tokens_wordstem() %>%                            # stem the retained tokens
        tokens_ngrams(n = 1:2)                           # form unigrams and bigrams
    
    dfm(toks)
    

    where the last command produces a document-feature matrix of just the stemmed n-gram features built from the tokens selected as top features in your random forest model. Note that padding = TRUE prevents n-grams from forming between tokens that were never adjacent in your original text. If you don't care about that, set it to FALSE (the default).
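
    To illustrate the effect of padding, here is a minimal sketch; the text and the selected tokens are invented for the example:

    toy <- tokens("the quick brown fox jumps")
    # with padding, the removed tokens leave gaps, so no bigram spans them
    tokens_ngrams(tokens_select(toy, c("quick", "fox"), padding = TRUE), n = 2)
    # without padding, "quick" and "fox" become adjacent and "quick_fox" is formed
    tokens_ngrams(tokens_select(toy, c("quick", "fox"), padding = FALSE), n = 2)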

    ADDED:

    To select the columns of the dfm using a character vector of selection words, here are two methods we can use.

    We will work with these sample objects:

    # two sample texts and their dfm representations
    txt1 <- c(d1 = "a b c f g h",
              d2 = "a a c c d f f f")
    txt2 <- c(d1 = "c c d f g h",
              d2 = "b b d i j")
    (dfm1 <- dfm(txt1))
    # Document-feature matrix of: 2 documents, 7 features (28.6% sparse).
    # 2 x 7 sparse Matrix of class "dfmSparse"
    #     features
    # docs a b c f g h d
    #   d1 1 1 1 1 1 1 0
    #   d2 2 0 2 3 0 0 1
    
    (dfm2 <- dfm(txt2))
    # Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
    # 2 x 8 sparse Matrix of class "dfmSparse"
    #     features
    # docs c d f g h b i j
    #   d1 2 1 1 1 1 0 0 0
    #   d2 0 1 0 0 0 2 1 1
    
    impVariables <- c("a", "c", "e", "z")
    

    First Method: Create a dfm and select on that using dfm_select()

    Here, we create a dfm from the character vector of your features, just so that they are registered as features, because of the way that dfm_select() works when the selection object is itself a dfm.

    impVariablesDfm <- dfm(paste(impVariables, collapse = " "))
    dfm_select(dfm1, impVariablesDfm)
    # Document-feature matrix of: 2 documents, 4 features (50% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 1 1 0 0
    #   d2 2 2 0 0
    
    dfm_select(dfm2, impVariablesDfm)
    # Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 0 2 0 0
    #   d2 0 0 0 0
    

    Second Method: Create a dictionary and select on that using dfm_lookup()

    Let's create a helper function to create a dictionary from a character vector:

    # make a dictionary where each key = its value
    char2dictionary <- function(x) {
        result <- as.list(x)  # make the vector into a list
        names(result) <- x
        dictionary(result)
    }
    

    Now, using dfm_lookup(), we get only the keys, even ones that were not observed:

    dfm_lookup(dfm1, dictionary = char2dictionary(impVariables))
    # Document-feature matrix of: 2 documents, 4 features (50% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 1 1 0 0
    #   d2 2 2 0 0
    
    dfm_lookup(dfm2, dictionary = char2dictionary(impVariables))
    # Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 0 2 0 0
    #   d2 0 0 0 0
    

    Note: this was run with the quanteda version shown below (but the first method at least will also work with v0.9.9.65):

    packageVersion("quanteda")
    # [1] ‘0.9.9.85’