Tags: r, dictionary, text-mining, corpus, quanteda

Quanteda: how to create identically-featured dfms from a list of words


I ran a random forest on an n-gram matrix of articles because I want to classify them into 2 categories. As a result of the RF I received a list of important variables.

Now I would like to run the random forest on only the top n selected features and then use those same features to predict the classification of new documents. For that I need to create a dfm containing only the most important variables (from the RF). How can I create a dictionary from a list of those important variables?

Here is the relevant part of the code. After creating the dictionary I have only one entry in it. How do I create it properly?

forestModel <-
  randomForest(x = as.matrix(myStemMat), y = as.factor(classVect),
               ntree = 1000)

impVariables <-
  data.frame(important = as.matrix(importance(forestModel)))

impVariables <-
  impVariables %>% mutate(impVar = row.names(impVariables)) %>% 
  arrange(desc(MeanDecreaseGini)) %>% 
  top_n(1000, wt = MeanDecreaseGini) %>% 
  select(impVar) %>% as.list() %>% dictionary()

myStemMat <-
  dfm(
    mycorpus,
    dictionary=impVariables,
    #   remove = stopwordsPL,
    stem = TRUE,
    remove_punct = TRUE,
    ngrams=c(1,2)
  )

In brief, when I have a list of strings (words, n-grams), how can I create a dictionary so that I can use it in the dfm() function to generate a term matrix?

Here is a link to the complete code (a reproducible example) and the data it uses: https://www.dropbox.com/s/3oe1tcfcauer0wf/text_data.zip?dl=0


Solution

  • You should read ?dictionary carefully, since a dictionary is not designed for feature selection (although it can be used that way), but rather to create equivalence classes among the values assigned to dictionary keys.
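
    To make the distinction concrete, here is a minimal sketch (the dictionary and text are made up purely for illustration) of what dictionary() is designed for: several values collapsed into one key.

    # a toy sentiment dictionary: each key is an equivalence class of values
    sentDict <- dictionary(list(positive = c("good", "great"),
                                negative = c("bad", "awful")))
    # counts of "good" and "great" are summed under the single feature "positive"
    dfm("a good, even great, but ultimately awful day", dictionary = sentDict)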

    If your impVariables is a character vector of features, then you should be able to use these commands to perform the selection you want:

    toks <- 
        tokens(mycorpus, remove_punct = TRUE) %>%        # tokenise, dropping punctuation
        tokens_select(impVariables, padding = TRUE) %>%  # keep only the important features
        tokens_wordstem() %>%                            # stem the retained tokens
        tokens_ngrams(n = 1:2)                           # form unigrams and bigrams
    
    dfm(toks)
    

    where the last command produces a document-feature matrix of just the stemmed n-gram features built from the tokens selected as top features in your random forest model. Note that padding = TRUE prevents n-grams from forming between tokens that were never adjacent in your original text. If you don't care about that, set it to FALSE (the default).
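
    To illustrate the effect of padding, here is a minimal sketch; the text and the selected tokens are invented for the example:

    toy <- tokens("the quick brown fox jumps")
    # with padding, the removed tokens leave gaps, so no bigram spans them
    tokens_ngrams(tokens_select(toy, c("quick", "fox"), padding = TRUE), n = 2)
    # without padding, "quick" and "fox" become adjacent and "quick_fox" is formed
    tokens_ngrams(tokens_select(toy, c("quick", "fox"), padding = FALSE), n = 2)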

    ADDED:

    To select the columns of the dfm using a character vector of selection words, here are two methods we can use.

    We will work with these sample objects:

    # two sample texts and their dfm representations
    txt1 <- c(d1 = "a b c f g h",
              d2 = "a a c c d f f f")
    txt2 <- c(d1 = "c c d f g h",
              d2 = "b b d i j")
    (dfm1 <- dfm(txt1))
    # Document-feature matrix of: 2 documents, 7 features (28.6% sparse).
    # 2 x 7 sparse Matrix of class "dfmSparse"
    #     features
    # docs a b c f g h d
    #   d1 1 1 1 1 1 1 0
    #   d2 2 0 2 3 0 0 1
    
    (dfm2 <- dfm(txt2))
    # Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
    # 2 x 8 sparse Matrix of class "dfmSparse"
    #     features
    # docs c d f g h b i j
    #   d1 2 1 1 1 1 0 0 0
    #   d2 0 1 0 0 0 2 1 1
    
    impVariables <- c("a", "c", "e", "z")
    

    First Method: Create a dfm and select on that using dfm_select()

    Here, we create a dfm from the character vector of your features, just so that they are registered as features, because of the way that dfm_select() works when the selection object is itself a dfm.

    impVariablesDfm <- dfm(paste(impVariables, collapse = " "))
    dfm_select(dfm1, impVariablesDfm)
    # Document-feature matrix of: 2 documents, 4 features (50% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 1 1 0 0
    #   d2 2 2 0 0
    
    dfm_select(dfm2, impVariablesDfm)
    # Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 0 2 0 0
    #   d2 0 0 0 0
    

    Second Method: Create a dictionary and select on that using dfm_lookup()

    Let's create a helper function to create a dictionary from a character vector:

    # make a dictionary where each key = its value
    char2dictionary <- function(x) {
        result <- as.list(x)  # make the vector into a list
        names(result) <- x
        dictionary(result)
    }
    

    Now, using dfm_lookup(), we get only the keys, even ones that were not observed:

    dfm_lookup(dfm1, dictionary = char2dictionary(impVariables))
    # Document-feature matrix of: 2 documents, 4 features (50% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 1 1 0 0
    #   d2 2 2 0 0
    
    dfm_lookup(dfm2, dictionary = char2dictionary(impVariables))
    # Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
    # 2 x 4 sparse Matrix of class "dfmSparse"
    #     features
    # docs a c e z
    #   d1 0 2 0 0
    #   d2 0 0 0 0
    

    Note: this was run with the quanteda version shown below (but the first method at least will also work with v0.9.9.65):

    packageVersion("quanteda")
    # [1] ‘0.9.9.85’