I ran a random forest on an n-gram matrix of articles because I would like to classify them into 2 categories. As a result of the RF, I received a list of important variables.
Now I would like to run the random forest only on the first n selected features and then use the same features to classify new articles. For that I need to create a dfm containing only the most important variables (from the RF). How can I create a dictionary from a list of those important variables?
Below is the relevant part of the code. After creating a dictionary I have only one entry in it. How do I create it properly?
forestModel <-
randomForest(x = as.matrix(myStemMat),y=as.factor(classVect),
ntree = 1000 )
impVariables <-
data.frame(important = as.matrix(importance(forestModel)))
impVariables <-
impVariables %>% mutate(impVar = row.names(impVariables)) %>%
arrange(desc(MeanDecreaseGini)) %>%
top_n(1000, wt = MeanDecreaseGini) %>%
select(impVar) %>% as.list() %>% dictionary()
myStemMat <-
# remove = stopwordsPL,
stem = TRUE,
remove_punct = TRUE,
In brief, when I have a list of strings (words, n-grams), how can I create a dictionary so that I can use it in the dfm() function to generate a term matrix?
Here is a link to the complete code (a reproducible example) and the data it uses: https://www.dropbox.com/s/3oe1tcfcauer0wf/text_data.zip?dl=0
You should read ?dictionary carefully, since a dictionary is not designed as a tool for feature selection (although it can be used that way), but rather to create equivalence classes among the values assigned to dictionary keys.
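The single entry reported in the question is consistent with this: select(impVar) returns a one-column data frame, and as.list() on a data frame gives one list element per column, so dictionary() would see a single key holding all the features as its values. A minimal base-R sketch of that behaviour follows; the feature strings are made-up placeholders, not values from the original data.

```r
# Hypothetical stand-ins for the top features returned by the random forest
impVar <- c("econom", "polit", "tax_cut")

# A one-column data frame, like the result of select(impVar)
df <- data.frame(impVar = impVar, stringsAsFactors = FALSE)

# as.list() on a data frame returns one element per COLUMN, not per row,
# so this list has a single element named "impVar" holding all features
oneKey <- as.list(df)
length(oneKey)
names(oneKey)

# One element per feature instead, each named after its own value
perFeature <- setNames(as.list(df$impVar), df$impVar)
length(perFeature)
names(perFeature)
```

A list shaped like perFeature is what a key-per-feature dictionary would need as input.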
If your impVariables is a character vector of features, then you should be able to use these commands to perform the selection you want:
toks <-
tokens(mycorpus, remove_punct = TRUE) %>%
tokens_select(impVariables, padding = TRUE) %>%
tokens_wordstem() %>%
tokens_ngrams(n = 1:2)
Converting the result with dfm(toks) then produces a document-feature matrix of just the stemmed n-gram features that were selected as the top features by your random forest model. Note that padding = TRUE will prevent n-grams from forming between tokens that were never adjacent in your original text. If you don't care about that, set it to FALSE (the default).
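To see why padding matters, here is a base-R sketch of the idea only. It is not quanteda's implementation, and select_tokens() and bigrams() are hypothetical helpers written for this illustration: removed tokens are replaced by empty strings, and bigrams are formed only between adjacent non-empty tokens.

```r
# Hypothetical helper: keep only the selected tokens; with padding = TRUE,
# leave an empty string where a token was removed
select_tokens <- function(tokens, keep, padding = FALSE) {
    if (padding) {
        ifelse(tokens %in% keep, tokens, "")
    } else {
        tokens[tokens %in% keep]
    }
}

# Hypothetical helper: build bigrams from adjacent tokens, skipping any
# pair that touches a pad (an empty string)
bigrams <- function(tokens) {
    if (length(tokens) < 2) return(character(0))
    left <- head(tokens, -1)
    right <- tail(tokens, -1)
    ok <- left != "" & right != ""
    paste(left[ok], right[ok], sep = "_")
}

toks <- c("a", "b", "c", "d")
keep <- c("a", "c", "d")

noPad <- bigrams(select_tokens(toks, keep, padding = FALSE))  # "a_c" "c_d"
withPad <- bigrams(select_tokens(toks, keep, padding = TRUE)) # "c_d"
```

Without padding, "a" and "c" become neighbours once "b" is dropped, so the bigram "a_c" appears even though it never occurred in the text. With padding, the gap left by "b" blocks it.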
To select the columns of the dfm from a character vector of selection words, here are two methods we can use.
We will work with these sample objects:
# two sample texts and their dfm representations
txt1 <- c(d1 = "a b c f g h",
d2 = "a a c c d f f f")
txt2 <- c(d1 = "c c d f g h",
d2 = "b b d i j")
(dfm1 <- dfm(txt1))
# Document-feature matrix of: 2 documents, 7 features (28.6% sparse).
# 2 x 7 sparse Matrix of class "dfmSparse"
# features
# docs a b c f g h d
# d1 1 1 1 1 1 1 0
# d2 2 0 2 3 0 0 1
(dfm2 <- dfm(txt2))
# Document-feature matrix of: 2 documents, 8 features (43.8% sparse).
# 2 x 8 sparse Matrix of class "dfmSparse"
# features
# docs c d f g h b i j
# d1 2 1 1 1 1 0 0 0
# d2 0 1 0 0 0 2 1 1
impVariables <- c("a", "c", "e", "z")
First Method: Create a dfm and select on that using dfm_select()
Here, we create a dfm from the character vector of your features, just so that they are registered as features, because of the way that dfm_select() works when the selection object is a dfm.
impVariablesDfm <- dfm(paste(impVariables, collapse = " "))
dfm_select(dfm1, impVariablesDfm)
# Document-feature matrix of: 2 documents, 4 features (50% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 1 1 0 0
# d2 2 2 0 0
dfm_select(dfm2, impVariablesDfm)
# Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 0 2 0 0
# d2 0 0 0 0
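The reason this matters for the random forest is that both results have exactly the same columns, in the same order, with zeros for features that a text never contains. As a rough base-R sketch of that alignment on ordinary matrices (align_features() is a hypothetical helper, not a quanteda function, and mat1 simply copies the counts of dfm1 above):

```r
# Hypothetical helper: return a matrix with exactly the columns in
# `features`, filling zeros for features that `m` does not contain
align_features <- function(m, features) {
    out <- matrix(0, nrow = nrow(m), ncol = length(features),
                  dimnames = list(rownames(m), features))
    shared <- intersect(colnames(m), features)
    out[, shared] <- m[, shared, drop = FALSE]
    out
}

# The counts of dfm1 from above, as a plain matrix
mat1 <- matrix(c(1, 1, 1, 1, 1, 1, 0,
                 2, 0, 2, 3, 0, 0, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("d1", "d2"),
                               c("a", "b", "c", "f", "g", "h", "d")))

impVariables <- c("a", "c", "e", "z")
sel1 <- align_features(mat1, impVariables)
sel1
#    a c e z
# d1 1 1 0 0
# d2 2 2 0 0
```

A model trained on a matrix with these columns can then be given new data aligned to the same impVariables vector.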
Second Method: Create a dictionary and select on that using dfm_lookup()
Let's create a helper function to create a dictionary from a character vector:
# make a dictionary where each key = its value
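char2dictionary <- function(x) {
    result <- as.list(x)  # make the vector into a list
    names(result) <- x    # name each element after its own value
    dictionary(result)    # return a dictionary with one key per feature
}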
Now, using dfm_lookup(), we get only the keys, even the ones that were not observed:
dfm_lookup(dfm1, dictionary = char2dictionary(impVariables))
# Document-feature matrix of: 2 documents, 4 features (50% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 1 1 0 0
# d2 2 2 0 0
dfm_lookup(dfm2, dictionary = char2dictionary(impVariables))
# Document-feature matrix of: 2 documents, 4 features (87.5% sparse).
# 2 x 4 sparse Matrix of class "dfmSparse"
# features
# docs a c e z
# d1 0 2 0 0
# d2 0 0 0 0
Note: the first method, at least, will work with quanteda v0.9.9.65.