Easily specify which dummy variables to use in a random forest with many dummy variables [R]


I apologize in advance that this is such a simple question, but I've been having a very hard time figuring it out with Google and Stack Exchange searches.

I have a dataset which I'd like to run a random forest on. Some of the variables are factors with more than 32 levels, so I've converted them to dummy variables in order to run a random forest. The problem is that this has left me with 1000+ variables. I would like to use most of them in my random forest, but not all.

My random forest code would look like this, except with waaaay too many dummy variables for me to reasonably list by hand.

fit <- randomForest(result ~ dummy_1 + dummy_2 + dummy_3..., data=df, importance=TRUE, ntree=2000)

Essentially, my question is whether there is a way to specify large ranges of columns in a random forest without listing them by name. I have tried running model.matrix within the random forest command, and tried specifying a range of columns using df[1:34,] etc., but neither of these methods has worked.

Thank you in advance!

Edit: I suppose just dropping the columns and making a new dataframe could work, but is there a good alternative?


Solution

  • You can exclude variables by changing what's delivered to the function in the data argument.

    exclude_cols <- c('dummy_48','dummy_50','other_var_to_be_dropped')
    fit <- randomForest(result ~ ., 
                        data=df[ !names(df) %in% exclude_cols ] , 
                        importance=TRUE, ntree=2000)
    

    Note that the subset argument of randomForest only selects rows, so it cannot be used to drop columns.
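
The same idea can be turned around to select the columns you want instead of excluding the ones you don't. The sketch below is one possible approach, not the original answer's method. It uses a toy data frame with hypothetical column names (result, dummy_1 to dummy_3, other), picks the predictors by name pattern or by position, and builds the formula with reformulate() from base R. The fit itself is only attempted if the randomForest package is installed.

```r
# Toy data frame standing in for the real one (all column names are hypothetical).
set.seed(1)
df <- data.frame(
  result  = factor(sample(c("a", "b"), 20, replace = TRUE)),
  dummy_1 = rbinom(20, 1, 0.5),
  dummy_2 = rbinom(20, 1, 0.5),
  dummy_3 = rbinom(20, 1, 0.5),
  other   = rnorm(20)
)

# Pick predictors by name pattern instead of listing them by hand.
keep <- grep("^dummy_", names(df), value = TRUE)

# Predictors can also be picked by position. Note the comma placement:
# df[, 2:4] selects columns, whereas df[2:4, ] selects rows.
keep_by_pos <- names(df)[2:4]

# Build the formula result ~ dummy_1 + dummy_2 + dummy_3 programmatically.
f <- reformulate(keep, response = "result")

# Fit only if the randomForest package is available (small ntree for the toy data).
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(f, data = df, importance = TRUE, ntree = 50)
}
```

With 1000+ dummy columns, a pattern such as "^dummy_" is usually easier to maintain than positions, since positions shift whenever columns are added or dropped.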