I apologize in advance that this is such a simple question, but I've been having a very hard time figuring it out with google and stack exchange searches.
I have a dataset which I'd like to run a random forest on. Some of the variables are factors with more than 32 levels, so I've converted them to dummy variables in order to run a random forest. The problem is that this has left me with 1000+ variables, not all of which I want to use in my random forest, though most of which I would like to use.
My random forest code would look like this, except with waaaay too many dummy variables for me to reasonably list by hand.
fit <- randomForest(result ~ dummy_1 + dummy_2 + dummy_3..., data=df, importance=TRUE, ntree=2000)
Essentially my question is whether there is a way to specify large ranges of columns in a random forest without listing them by name. I have tried running model.matrix within the random forest command, and tried specifying a range of columns using df[1:34,] etc., but neither of these methods has worked.
Thank you in advance!
Edit: I suppose just dropping the columns and making a new dataframe could work, but is there a better alternative?
You can exclude variables by changing what's delivered to the function in the data argument.
exclude_cols <- c('dummy_48', 'dummy_50', 'other_var_to_be_dropped')
fit <- randomForest(result ~ .,
                    data = df[, !names(df) %in% exclude_cols],
                    importance = TRUE, ntree = 2000)
The subset argument to this function only works on a row basis, which is why df[1:34,] selected rows rather than the columns you wanted.
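If you'd rather select predictors by position than by name, randomForest also has an x/y interface that takes a data frame of predictors and a response vector directly, so you can pass numeric column ranges. A minimal sketch (the column ranges here are hypothetical placeholders for wherever your dummy columns actually sit):

```r
library(randomForest)

# Hypothetical example: keep columns 2-35 and 40-60 as predictors.
# Adjust these ranges to match the positions of your dummy columns.
keep_cols <- c(2:35, 40:60)

fit <- randomForest(x = df[, keep_cols],  # predictors selected by position
                    y = df$result,        # response vector
                    importance = TRUE, ntree = 2000)
```

Note that with the x/y interface, y must be a factor for classification; otherwise randomForest will run a regression.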