
Using validation and training dataset in a RandomForest


I have a basic question about the randomForest() function in the randomForest package. I am using the RF algorithm to perform a land cover classification.

I have got some geo-spatial data which I divided into a training dataset (pks_trainingdf) and a validation dataset (pks_validationdf).

Each df contains 34 columns; the first 33 columns are the bands that I want to use for the classification; the last column ("class") contains the classes, which are supposed to be the output of the RF classification.

My question is: which dataset should be passed as x and which as xtest? Is the following line of code correct?

modelRF_5 <- randomForest(x=pks_validationdf[, c(1:33)],
                       y=pks_validationdf$class, xtest=pks_trainingdf[, c(1:33)],
                       ytest=pks_trainingdf$class, importance=TRUE)

Solution

  • x is for the training subset, while xtest is for the test or validation subset. In your case it looks like you have them reversed. The swap only makes no practical difference if both subsets are the same size, which is usually not the case. It is also important that you randomized the data set before splitting it into training and validation subsets; if you did not, you should redo the split. In addition, it is safer to split into three subsets rather than two: one for training, one for validating the model, and the last for reporting the error.
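
The advice above can be sketched roughly as follows. This is a minimal illustration, not your actual workflow: the built-in iris data stands in for your band table (so the predictors are columns 1:4 rather than 1:33), and the 60/20/20 split ratio, the seed, and the names pks_df and pks_testdf are assumptions made for the example.

```r
library(randomForest)

set.seed(42)  # assumption: fixed seed so the random split is reproducible

# Stand-in data (assumption): iris replaces the real band table.
# Predictors are in columns 1:4, the label is renamed to "class".
pks_df <- iris
names(pks_df)[5] <- "class"

# Shuffle the row indices, then cut them into 60% / 20% / 20% (assumed ratio).
n <- nrow(pks_df)
shuffled <- sample(n)
n_train <- floor(0.6 * n)
n_valid <- floor(0.2 * n)

train_idx <- shuffled[1:n_train]
valid_idx <- shuffled[(n_train + 1):(n_train + n_valid)]
test_idx  <- shuffled[(n_train + n_valid + 1):n]

pks_trainingdf   <- pks_df[train_idx, ]
pks_validationdf <- pks_df[valid_idx, ]
pks_testdf       <- pks_df[test_idx, ]  # hypothetical third subset for the final error

# Training data goes in x/y; validation data goes in xtest/ytest.
modelRF <- randomForest(x = pks_trainingdf[, 1:4],
                        y = pks_trainingdf$class,
                        xtest = pks_validationdf[, 1:4],
                        ytest = pks_validationdf$class,
                        importance = TRUE,
                        keep.forest = TRUE)  # the forest is dropped by default when xtest is given

# Confusion matrix on the validation subset, computed during fitting
print(modelRF$test$confusion)

# Error on the untouched third subset, used only for reporting
test_pred  <- predict(modelRF, newdata = pks_testdf[, 1:4])
test_error <- mean(test_pred != pks_testdf$class)
print(test_error)
```

With your own data you would replace 1:4 by 1:33 and pks_df by the full, unsplit table. Note that keep.forest = TRUE is only needed here because predict() is called afterwards on the third subset.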