I have a basic question about the use of the randomForest function in the randomForest package.
I am using the RF algorithm to perform a land cover classification. I have some geospatial data, which I divided into a training dataset (pks_trainingdf) and a validation dataset (pks_validationdf). Each data frame contains 34 columns: the first 33 columns are the bands that I want to use for the classification, and the last column ("class") contains the classes, which are supposed to be the output of the RF classification.
My question is: which dataset should be passed as x and which as xtest?
Is the following line of code correct?
modelRF_5 <- randomForest(x=pks_validationdf[, c(1:33)],
y=pks_validationdf$class, xtest=pks_trainingdf[, c(1:33)],
ytest=pks_trainingdf$class, importance=TRUE)
x is for the training subset, while xtest is for the test or validation subset (likewise y and ytest). In your case it looks like you have them inverted. The swap would only be unimportant if both subsets had the same size, which is usually not the case. It is also important that you randomized the dataset before splitting it into training and validation subsets; if you did not, you should redo the split. In addition, it is safer to split into three subsets rather than two: one for training, one for validating the model, and one for reporting the error.
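A minimal sketch of the points above, not a definitive recipe. It assumes a stand-in dataset (iris, with Species renamed to "class") in place of your 33-band data frame, and a 60/20/20 split chosen purely for illustration; the names full_df, train_df, valid_df, test_df and model_rf are hypothetical. Note that randomForest drops the forest by default when xtest is supplied, so keep.forest=TRUE is set here to allow predict() on the third subset.

```r
library(randomForest)

set.seed(42)  # make the random shuffle reproducible

# Stand-in data (assumption): iris replaces the 33-band data frame,
# with Species playing the role of the "class" column.
full_df <- iris
names(full_df)[5] <- "class"
n_bands <- ncol(full_df) - 1  # would be 33 for the data in the question

# Shuffle the row indices, then cut them into 60% / 20% / 20%.
idx <- sample(nrow(full_df))
n_train <- floor(0.6 * nrow(full_df))
n_valid <- floor(0.2 * nrow(full_df))

train_df <- full_df[idx[1:n_train], ]
valid_df <- full_df[idx[(n_train + 1):(n_train + n_valid)], ]
test_df  <- full_df[idx[(n_train + n_valid + 1):nrow(full_df)], ]

# x / y receive the training subset; xtest / ytest receive the validation subset.
# With the data frames from the question, the call would read:
#   randomForest(x = pks_trainingdf[, 1:33], y = pks_trainingdf$class,
#                xtest = pks_validationdf[, 1:33], ytest = pks_validationdf$class,
#                importance = TRUE)
model_rf <- randomForest(x = train_df[, 1:n_bands], y = train_df$class,
                         xtest = valid_df[, 1:n_bands], ytest = valid_df$class,
                         importance = TRUE, keep.forest = TRUE)

# Report the final error on the third subset, which the model has not seen.
test_pred  <- predict(model_rf, newdata = test_df[, 1:n_bands])
test_error <- mean(test_pred != test_df$class)
```

The validation error per tree count is then available in model_rf$test$err.rate, while test_error is the figure to report.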