I used two ways to calculate the training-set AUC of a randomForest model, but I get very different results. The two ways are as follows:
library(randomForest)   # randomForest()
library(ROCR)           # prediction() and performance()

rfmodel <- randomForest(y ~ ., data = train, importance = TRUE, ntree = 1000)
Way 1 of calculating the AUC on the training set:
rf_p_train <- predict(rfmodel, type = "prob", newdata = train)[, "yes"]   # re-predict the training rows
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
Way 2 of calculating the AUC on the training set:
rf_p_train <- as.vector(rfmodel$votes[, 2])   # out-of-bag vote fractions; column 2 is the "yes" class here
rf_pr_train <- prediction(rf_p_train, train$y)
r_auc_train <- performance(rf_pr_train, measure = "auc")@y.values[[1]]
Way 1 gives me an AUC around 1, but Way 2 gives an AUC around 0.65. I am wondering why these two results differ so much. Could anyone help me with this? I really appreciate it. As for the data, I am sorry that I am not allowed to share it here. This is my first time asking a question here, so please forgive me if anything is unclear. Thanks a lot!
OK. The second way is correct. Why? In the first way you re-predict the very observations the trees were grown on, so the model has essentially memorized them and the result is a wildly optimistic resubstitution estimate, which is why the AUC comes out near 1. In the second way, rfmodel$votes holds the out-of-bag (OOB) votes: each observation is scored only by the trees that did not see it during training, so the AUC of about 0.65 is an honest estimate of how the model will do on new data, and that is the way you should calculate it.
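To see the gap concretely, here is a minimal self-contained sketch on simulated data (the data set and variable names below are made up for illustration, not taken from the question). It computes the AUC both ways, and also notes that calling predict() with no newdata returns out-of-bag probabilities, which is the documented behaviour of predict.randomForest:

library(randomForest)
library(ROCR)

set.seed(1)
n <- 500
train <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# y depends only weakly on x1, so the honest AUC should be modest
train$y <- factor(ifelse(train$x1 + rnorm(n, sd = 2) > 0, "yes", "no"))

rfmodel <- randomForest(y ~ ., data = train, ntree = 1000)

# Way 1: re-predict the training data -> resubstitution AUC, typically close to 1
p_resub <- predict(rfmodel, newdata = train, type = "prob")[, "yes"]
performance(prediction(p_resub, train$y), measure = "auc")@y.values[[1]]

# Way 2: out-of-bag vote fractions -> honest AUC, noticeably lower
p_oob <- rfmodel$votes[, "yes"]
performance(prediction(p_oob, train$y), measure = "auc")@y.values[[1]]

# Same idea as Way 2: with no newdata, predict() returns out-of-bag
# probabilities rather than re-predicting the training rows
p_oob2 <- predict(rfmodel, type = "prob")[, "yes"]

The OOB number is the one that tracks performance on unseen data; the resubstitution number mostly reflects how well the forest memorized the training set.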