I am building a tuned random forest model for multiclass classification. I'm getting the following results Training accuracy(AUC) :0.9921996 Testing accuracy(AUC) :0.992237664 I saw a question related to this on this website and the common answer seems to be that the dataset must be small and your model got lucky But in my case I have about 300k training data points and 100k testing data points Also my classes are well balanced
> summary(train$Bucket)
0 1 TO 30 121 TO 150 151 TO 180 181 TO 365 31 TO 60 366 TO 540 541 TO 730 61 TO 90
166034 32922 4168 4070 15268 23092 8794 6927 22559
730 + 91 TO 120
20311 11222
> summary(test$Bucket)
0 1 TO 30 121 TO 150 151 TO 180 181 TO 365 31 TO 60 366 TO 540 541 TO 730 61 TO 90
55344 10974 1389 1356 5090 7698 2932 2309 7520
730 + 91 TO 120
6770 3741
Is it possible for a model to fit this well on a large testing data? Please answer if I can do something to cross verify that my model is indeed fitting really well.
My complete code
split = sample.split(Book2$Bucket,SplitRatio =0.75)
train = subset(Book2,split==T)
test = subset(Book2,split==F)
traintask <- makeClassifTask(data = train,target = "Bucket")
rf <- makeLearner("classif.randomForest")
params <- makeParamSet(makeIntegerParam("mtry",lower = 2,upper = 10),makeIntegerParam("nodesize",lower = 10,upper = 50))
#set validation strategy
rdesc <- makeResampleDesc("CV",iters=5L)
#set optimization technique
ctrl <- makeTuneControlRandom(maxit = 5L)
#start tuning
tune <- tuneParams(learner = rf ,task = traintask ,resampling = rdesc ,measures = list(acc) ,par.set = params ,control = ctrl ,show.info = T)
rf.tree <- setHyperPars(rf, par.vals = tune$x)
tune$y
r<- train(rf.tree, traintask)
getLearnerModel(r)
testtask <- makeClassifTask(data = test,target = "Bucket")
rfpred <- predict(r, testtask)
performance(rfpred, measures = list(mmce, acc))
The difference is of order 1e-4, nothing is wrong, it is a regular, statistical error (variance of the result). Nothing to worry about. This literally means that a difference is about 0.0001 * 100,000 = 10 samples ... 10 samples out of 100k.