Tags: machine-learning, scikit-learn, random-forest, auc

Testing accuracy higher than training accuracy


I am building a tuned random forest model for multiclass classification and I am getting the following results:

Training accuracy (AUC): 0.9921996
Testing accuracy (AUC): 0.992237664

I saw a related question on this site, and the common answer seems to be that the dataset must be small and the model simply got lucky. But in my case I have about 300k training data points and 100k testing data points, and my classes are well balanced:

> summary(train$Bucket)
         0   1 TO  30 121 TO 150 151 TO 180 181 TO 365   31 TO 60 366 TO 540 541 TO 730   61 TO 90 
    166034      32922       4168       4070      15268      23092       8794       6927      22559 
     730 +  91 TO 120 
     20311      11222 
> summary(test$Bucket)
         0   1 TO  30 121 TO 150 151 TO 180 181 TO 365   31 TO 60 366 TO 540 541 TO 730   61 TO 90 
     55344      10974       1389       1356       5090       7698       2932       2309       7520 
     730 +  91 TO 120 
      6770       3741 

Is it possible for a model to fit this well on such a large test set? Is there anything I can do to cross-verify that my model really is fitting this well?

My complete code:

library(caTools)   # sample.split
library(mlr)       # makeClassifTask, makeLearner, tuneParams, train, predict

# 75/25 split, stratified on the target
split = sample.split(Book2$Bucket, SplitRatio = 0.75)
train = subset(Book2, split == TRUE)
test  = subset(Book2, split == FALSE)

traintask <- makeClassifTask(data = train, target = "Bucket")
rf <- makeLearner("classif.randomForest")

# hyperparameter search space
params <- makeParamSet(makeIntegerParam("mtry", lower = 2, upper = 10),
                       makeIntegerParam("nodesize", lower = 10, upper = 50))

# set validation strategy
rdesc <- makeResampleDesc("CV", iters = 5L)

# set optimization technique
ctrl <- makeTuneControlRandom(maxit = 5L)

# start tuning
tune <- tuneParams(learner = rf, task = traintask, resampling = rdesc,
                   measures = list(acc), par.set = params, control = ctrl,
                   show.info = TRUE)

rf.tree <- setHyperPars(rf, par.vals = tune$x)
tune$y

# train the tuned learner on the full training set
r <- train(rf.tree, traintask)
getLearnerModel(r)

# evaluate on the held-out test set
testtask <- makeClassifTask(data = test, target = "Bucket")

rfpred <- predict(r, testtask)
performance(rfpred, measures = list(mmce, acc))

Solution

  • The difference is on the order of 1e-4; nothing is wrong, it is ordinary statistical error (variance of the result) and nothing to worry about. It literally means the difference amounts to about 0.0001 * 100,000 = 10 samples, i.e. 10 samples out of 100k.
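  • For intuition, here is a rough back-of-the-envelope check (not part of the original answer): treating the score as a proportion estimated from roughly 100k independent test points, the standard error of the estimate is about 0.0003, so a train/test gap of a few 1e-5 to 1e-4 is well within normal sampling noise. The figures below are taken from the question; the calculation is only an approximation, since AUC is not exactly a binomial proportion.

# back-of-the-envelope sketch: treat the score like a proportion estimated
# from n_test independent test points (an approximation, not the exact
# sampling distribution of AUC)
p <- 0.9921996      # training figure reported in the question
n_test <- 100000    # approximate test set size

se <- sqrt(p * (1 - p) / n_test)    # standard error of the estimate
se                                  # ~0.00028

# approximate 95% interval around the training figure
c(p - 1.96 * se, p + 1.96 * se)

# observed train/test gap from the question
0.992237664 - 0.9921996             # ~3.8e-5, well inside the interval

If you want to cross-verify further, you can repeat the evaluation on several random subsets of the test set (or on repeated train/test splits) and check that the spread of the scores is of the same magnitude as the standard error above.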