r statistics probability naivebayes

How to use LOOCV to find a subset that classifies better than full set in R


I am working with the wbca data from the faraway package. The prior probability of sampling a malignant tumor is π0 = 1/3 and the prior probability for sampling a benign tumor is π1 = 2/3.

I am trying to use the naive Bayes classifier with multinomials to see if there is a good subset of the 9 features that classifies better than the full set using LOOCV.

I am unsure where to even begin with this, so any help with R code would be great. Thanks!


Solution

  • You can try something like the code below, which uses recursive feature elimination (rfe) from caret. The kernel density estimate it uses for your predictors might not be the most accurate, but it's something you can start with:

    library(faraway)
    library(naivebayes)
    library(caret)
    
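    # Predictors: every column except the response
    x = wbca[,!grepl("Class",colnames(wbca))]
    y = factor(wbca$Class)
    
    # Subset sizes to evaluate; rfe() also evaluates the full set of 9
    subsets <- 2:8
    
    # Naive Bayes helper functions, resampled with leave-one-out CV
    ctrl <- rfeControl(functions = nbFuncs,
                       method = "LOOCV")
    
    bayesProfile <- rfe(x, y,
                     sizes = subsets,
                     rfeControl = ctrl)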
    
    bayesProfile
    
    Recursive feature selection
    
    Outer resampling method: Leave-One-Out Cross-Validation 
    
    Resampling performance over subset size:
    
     Variables Accuracy  Kappa Selected
             2   0.9501 0.8891         
             3   0.9648 0.9225         
             4   0.9648 0.9223         
             5   0.9677 0.9290         
             6   0.9750 0.9454        *
             7   0.9692 0.9322         
             8   0.9750 0.9455         
             9   0.9662 0.9255         
    
    The top 5 variables (out of 6):
       USize, UShap, BNucl, Chrom, Epith
    

    You can get the optimal variables:

    bayesProfile$optVariables
    [1] "USize" "UShap" "BNucl" "Chrom" "Epith" "Thick"
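
The answer above relies on a kernel density estimate, while the question asks for multinomial likelihoods with fixed priors. As a rough alternative, here is a minimal base-R sketch of that variant. The function names `loocv_accuracy` and `best_subset`, the Laplace smoothing constant `alpha`, and the assumption that every feature is an integer score from 1 to 10 are illustrative choices, not part of the original answer. It treats each feature as a categorical variable, scores each held-out row with counts from the remaining rows, and then searches all feature subsets exhaustively.

```r
# LOOCV accuracy of a categorical (multinomial) naive Bayes with fixed priors.
# x: data frame of integer features in 1..n_levels (assumed); y: class labels;
# prior: named vector of class priors; alpha: Laplace smoothing (assumed).
loocv_accuracy <- function(x, y, prior, n_levels = 10, alpha = 1) {
  y <- as.character(y)
  classes <- names(prior)
  n <- nrow(x)
  # log_post[i, k]: unnormalised log posterior of class k for held-out row i
  log_post <- matrix(log(prior), nrow = n, ncol = length(classes),
                     byrow = TRUE, dimnames = list(NULL, classes))
  for (k in classes) {
    own <- as.integer(y == k)   # 1 where row i itself belongs to class k
    n_k <- sum(own) - own       # class-k training size with row i left out
    for (j in seq_along(x)) {
      v <- x[[j]]
      counts <- tabulate(v[y == k], nbins = n_levels)
      loo <- counts[v] - own    # count of row i's value in class k, without row i
      log_post[, k] <- log_post[, k] +
        log((loo + alpha) / (n_k + alpha * n_levels))
    }
  }
  pred <- classes[max.col(log_post, ties.method = "first")]
  mean(pred == y)
}

# Exhaustive search: return a subset only if it beats the full feature set.
best_subset <- function(x, y, prior, ...) {
  p <- ncol(x)
  best <- list(vars = names(x), acc = loocv_accuracy(x, y, prior, ...))
  for (m in seq_len(2^p - 2)) {  # each non-empty proper subset as a bit mask
    keep <- (m %/% 2^(seq_len(p) - 1)) %% 2 == 1
    acc <- loocv_accuracy(x[, keep, drop = FALSE], y, prior, ...)
    if (acc > best$acc) best <- list(vars = names(x)[keep], acc = acc)
  }
  best
}

# Usage with wbca, assuming Class 0 = malignant and Class 1 = benign
if (requireNamespace("faraway", quietly = TRUE)) {
  data(wbca, package = "faraway")
  x <- wbca[, setdiff(names(wbca), "Class")]
  prior <- c("0" = 1/3, "1" = 2/3)
  print(loocv_accuracy(x, wbca$Class, prior))
  print(best_subset(x, wbca$Class, prior))
}
```

With 9 features the exhaustive search covers 511 subsets, which is feasible here because the leave-one-out counts are obtained by subtracting the held-out row from the full-data counts rather than refitting the model for every row. Note that picking the subset with the best LOOCV accuracy makes that accuracy an optimistic estimate of out-of-sample performance.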