Search code examples
rmachine-learninglogistic-regressioncross-validation

Why can't I use cv.glm on the output of bestglm?


I am trying to do best subset selection on the wine dataset, and then I want to get the test error rate using 10 fold CV. The code I used is -

cost1 <- function(good, pi=0) mean(abs(good-pi) > 0.5)
res.best.logistic <-
    bestglm(Xy = winedata,
            family = binomial,          # binomial family for logistic
            IC = "AIC",                 # Information criteria
            method = "exhaustive")
res.best.logistic$BestModels
best.cv.err<- cv.glm(winedata,res.best.logistic$BestModel,cost1, K=10)

However, this gives the error -

Error in UseMethod("family") : no applicable method for 'family' applied to an object of class "NULL"

I thought that $BestModel is the lm-object that represents the best fit, and that's what manual also says. If that's the case, then why cant I find the test error on it using 10 fold CV, with the help of cv.glm?

The dataset used is the white wine dataset from https://archive.ics.uci.edu/ml/datasets/Wine+Quality and the package used is the boot package for cv.glm, and the bestglm package.

The data was processed as -

winedata <- read.delim("winequality-white.csv", sep = ';')
winedata$quality[winedata$quality< 7] <- "0" #recode
winedata$quality[winedata$quality>=7] <- "1" #recode
winedata$quality <- factor(winedata$quality)# Convert the column to a factor
names(winedata)[names(winedata) == "quality"] <- "good"      #rename 'quality' to 'good'

Solution

  • bestglm fit rearranges your data and name your response variable as y, hence if you pass it back into cv.glm, winedata does not have a column y and everything crashes after that

    It's always good to check what is the class:

    class(res.best.logistic$BestModel)
    [1] "glm" "lm" 
    

    But if you look at the call of res.best.logistic$BestModel:

    res.best.logistic$BestModel$call
    
    glm(formula = y ~ ., family = family, data = Xi, weights = weights)
    
    head(res.best.logistic$BestModel$model)
      y fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
    1 0           7.0             0.27        0.36           20.7     0.045
    2 0           6.3             0.30        0.34            1.6     0.049
    3 0           8.1             0.28        0.40            6.9     0.050
    4 0           7.2             0.23        0.32            8.5     0.058
    5 0           7.2             0.23        0.32            8.5     0.058
    6 0           8.1             0.28        0.40            6.9     0.050
      free.sulfur.dioxide density   pH sulphates
    1                  45  1.0010 3.00      0.45
    2                  14  0.9940 3.30      0.49
    3                  30  0.9951 3.26      0.44
    4                  47  0.9956 3.19      0.40
    5                  47  0.9956 3.19      0.40
    6                  30  0.9951 3.26      0.44
    

    You can substitute things in the call etc, but it's too much of a mess. Fitting is not costly, so make a fit on winedata and pass it to cv.glm:

    best_var = apply(res.best.logistic$BestModels[,-ncol(winedata)],1,which)
    # take the variable names for best model
    best_var = names(best_var[[1]])
    new_form = as.formula(paste("good ~", paste(best_var,collapse="+")))
    fit = glm(new_form,winedata,family="binomial")
    
    best.cv.err<- cv.glm(winedata,fit,cost1, K=10)