Tags: r, glm, cross-validation, auc

Calculate AUC curve for responses in form cbind(Count_1, Count_0)


I trained a binomial model using glm(formula = cbind(Response, n - Response) ~ features, data = Xtrain, family = 'binomial'), where the response matrix (my ytrain) has one column of counts (yes) and one column of counts (no).

The test responses I've held out are in the same response-matrix form. However, the predict() function returns probabilities, one per row of data, so my predictions and observations are in different formats. I now want to use the ROCR or AUC package to generate ROC curves and compute AUC. Does anyone know how to do this?


OK, adding an example. Forgive it being meaningless/rank-deficient/small; I only want to illustrate my case.

plants <- c('Cactus', 'Tree', 'Cactus', 'Tree', 'Flower', 'Tree', 'Tree')
sun <- c('Full', 'Half', 'Half', 'Full', 'Full', 'Half', 'Full')
water <- c('N', 'Y', 'Y', 'N', 'Y', 'N', 'N')
died <- c(10, 10, 8, 2, 15, 20, 12)
didntdie <- c(2, 10, 8, 20, 10, 10, 10)
df <- data.frame(died, didntdie, plants, sun, water)
dftrain <- head(df, 5)
dftest <- tail(df, 2)
model <- glm(cbind(died, didntdie) ~ plants + sun + water, data=dftrain, family="binomial")

At this point, predict(model, dftest) returns the log-odds of death for the final two sets of features in my data frame. Now I wish to compute a ROC curve and AUC. My observations are in dftest[c('died','didntdie')]. My predictions are essentially probabilities. AUC, ROCR, etc. expect both predictions and observations to be vectors of Bernoulli responses. I can't find documentation on how to use this response matrix instead.
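
For completeness, the prediction side on its own looks roughly like this (type = "response" gives probabilities rather than the default log-odds); it's the observation side I can't get into the right shape:

p_death <- predict(model, dftest, type = "response")  # predicted probability of death, one per test row
dftest[c('died', 'didntdie')]                          # observations, still a matrix of counts

Any help appreciated.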


Solution

  • For starters, you could expand the data frame to synthesize binary outcomes, supplying the counts as case weights via the weights= argument to glm().

    obs <- died + didntdie
    df <- df[rep(1:length(obs), each= 2),] # one row for died and one for didn't
    df$survived <- rep(c(0L,1L), times=length(obs)) # create binary outcome for survival
    df$weight <- c(rbind(died, didntdie)) # assign weights
    df
    
    #     died didntdie plants  sun water survived weight
    # 1     10        2 Cactus Full     N        0     10
    # 1.1   10        2 Cactus Full     N        1      2
    # 2     10       10   Tree Half     Y        0     10
    # 2.1   10       10   Tree Half     Y        1     10
    # 3      8        8 Cactus Half     Y        0      8
    # 3.1    8        8 Cactus Half     Y        1      8
    # 4      2       20   Tree Full     N        0      2
    # 4.1    2       20   Tree Full     N        1     20
    # 5     15       10 Flower Full     Y        0     15
    # 5.1   15       10 Flower Full     Y        1     10
    # 6     20       10   Tree Half     N        0     20
    # 6.1   20       10   Tree Half     N        1     10
    # 7     12       10   Tree Full     N        0     12
    # 7.1   12       10   Tree Full     N        1     10
    
    model <- glm(survived ~ plants + sun + water, data=df, family="binomial", weights = weight)
    

    If you want to do the train/test split you will need to do another expansion, this time duplicating rows on weight. Otherwise your test set is not a random holdout, at least not one randomized at the individual-plant level, which may invalidate your results (depending on what you are trying to conclude).

    Thus you would do something like

    df <- df[rep(1:nrow(df), times = df$weight),]
    model <- glm(survived ~ plants + sun + water, data=df, family="binomial") 
    # note the model does not change
    
    library(pROC)
    auc(df$survived, model$fitted.values)
    # Area under the curve: 0.5833
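
    The question also mentions ROCR. As a sketch, the same in-sample AUC can be computed from the expanded data with ROCR's prediction()/performance() pair (the object name pred is just illustrative):

    library(ROCR)
    pred <- prediction(model$fitted.values, df$survived)  # predicted probabilities, then binary labels
    performance(pred, "auc")@y.values[[1]]                # extract the AUC value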
    

    Note this is the in-sample AUC. You should use a randomized holdout (or better yet, cross-validation) to estimate out-of-sample AUC. Using the top N rows of the data.frame for the split is not a good idea unless the row order has already been randomized.
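
    A minimal sketch of such a holdout on the fully expanded data follows; the seed and the 70/30 split are arbitrary choices, and with a toy data set this small the test fold can easily miss a factor level, so treat it as illustration only:

    set.seed(1)                                          # arbitrary seed, for reproducibility
    idx   <- sample(nrow(df), size = floor(0.7 * nrow(df)))  # random 70/30 split of expanded rows
    train <- df[idx, ]
    test  <- df[-idx, ]
    
    fit   <- glm(survived ~ plants + sun + water, data = train, family = "binomial")
    probs <- predict(fit, test, type = "response")       # predicted survival probabilities
    
    auc(test$survived, probs)                            # out-of-sample AUC (pROC, loaded above)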