I trained a binomial model using glm(Xtrain, ytrain, formula='cbind(Response, n - Response) ~ features', family='binomial')
, where ytrain is a response matrix with columns of counts (yes), counts (no).
The test responses I've held out are in the same form of response matrix. However, the predict() function returns probabilities -- one for each row of the training data. I now want to use the ROCR or AUC package to generate AUC curves, but my prediction and observations are in different formats. Does anyone know how to do this?
OK. Adding an example. Forgive it being meaningless/rank deficient/small, I only want to illustrate my case.
plants <- c('Cactus', 'Tree', 'Cactus', 'Tree', 'Flower', 'Tree', 'Tree')
sun <- c('Full', 'Half', 'Half', 'Full', 'Full', 'Half', 'Full')
water <- c('N', 'Y', 'Y', 'N', 'Y', 'N', 'N')
died <- c(10, 10, 8, 2, 15, 20, 12)
didntdie <- c(2, 10, 8, 20, 10, 10, 10)
df <- data.frame(died, didntdie, plants, sun, water)
dftrain <- head(df, 5)
dftest <- tail(df, 2)
model <- glm("cbind(died, didntdie) ~ plants + sun + water", data=dftrain, family="binomial")
At this point, predict(model, dftest)
returns the log-odds (giving a probability of death) for the final two sets of features in my dataframe. Now I wish to compute an AUC curve. My observations are in dftest[c('died','didntdie')]
. My predictions are essentially probabilities. AUC, ROCR, etc expect both predictions and observations to be a list of bernoulli responses. I can't find documentation on how to use this response matrix instead. Any help appreciated.
For starters, you could expand the data frame to synthesize binary outcomes with counts that exploit the weight=
argument to glm()
.
obs <- died + didntdie
df <- df[rep(1:length(obs), each= 2),] # one row for died and one for didn't
df$survived <- rep(c(0L,1L), times=length(obs)) # create binary outcome for survival
df$weight <- c(rbind(died, didntdie)) # assign weights
df
# died didntdie plants sun water survived weight
# 1 10 2 Cactus Full N 0 10
# 1.1 10 2 Cactus Full N 1 2
# 2 10 10 Tree Half Y 0 10
# 2.1 10 10 Tree Half Y 1 10
# 3 8 8 Cactus Half Y 0 8
# 3.1 8 8 Cactus Half Y 1 8
# 4 2 20 Tree Full N 0 2
# 4.1 2 20 Tree Full N 1 20
# 5 15 10 Flower Full Y 0 15
# 5.1 15 10 Flower Full Y 1 10
# 6 20 10 Tree Half N 0 20
# 6.1 20 10 Tree Half N 1 10
# 7 12 10 Tree Full N 0 12
# 7.1 12 10 Tree Full N 1 10
model <- glm(survived ~ plants + sun + water, data=df, family="binomial", weights = weight)
If you want to do the train/test split you will need to do another expansion, this time duplicating rows on weight
. Otherwise your test set is not a random holdout, at least one randomized at the individual plant level, which may invalidate your results (depending on what you are trying to conclude).
Thus you would do something like
df <- df[rep(1:nrow(df), times = df$weight),]
model <- glm(survived ~ plants + sun + water, data=df, family="binomial")
# note the model does not change
library(pROC)
auc(model$fitted.values, df$survived)
# Area under the curve: 0.5833
Note this is the in-sample AUC. You should use a randomized holdout (or better yet, cross-validation) to estimate out-of-sample AUC. Using the top N rows of the data.frame for the split is not a good idea unless the row order has already been randomized.