I am trying to plot a ROC curve with standard deviation using the ROCR package.
For a reproducible example I am using the quality.csv file, which can be found here: https://courses.edx.org/courses/course-v1:MITx+15.071x_3+1T2016/courseware/5893e4c5afb74898b8e7d9773e918208/030bf0a7275744f4a3f6f74b95169c04/
My code is the following:
data <- fread("quality.csv")
glimpse(data)
set.seed(88)
split <- sample.split(data$PoorCare, SplitRatio = 0.75)
data_train <- data[split, ]
data_test <- data[!split, ]
#--------------------------------------------------------------------------
# FITTING A MODEL
#--------------------------------------------------------------------------
model <- glm(PoorCare ~ OfficeVisits + Narcotics , data_train, family = "binomial")
#--------------------------------------------------------------------------
# MAKE PREDICTIONS ON THE TEST DATASET
#--------------------------------------------------------------------------
predict_Test <- predict(model, type = "response", newdata = data_test)
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
# THE ROCR PACKAGE
#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
###########################################################################
# CREATE A PERFORMANCE OBJECT
###########################################################################
prediction_obj <- prediction(predict_Test, data_test$PoorCare)
#==========================================================================
# CALCULATE AUC
#==========================================================================
auc = as.numeric(performance(prediction_obj , "auc")@y.values)
# 0.7994792
#==========================================================================
# PLOT ROC CURVE WITH ERROR ESTIMATES
#==========================================================================
perf <- performance(prediction_obj, "tpr", "fpr")
plot(perf, colorize=TRUE, avg='threshold', spread.estimate='stddev', spread.scale = 2)
What I get is a ROC curve but without the standard errors.
Could you indicate what is wrong with my code and how to correct it?
Your advice will be appreciated.
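The standard deviations and confidence intervals of a ROC curve can be plotted only if several repeated (cross-validation or bootstrap) predictions have been performed. Your code produces a single set of predictions, so there is nothing to estimate a spread from.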
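A quick way to see this is to compare how ROCR stores a single prediction vector versus a list of them. The sketch below uses simulated scores and labels (an assumption, standing in for quality.csv; the names sim_scores, sim_labels and the sizes are made up for illustration):

```r
library(ROCR)

set.seed(1)
reps <- 5  # small number of runs, for illustration only

# Simulated data, NOT the quality.csv data: 28 negatives and 12 positives per run
labels_k   <- rep(c(0, 1), times = c(28, 12))
sim_labels <- rep(list(labels_k), reps)
# Positives get higher scores on average, so the curves sit above the diagonal
sim_scores <- lapply(sim_labels, function(y) plogis(rnorm(length(y), mean = 2 * y - 1)))

# A list of runs gives a performance object with one curve per run
perf_multi   <- performance(prediction(sim_scores, sim_labels), "tpr", "fpr")
n_runs_multi <- length(perf_multi@x.values)

# A single vector gives exactly one curve, so no spread can be computed
perf_single   <- performance(prediction(sim_scores[[1]], sim_labels[[1]]), "tpr", "fpr")
n_runs_single <- length(perf_single@x.values)

cat("curves in multi-run object: ", n_runs_multi, "\n")
cat("curves in single-run object:", n_runs_single, "\n")
```

With only one curve stored, avg and spread.estimate have nothing to average over, which matches the plot you got.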
Consider, for example, 100 repeated splits of the data into training and testing sets, with glm estimation and prediction on each split:
library(dplyr)
library(data.table)
library(caTools)
library(ROCR)
data <- fread("quality.csv")
glimpse(data)
set.seed(1)
reps <- 100
predTests <- vector(mode="list", reps)
Labels <- vector(mode="list", reps)
for (k in 1:reps) {
  splitk <- sample.split(data$PoorCare, SplitRatio = 0.75)
  data_traink <- data[splitk, ]
  data_testk <- data[!splitk, ]
  model <- glm(PoorCare ~ OfficeVisits + Narcotics,
               data_traink, family = "binomial")
  predTests[[k]] <- predict(model, type = "response", newdata = data_testk)
  Labels[[k]] <- data_testk$PoorCare
}
Now calculate the prediction and performance objects using the predTests and Labels lists:
predObjs <- prediction(predTests, Labels)
Perfs <- performance(predObjs , "tpr", "fpr")
and plot the individual ROC curves together with the threshold-averaged curve and its standard deviation bars:
plot(Perfs, col="grey82", lty=3)
plot(Perfs, lwd=3, avg="threshold", spread.estimate="stddev", add=TRUE, colorize=TRUE)
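The same multi-run prediction object can also give a spread for the AUC, not just for the curve. Below is a minimal sketch on simulated data (the scores, labels and the names pred_objs and auc_values are assumptions for illustration, not the quality.csv results):

```r
library(ROCR)

set.seed(1)
reps <- 20  # number of simulated runs, chosen arbitrarily

# Simulated data, NOT the quality.csv data
labels_k   <- rep(c(0, 1), times = c(28, 12))
sim_labels <- rep(list(labels_k), reps)
sim_scores <- lapply(sim_labels, function(y) plogis(rnorm(length(y), mean = 2 * y - 1)))

pred_objs <- prediction(sim_scores, sim_labels)

# performance(..., "auc") stores one AUC per run in the y.values slot
auc_values <- unlist(performance(pred_objs, "auc")@y.values)
auc_mean   <- mean(auc_values)
auc_sd     <- sd(auc_values)

cat("mean AUC:", round(auc_mean, 3), " sd:", round(auc_sd, 3), "\n")
```

With the objects from the answer, the equivalent call would be unlist(performance(predObjs, "auc")@y.values). For the plot itself, spread.estimate also accepts "stderror" and "boxplot", and spread.scale widens the bars by a constant factor.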