r, dplyr, roc, auc, proc-r-package

Issue computing AUC with pROC package


I'm trying to use a function that calls on the pROC package in R to calculate the area under the curve for a number of different outcomes.

# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
  pROC::auc(outcome_var, predictor_var)
}

To do this, I intend to refer to the outcome names stored in a vector (as below).

# Create a vector of outcome names 
outcome <- c('outcome_1', 'outcome_2')

However, I am having problems defining the variables to pass to this function. When I try, I get the error: "Error in roc.default(response, predictor, auc = TRUE, ...): 'response' must have two levels". I can't work out why, since as far as I can tell the outcome has only two levels...

I would be so happy if anyone could help me!

Here is a reproducible example using the iris dataset in R.

library(pROC)
library(datasets)
library(dplyr)

# Use iris dataset to generate binary variables needed for function
df <- iris %>%
  dplyr::mutate(outcome_1 = as.numeric(ntile(Sepal.Length, 4) == 4),
                outcome_2 = as.numeric(ntile(Petal.Length, 4) == 4)) %>%
  dplyr::rename(predictor_1 = Petal.Width)

# Inspect binary outcome variables 
df %>% group_by(outcome_1) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))
df %>% group_by(outcome_2) %>% summarise(n = n()) %>% mutate(Freq = n/sum(n))

# Function used to compute area under the curve
proc_auc <- function(outcome_var, predictor_var) {
  pROC::auc(outcome_var, predictor_var)
}

# Create a vector of outcome names 
outcome <- c('outcome_1', 'outcome_2')

# Define variables to go into function
outcome_var <- df %>% dplyr::select(outcome[[1]])
predictor_var <- df %>% dplyr::select(predictor_1)


# Use function: the first call works, but the second one fails!
proc_auc(df$outcome_1, df$predictor_1)
proc_auc(outcome_var, predictor_var)


Solution

  • You'll have to familiarize yourself with dplyr's non-standard evaluation, which makes it pretty hard to program with. In particular, you need to realize that passing a variable name is an indirection, and that there is a special syntax for it. Note also that df %>% dplyr::select(...) returns a one-column data frame rather than a plain vector, so roc() does not see the two levels of the outcome.

    If you want to stay with the pipes / non-standard evaluation, you can use the roc_ function, which follows dplyr's older naming convention (a trailing underscore) for functions that take column names as character strings instead of unquoted column names.

    proc_auc2 <- function(data, outcome_var, predictor_var) {
        pROC::auc(pROC::roc_(data, outcome_var, predictor_var))
    }
    

    You can then pass the column names as strings to this new function:

    proc_auc2(df, outcome[[1]], "predictor_1")
    # or equivalently:
    df %>% proc_auc2(outcome[[1]], "predictor_1")
    

    That being said, for most use cases you probably want to follow @druskacik's answer and use standard R evaluation.
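
For reference, here is a minimal sketch of what the standard-evaluation route might look like with the question's own data. It is an illustration only, not necessarily the code from the answer mentioned above. It reuses the `df`, `outcome`, and `proc_auc` definitions from the question; the `aucs` name is introduced here for illustration. The key point is that `[[` extracts a column as a plain vector, whereas `select()` keeps it wrapped in a data frame.

```r
library(pROC)
library(dplyr)

# Same setup as in the question
df <- iris %>%
  mutate(outcome_1 = as.numeric(ntile(Sepal.Length, 4) == 4),
         outcome_2 = as.numeric(ntile(Petal.Length, 4) == 4)) %>%
  rename(predictor_1 = Petal.Width)

outcome <- c("outcome_1", "outcome_2")

proc_auc <- function(outcome_var, predictor_var) {
  pROC::auc(outcome_var, predictor_var)
}

# select() returns a one-column data frame, not a vector
is.data.frame(df %>% select(all_of(outcome[[1]])))  # TRUE

# `[[` with a character name returns the underlying numeric vector
is.numeric(df[[outcome[[1]]]])                      # TRUE

# Compute the AUC for every outcome name in the vector;
# as.numeric() drops the "auc" class so sapply() returns a plain named vector
aucs <- sapply(outcome, function(nm) {
  as.numeric(proc_auc(df[[nm]], df$predictor_1))
})
aucs
```

Because the column is looked up by its name as a string, the same call works unchanged inside a loop or `sapply()` over any number of outcomes.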