ROC curve in ggplot calculation [r]


I am trying to create a ROC curve in ggplot2.

I wrote the function myself. However, when I compare its output with that of the roc_curve function from the yardstick package (which I trust more), I get different results.

Where is the mistake in the function below?

library(ggplot2)
library(dplyr)
library(yardstick)
n <- 300 # sample size
data <- 
data.frame(
  real = sample(c(0,1), replace=TRUE, size=n), 
  pred = sample(runif(n), replace=TRUE, size=n)
)


simple_roc <- function(labels, scores){
  labels <- labels[order(scores, decreasing=TRUE)]
  data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)
}



simple_roc(data$real, data$pred) %>% 
  ggplot(aes(TPR, FPR)) + 
  geom_line()


yardstick::roc_curve(data, factor(real), pred) %>% 
  ggplot(aes(1 - specificity, sensitivity)) + 
  geom_line()



Solution

  • First, you need to anchor your ROC curve at the points (0, 0) and (1, 1).

    simple_roc <- function(labels, scores){
      labels <- labels[order(scores, decreasing=TRUE)]
      data.frame(
                 TPR = c(0, cumsum(labels)/sum(labels), 1),
                 FPR = c(0, cumsum(!labels)/sum(!labels), 1)
      )
    }
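
    As a quick sanity check of the anchoring, here is a minimal sketch on hand-made data. The names toy_labels, toy_scores, roc_toy and auc_toy are made up for this illustration, and the trapezoidal area is only there as a cross-check, assuming FPR on the x axis and TPR on the y axis.

    ```r
    # Anchored version from the answer, repeated so this block runs on its own.
    simple_roc <- function(labels, scores) {
      labels <- labels[order(scores, decreasing = TRUE)]
      data.frame(
        TPR = c(0, cumsum(labels) / sum(labels), 1),
        FPR = c(0, cumsum(!labels) / sum(!labels), 1)
      )
    }

    # Hypothetical toy data: four cases with distinct scores.
    toy_labels <- c(1, 0, 1, 0)
    toy_scores <- c(0.9, 0.8, 0.4, 0.1)

    roc_toy <- simple_roc(toy_labels, toy_scores)
    print(roc_toy)  # first row is (0, 0), last row is (1, 1)

    # Trapezoidal area under the curve (FPR on x, TPR on y).
    auc_toy <- sum(
      diff(roc_toy$FPR) * (head(roc_toy$TPR, -1) + tail(roc_toy$TPR, -1)) / 2
    )
    print(auc_toy)  # 0.75: three of the four positive/negative pairs are ranked correctly
    ```

    The appended 1 duplicates the last cumulative value, which is harmless for plotting.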
    

    Second, the order in which the rows are presented matters in ggplot2. Reversing the line direction should get you a bit closer:

    yardstick::roc_curve(data, factor(real), pred) %>% 
      ggplot(aes(rev(1 - specificity), rev(sensitivity))) + 
      geom_line()
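
    One possible way to line the two plots up, sketched under a few assumptions that go beyond the text above: FPR goes on the x axis and TPR on the y axis (the question maps them the other way round), geom_path() is swapped in for geom_line() because it connects points in row order rather than sorting them by x, and event_level = "second" tells yardstick to treat 1 as the event, since by default it uses the first factor level. The seed and the real_f column are introduced only for this illustration.

    ```r
    library(ggplot2)
    library(yardstick)

    set.seed(1)  # assumed seed, only to make the sketch reproducible
    n <- 300
    data <- data.frame(
      real = sample(c(0, 1), replace = TRUE, size = n),
      pred = sample(runif(n), replace = TRUE, size = n)
    )
    # Hypothetical helper column: yardstick expects the truth as a factor.
    data$real_f <- factor(data$real, levels = c(0, 1))

    # Anchored hand-rolled curve, as defined in the answer.
    simple_roc <- function(labels, scores) {
      labels <- labels[order(scores, decreasing = TRUE)]
      data.frame(
        TPR = c(0, cumsum(labels) / sum(labels), 1),
        FPR = c(0, cumsum(!labels) / sum(!labels), 1)
      )
    }

    # Hand-rolled curve: FPR on x, TPR on y, drawn in row order.
    p_manual <- ggplot(simple_roc(data$real, data$pred), aes(FPR, TPR)) +
      geom_path()

    # yardstick curve with the second level ("1") as the event.
    roc_ys <- roc_curve(data, real_f, pred, event_level = "second")
    p_yardstick <- ggplot(roc_ys, aes(1 - specificity, sensitivity)) +
      geom_path()

    print(p_manual)
    print(p_yardstick)
    ```

    The two plots should now look much more alike. Small differences can remain where scores are tied, because the hand-rolled function adds one point per observation.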
    

    I would recommend against using your own function for any serious work. Many other things can go wrong that well-maintained packages handle properly, such as missing values, infinite values, the absence of some labels, and others that I can't even think of right now.
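
    To give a feel for the kind of checks meant here, the sketch below wraps the anchored function in a few input validations. The name checked_roc and the choice to drop bad rows with a warning are assumptions made for illustration. This is not how yardstick or any other package does it, and it does not cover every case (tied scores, for instance).

    ```r
    # Hypothetical wrapper, not part of any package: validates inputs before
    # computing the anchored curve from the answer.
    checked_roc <- function(labels, scores) {
      stopifnot(length(labels) == length(scores))

      # Drop observations with a missing label or a missing/non-finite score.
      keep <- !is.na(labels) & is.finite(scores)
      if (!all(keep)) {
        warning(sum(!keep), " observation(s) dropped (missing or non-finite)")
      }
      labels <- labels[keep]
      scores <- scores[keep]

      if (!all(labels %in% c(0, 1))) stop("labels must be coded 0/1")
      if (length(unique(labels)) < 2) stop("both classes must be present")

      labels <- labels[order(scores, decreasing = TRUE)]
      data.frame(
        TPR = c(0, cumsum(labels) / sum(labels), 1),
        FPR = c(0, cumsum(!labels) / sum(!labels), 1)
      )
    }

    # One NA label and one infinite score are dropped, three rows remain.
    ok <- suppressWarnings(
      checked_roc(c(1, 0, NA, 1, 0), c(0.9, 0.2, 0.5, Inf, 0.4))
    )

    # Only one class present: the wrapper stops with an error.
    one_class <- tryCatch(
      checked_roc(c(1, 1, 1), c(0.3, 0.2, 0.1)),
      error = function(e) conditionMessage(e)
    )
    ```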