Tags: python, r, roc, auc, false-positive

ROC-AUC FPR FNR in Python and R?


I have a dataframe object in R/Python that looks like:

df columns:
fraud = [1,1,0,0,0,0,0,0,0,1]
score = [0.84, 1, 1.1, 0.4, 0.6, 0.13, 0.32, 1.4, 0.9, 0.45]

When I use roc_curve in Python, I get fpr, fnr and thresholds.

I have two questions, maybe a bit theoretical, but please explain them to me:

  1. How are these thresholds actually calculated? I have calculated fpr and fnr manually, but are these thresholds equal to the scores above?

  2. How can I generate the same fpr, fnr and thresholds in R?


Solution

  • The reported best threshold usually corresponds to the value that maximizes tpr + tnr (sensitivity + specificity). This criterion is called the Youden J index (tpr + tnr - 1), though it goes by several other names as well.
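
    To the first question: scikit-learn's roc_curve takes the distinct predicted scores themselves as the thresholds (in decreasing order, with one extra sentinel prepended), so those thresholds essentially are the scores from your data. pROC instead places each threshold midway between consecutive sorted scores, but it can return the same kind of table. A minimal sketch on the toy data from the question (assuming a reasonably recent pROC, where coords() accepts "fpr" and "fnr" as return values):

    library(pROC)
    fraud <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 1)
    score <- c(0.84, 1, 1.1, 0.4, 0.6, 0.13, 0.32, 1.4, 0.9, 0.45)
    # direction = "<" assumes higher scores indicate fraud (class 1)
    r <- roc(response = fraud, predictor = score, levels = c(0, 1), direction = "<")
    # one row per threshold with its fpr and fnr, analogous to roc_curve's output
    coords(r, "all", ret = c("threshold", "fpr", "fnr"), transpose = FALSE)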

    Take the following example with the Sonar dataset:

    library(mlbench)
    library(xgboost)
    library(caret)
    library(pROC)
    data(Sonar)
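
    For orientation: Sonar has 208 rows, with 60 numeric sonar readings (V1-V60) plus a two-level Class factor in column 61, which is why the code below drops column 61 and recodes Class to 0/1:

    dim(Sonar)          # 208 rows, 61 columns
    table(Sonar$Class)  # the two classes: M (mine) and R (rock)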
    

    Let's fit a model on part of the Sonar data and predict on the rest:

    # stratified 70/30 split on the class label
    ind <- createDataPartition(Sonar$Class, p = 0.7, list = FALSE)
    train <- Sonar[ind, ]
    test <- Sonar[-ind, ]
    X <- as.matrix(train[, -61])  # drop the Class column (column 61)
    dtrain <- xgb.DMatrix(data = X, label = as.numeric(train$Class) - 1)  # M -> 0, R -> 1
    dtest <- xgb.DMatrix(data = as.matrix(test[, -61]))
    

    Fit the model on the training data:

    model <- xgb.train(data = dtrain,
                       verbose = 0, maximize = TRUE,
                       params = list(objective = "binary:logistic",
                                     eval_metric = "auc",  # eval_metric belongs in params
                                     eta = 0.1,
                                     max_depth = 6,
                                     subsample = 0.8,
                                     lambda = 0.1),
                       nrounds = 10)
    
    preds <- predict(model, dtest)       # predicted probability of class 1 (R)
    true <- as.numeric(test$Class) - 1   # true labels recoded to 0/1
    
    
    plot(roc(response = true,
             predictor = preds,
             levels = c(0, 1)),
         lwd = 1.5, print.thres = TRUE, print.auc = TRUE, print.auc.y = 0.5)
    

    [Plot: ROC curve with the best threshold 0.578 and the AUC printed on the curve]

    So if you set the threshold at 0.578, you maximize the value tpr + tnr; the values in the parentheses on the plot are the corresponding tpr and tnr. Verify:

    sensitivity(as.factor(ifelse(preds > 0.578, "1", "0")), as.factor(true))
    #output
    [1] 0.9090909
    specificity(as.factor(ifelse(preds > 0.578, "1", "0")), as.factor(true))
    #output
    [1] 0.7586207
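
    Instead of reading the best threshold off the plot, you can ask pROC for it directly. A sketch, reusing preds and true from above (best.method = "youden" spells out the J-index criterion; direction = "<" again assumes higher scores mean class 1):

    roc_obj <- roc(response = true, predictor = preds, levels = c(0, 1), direction = "<")
    # the Youden-optimal threshold together with its sensitivity and specificity
    coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"),
           best.method = "youden", transpose = FALSE)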
    

    You could also compute predictions over many possible thresholds:

    # sweep thresholds over an even grid and record sensitivity/specificity
    thresh <- do.call(rbind, lapply((1:1000) / 1000, function(x) {
      sens <- sensitivity(as.factor(ifelse(preds > x, "1", "0")), as.factor(true))
      spec <- specificity(as.factor(ifelse(preds > x, "1", "0")), as.factor(true))
      data.frame(sens, spec)
    }))
    

    And now pick the row that maximizes sens + spec:

    thresh[which.max(rowSums(thresh)),]
    #output
             sens      spec
    560 0.9090909 0.7586207
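
    The winning row index maps back onto the threshold grid, so the sweep lands close to the 0.578 that pROC reported (small differences are expected, since pROC draws its thresholds from the data rather than from an even grid):

    # convert the winning row index (grid of 1:1000 / 1000) back to a threshold
    unname(which.max(rowSums(thresh))) / 1000
    #output
    [1] 0.56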
    

    You can also inspect the neighboring thresholds:

    thresh[555:600,]
    

    That being said, with financial data it is usually not just the predicted class that is of interest but also the cost associated with wrong predictions, which is typically not the same for false negatives and false positives. Such models are fit using cost-sensitive classification. More on the matter. On another note, when deciding on a threshold, you should do it either on cross-validated data or on a validation set specifically designated for the task; if you tune it on the test set, that inevitably leads to over-optimistic performance estimates.
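
    To make the cost idea concrete, here is a minimal sketch with an assumed 10:1 cost ratio (a missed fraud costing ten times a false alarm; the figures are purely illustrative), reusing preds and true from above:

    # assumed, illustrative costs: false negative = 10, false positive = 1
    cost_fn <- 10
    cost_fp <- 1
    grid <- (1:999) / 1000
    total_cost <- sapply(grid, function(x) {
      pred <- as.numeric(preds > x)
      cost_fn * sum(pred == 0 & true == 1) +  # cost of missed frauds
        cost_fp * sum(pred == 1 & true == 0)  # cost of false alarms
    })
    grid[which.min(total_cost)]  # cost-minimizing threshold instead of Youden's J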