
data values in ROC curve using PRROC package


I am trying to plot a ROC curve for an identifier used to distinguish positive incidences from a background dataset. The identifier produces a list of probability scores, with some overlap between the two groups.

FG          BG
0.02        0.10
0.03        0.25 
0.02        0.12
0.04        0.16
0.05        0.45
0.12        0.31
0.13        0.20

(where FG = Positive and BG = Negative.)

I am plotting a ROC curve using PRROC in R to assess how well the identifier classifies the data into the correct group. Although there is a clear distinction between the classifier values for the positive and negative datasets, my current ROC plot in R shows a low AUC value. My probability scores for the positive data are lower than those for the background, so if I switch the classification around and use the background as the foreground points, I get a high-scoring AUC curve. I am not 100% clear why this is the case, which plot is best to use, or whether there is an additional step I have missed before analysing my data.

library(PRROC)
roc <- roc.curve(scores.class0 = FG, scores.class1 = BG, curve = TRUE)

ROC curve

Area under curve:
0.07143

roc2 <- roc.curve(scores.class0 = BG, scores.class1 = FG, curve = TRUE)

ROC curve

Area under curve:
0.92857

Solution

  • As you have indeed noticed, most ROC analysis tools assume that the scores in your positive class are higher than those of the negative class. More formally, an instance is classified as "positive" if X > T, where T is the decision threshold, and negative otherwise.

    There is no fundamental reason for it to be so. A decision rule such as X < T is perfectly valid; however, most ROC software doesn't offer that option.

    Using your first option, which results in AUC = 0.07143, would imply that your classifier performs worse than random. This is not actually the case: the low value only reflects the direction of your scores, not the quality of the identifier.

    As you noticed, swapping the class labels yields the correct curve. This works because ROC curves are insensitive to class distributions, so the classes can be swapped without a problem. However, I wouldn't personally recommend that option. I can see two cases where it can be misleading:

    • someone else looking at the code (or you, in a few months) might conclude the classes are wrong and "fix" them;
    • if you want to apply the same code to PR curves, which are sensitive to class distributions, you cannot swap the classes.
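The swap symmetry is easy to check by hand: the AUC equals the probability that a randomly chosen positive scores above a randomly chosen negative (ties counting half), so relabelling the classes simply gives 1 - AUC. A minimal Python sketch of this pairwise (Mann-Whitney) computation, using the scores from the question (the `auc` helper is my own, just to make the arithmetic explicit outside R):

```python
# Pairwise (Mann-Whitney) definition of the AUC:
# AUC = P(pos > neg) + 0.5 * P(pos == neg), over all positive/negative pairs.
FG = [0.02, 0.03, 0.02, 0.04, 0.05, 0.12, 0.13]  # positives (foreground)
BG = [0.10, 0.25, 0.12, 0.16, 0.45, 0.31, 0.20]  # negatives (background)

def auc(pos, neg):
    # Count each pair as 1 if the positive wins, 0.5 on a tie, 0 otherwise.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(round(auc(FG, BG), 5))  # 0.07143: positives score LOWER than negatives
print(round(auc(BG, FG), 5))  # 0.92857: swapping the classes gives 1 - AUC
```

This reproduces both numbers from the question: only 3 of the 49 FG/BG pairs have the positive on top (plus one tie), giving 3.5/49 = 0.07143, and the swapped labelling gives the complementary 45.5/49 = 0.92857.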

    An alternative and preferable approach would be to invert your scores for this analysis, so that the positive class effectively has higher scores:

    roc <- roc.curve(scores.class0 = -FG, scores.class1 = -BG, curve = TRUE)
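Negating the scores has exactly the same effect as swapping the classes, because it reverses the ranking: every pair where the negative used to win now has the positive on top. A quick Python check with the same pairwise AUC definition as above (the `auc` helper is my own illustration, not part of PRROC):

```python
# Negating every score reverses the ranking, so the pairwise AUC flips:
# auc(-pos, -neg) == 1 - auc(pos, neg) (ties are unaffected by negation).
FG = [0.02, 0.03, 0.02, 0.04, 0.05, 0.12, 0.13]  # positives (foreground)
BG = [0.10, 0.25, 0.12, 0.16, 0.45, 0.31, 0.20]  # negatives (background)

def auc(pos, neg):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

neg_fg = [-x for x in FG]
neg_bg = [-x for x in BG]
print(round(auc(neg_fg, neg_bg), 5))  # 0.92857, same as swapping the classes
```

With the scores negated, the class labels stay truthful, so the same vectors can also be fed to a PR curve without any relabelling.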