Calculate AUC in Python by hand


Using R, I am able to manually calculate [and plot] the AUC using the following code and for loop:

test = data.frame(cbind(dt$DV, predicted_prob))
colnames(test)[1] = 'DV' 
colnames(test)[2] = 'DV_pred_prob' 

TP = rep(NA,101)
FN = rep(NA,101)
FP = rep(NA,101)
TN = rep(NA,101)
Sensitivity = rep(NA,101)
Specificity = rep(NA,101)
AUROC = 0

for(i in 0:100){
  test$temp = 0
  test[test$DV_pred_prob > (i/100),"temp"] = 1
  TP[i+1] = nrow(test[test$DV==1 & test$temp==1,])
  FN[i+1] = nrow(test[test$DV==1 & test$temp==0,])
  FP[i+1] = nrow(test[test$DV==0 & test$temp==1,])
  TN[i+1] = nrow(test[test$DV==0 & test$temp==0,])
  Sensitivity[i+1] = TP[i+1] / (TP[i+1] + FN[i+1] )
  Specificity[i+1] = TN[i+1] / (TN[i+1] + FP[i+1] )
  if(i>0){
    AUROC = AUROC+0.5*(Specificity[i+1] - Specificity[i])*(Sensitivity[i+1] + Sensitivity[i])
  }
}

data = data.frame(cbind(Sensitivity,Specificity,id=(0:100)/100))

I am attempting to write the same code in Python, but I am running into the error "TypeError: 'Series' objects are mutable, thus they cannot be hashed".

I am very new to Python and am trying to become bilingual with R and Python. Can someone point me in the right direction in terms of solving this?

predictions = pd.DataFrame(predictions[1])
actual = pd.DataFrame(y_test)
test = pd.concat([actual.reset_index(drop=True), predictions], axis=1)
# Rename column Renew to 'actual' and '1' to 'predictions'
test.rename(columns={"Renew": "actual", 1: "predictions"}, inplace=True)

TP = np.repeat('NA', 101)
FN = np.repeat('NA', 101)
FP = np.repeat('NA', 101)
TN = np.repeat('NA', 101)
Sensitivity = np.repeat('NA', 101)
Specificity = np.repeat('NA', 101)
AUROC = 0

for i in range(100):
    test['temp'] = 0
    test[test['predictions'] > (i/100), "temp"] = 1
    TP[i+1] = [test[test["actual"]==1 and test["temp"]==1,]].shape[0]
    FN[i+1] = [test[test["actual"]==1 and test["temp"]==0,]].shape[0]
    FP[i+1] = [test[test["actual"]==0 and test["temp"]==1,]].shape[0]
    TN[i+1] = [test[test["actual"]==0 and test["temp"]==0,]].shape[0]
    Sensitivity[i+1] = TP[i+1] / (TP[i+1] + FN[i+1])
    Specificity[i+1] = TN[i+1] / (TN[i+1] + FP[i+1])
    if(i > 0):
            AUROC = AUROC+0.5*(Specificity[i+1] - Specificity[i])*(Sensitivity[i+1] + Sensitivity[i])

The error seems to be occurring around the portion of code containing (i/100).


Solution

  • Pandas indexing doesn't work the way you're expecting. You can't write df[rows, cols] as you would in R; to select rows and columns together, use .loc (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)

    So yes, you're right that the error comes from this line:

    test[test['predictions'] > (i/100), "temp"] = 1

    To fix it, use:

    test.loc[test['predictions'] > (i/100), "temp"] = 1
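
As a minimal sketch of that fix in isolation: the small frame below is made-up toy data standing in for the asker's test frame, and the threshold of 0.3 is arbitrary. The row selector passed to .loc is a boolean Series and the column selector is a label.

```python
import pandas as pd

# Hypothetical toy data standing in for the asker's `test` frame.
test = pd.DataFrame({
    "actual": [0, 0, 1, 1],
    "predictions": [0.10, 0.40, 0.35, 0.80],
})

threshold = 0.3  # arbitrary cut-off, for illustration only
test["temp"] = 0
# .loc[row_mask, column_label]: set temp to 1 where the score exceeds the threshold.
test.loc[test["predictions"] > threshold, "temp"] = 1

flags = test["temp"].tolist()
```

Only the rows whose predictions exceed the threshold are flagged; the other rows keep the 0 assigned on the line before.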

    Next, you'll run into a second problem on the four lines that follow this pattern:

    TP[i+1] = test[test["actual"]==1 and test["temp"]==1,].shape[0]

    You need to wrap each comparison in parentheses, change "and" to "&", and drop the R-style trailing comma inside the brackets. There's a good discussion of why in the question "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()". So your code should look like:

    TP[i+1] = len(test[(test["actual"]==1) & (test["temp"]==1)])

    Note: you can use len() rather than the first element of the DataFrame's shape attribute to count the rows. That's just my preference, though.
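
To make the difference concrete, here is a small sketch on made-up data (the frame is hypothetical, not the asker's). Python's "and" asks for a single truth value from a whole Series, which pandas refuses to give, while "&" combines the two masks element by element. The parentheses matter because "&" binds more tightly than "==".

```python
import pandas as pd

# Hypothetical toy data: one row for each actual/temp combination.
test = pd.DataFrame({"actual": [1, 1, 0, 0], "temp": [1, 0, 1, 0]})

# `and` needs one True/False from each side, so pandas raises ValueError.
try:
    test[test["actual"] == 1 and test["temp"] == 1]
    and_raised = False
except ValueError:
    and_raised = True

# `&` works element-wise on the two boolean Series.
tp_mask = (test["actual"] == 1) & (test["temp"] == 1)
tp_by_len = len(test[tp_mask])     # count rows of the filtered frame
tp_by_sum = int(tp_mask.sum())     # or count the True values directly
```

Both counts agree; summing the mask simply skips building the filtered frame.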

    Finally, you can't mark missing values with the string 'NA' in Python; the NumPy equivalent is np.nan. The arithmetic in the loop, including the final if statement, will fail because np.repeat('NA', 101) builds arrays of strings as placeholders. I think np.zeros(101) would work for you.
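
A quick sketch of the placeholder options, using arrays of length 3 purely for illustration. np.full with np.nan is the closest match to R's rep(NA, n) if you want unfilled slots to stay visibly missing; np.zeros is fine when every slot gets overwritten anyway.

```python
import numpy as np

as_strings = np.repeat("NA", 3)   # array of strings, not usable for arithmetic
as_nan = np.full(3, np.nan)       # float array whose slots start out as NaN
as_zeros = np.zeros(3)            # float array whose slots start out as 0.0

as_nan[0] = 5                     # filling a slot stores the number as a float
still_missing = bool(np.isnan(as_nan[1]))
```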

    Your full code with my edits:

    import numpy as np
    import pandas as pd

    predictions = pd.DataFrame(predictions[1])
    actual = pd.DataFrame(y_test)
    test = pd.concat([actual.reset_index(drop=True), predictions], axis=1)

    # Rename the columns ('Renew' and 1) to 'actual' and 'predictions';
    # column names can be assigned from a list
    test.columns = ['actual', 'predictions']

    # One slot per threshold: 0.00, 0.01, ..., 1.00
    TP = np.zeros(101)
    FN = np.zeros(101)
    FP = np.zeros(101)
    TN = np.zeros(101)
    Sensitivity = np.zeros(101)
    Specificity = np.zeros(101)
    AUROC = 0

    # range(101) matches R's 0:100; Python is 0-indexed, so index with i, not i+1
    for i in range(101):
        test['temp'] = 0
        test.loc[test['predictions'] > (i / 100), 'temp'] = 1
        TP[i] = len(test[(test["actual"]==1) & (test["temp"]==1)])
        FN[i] = len(test[(test["actual"]==1) & (test["temp"]==0)])
        FP[i] = len(test[(test["actual"]==0) & (test["temp"]==1)])
        TN[i] = len(test[(test["actual"]==0) & (test["temp"]==0)])
        Sensitivity[i] = TP[i] / (TP[i] + FN[i])
        Specificity[i] = TN[i] / (TN[i] + FP[i])
        if i > 0:
            # Trapezoid between the previous threshold and this one
            AUROC += 0.5 * (Specificity[i] - Specificity[i-1]) * (Sensitivity[i] + Sensitivity[i-1])
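
If you want to sanity-check the loop without your own data, one option is to wrap the same threshold sweep in a function and run it on inputs where the answer is obvious. The sketch below is only that: the function name manual_auroc, its parameters, and the toy labels and scores are all made up for illustration, and it works on plain NumPy arrays rather than a DataFrame. It assumes both classes are present, otherwise the ratios divide by zero.

```python
import numpy as np

def manual_auroc(actual, scores, n_steps=100):
    """Trapezoidal area under sensitivity vs. specificity on a threshold grid.

    Hypothetical helper: mirrors the hand-written loop, one threshold per step.
    """
    actual = np.asarray(actual)
    scores = np.asarray(scores)
    sens = np.zeros(n_steps + 1)
    spec = np.zeros(n_steps + 1)
    for i in range(n_steps + 1):
        predicted = scores > i / n_steps          # boolean predictions at this threshold
        tp = np.sum((actual == 1) & predicted)
        fn = np.sum((actual == 1) & ~predicted)
        fp = np.sum((actual == 0) & predicted)
        tn = np.sum((actual == 0) & ~predicted)
        sens[i] = tp / (tp + fn)
        spec[i] = tn / (tn + fp)
    # Trapezoid rule over consecutive thresholds, applied to the whole arrays at once.
    return float(np.sum(0.5 * np.diff(spec) * (sens[1:] + sens[:-1])))

# Scores that separate the classes perfectly, then the same scores with labels flipped.
perfect = manual_auroc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
inverted = manual_auroc([1, 1, 0, 0], [0.1, 0.2, 0.8, 0.9])
```

With these toy inputs, perfect should come out at 1.0 and inverted at 0.0. On real data the result is an approximation whose resolution depends on the threshold grid, so small differences from a library implementation are to be expected.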