I'm relatively new to this field and a bit confused right now, so let me explain. I have some elements in my data, each with a value between 0 and 1 and an associated label (1 or 0). I need to test some thresholds: for example, with a threshold of 0.4, all values at or above 0.4 are predicted as true (1) and all values below 0.4 are predicted as false (0). I don't think I need a machine learning classifier because, based on the threshold I choose, I already know which label to assign to each element.
This is what I've done so far:
prediction = []
for row in range(dfAggr.shape[0]):
    if dfAggr['value'].values[row] >= threshold:
        prediction.append(1)
    else:
        prediction.append(0)
label = dfAggr['truth'].values.astype(int)
#ROC CURVE
fpr, tpr, thresholds = roc_curve(label, prediction)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, label='ROC (area = %0.2f)' % (roc_auc))
plt.plot([0, 1], [0, 1], '--', color=(0.6, 0.6, 0.6), label='Luck')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.grid()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.savefig("rocauc.pdf", format="pdf")
plt.show()
I think this plot is quite wrong, since I want a ROC curve built by testing each possible threshold between 0 and 1, so that I can find the best possible cutoff value.
Is what I've done conceptually wrong?
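I assume you are using from sklearn.metrics import roc_curve. The roc_curve function will go through all the thresholds for you, so there is no need to pre-select one yourself. If you pass it predictions that are already binarized, it only sees the two values 0 and 1, and the curve collapses to a single operating point joined to the corners (0, 0) and (1, 1).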
You should do something like this:
predictions = dfAggr['value'].values
label = dfAggr['truth'].values.astype(int)
fpr, tpr, thresholds = roc_curve(label, predictions)
[...]
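Since the question also asks for the best cutoff, here is a minimal sketch of one common way to pick it from the roc_curve output: take the threshold that maximizes Youden's J statistic (TPR minus FPR). This criterion is my assumption, not something the question specifies, and the scores and labels arrays below are made-up stand-ins for dfAggr['value'] and dfAggr['truth'].

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Made-up data standing in for dfAggr['value'] and dfAggr['truth'].
scores = np.array([0.1, 0.2, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9])
labels = np.array([0, 0, 1, 0, 0, 1, 1, 1])

# Pass the raw scores: roc_curve sweeps over the thresholds itself.
fpr, tpr, thresholds = roc_curve(labels, scores)
roc_auc = auc(fpr, tpr)

# Youden's J statistic (an assumed selection criterion): TPR - FPR.
j_scores = tpr - fpr
best_index = int(np.argmax(j_scores))
best_threshold = thresholds[best_index]

# For comparison: binarizing first leaves only three points on the curve.
binarized = (scores >= 0.4).astype(int)
fpr_b, tpr_b, thresholds_b = roc_curve(labels, binarized)

print("AUC from raw scores:", roc_auc)
print("Threshold maximizing TPR - FPR:", best_threshold)
print("Points on the binarized curve:", len(thresholds_b))
```

Other criteria, such as the point closest to the top-left corner or a cost-weighted trade-off between false positives and false negatives, may suit your problem better. The important part is that the selection happens on the arrays returned by roc_curve, not before calling it.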