I'm running a few binary classifications with PySpark, and I'm using BinaryClassificationEvaluator
to evaluate the predictions made on the test set. Why on earth do I get different results when I use sklearn's roc_auc_score? For example:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier
from sklearn.metrics import roc_auc_score

trainDF, testDF = df.randomSplit([.8, .2], seed=42)  # df: my DataFrame with 'features' and 'label'
evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
rfModel = rf.fit(trainDF)
prediction = rfModel.transform(testDF)
# the prediction DataFrame now has these columns: 'label', 'prediction', 'probability'
# +-----+----------+---------------------+
# |label|prediction|probability |
# +-----+----------+---------------------+
# |0 |0.0 |[1.0,0.0] |
# |0 |0.0 |[0.9765625,0.0234375]|
# |0 |0.0 |[0.9765625,0.0234375]|
# +-----+----------+---------------------+
areaUnderROC = evaluator.evaluate(prediction) #IT RETURNS 0.954459
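For context, transform() actually appends a rawPrediction column as well, even though I didn't show it in the sample above. A quick way to see what each column holds (a sketch reusing the prediction DataFrame from my code):

prediction.select('rawPrediction', 'probability', 'prediction').show(3, truncate=False)
# rawPrediction: per-class scores summed across the trees of the forest
# probability:   the same scores normalized to sum to 1
# prediction:    just the argmax, a plain double (0.0 or 1.0)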
#NOW I USE PANDAS
RF_pred = prediction.select('label', 'prediction', 'probability').toPandas()
probRF = []
for i in range(prediction.count()):
    probRF.append(RF_pred['probability'][i][1])  # keep only the probability of label 1
auc = roc_auc_score(RF_pred['label'], probRF) #IT RETURNS 0.9962
How is this possible?
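After some digging, I think I see it now. The evaluator above was fed the prediction column, which holds hard 0.0/1.0 labels, so the ROC curve it builds has a single operating point instead of one per threshold. You can reproduce the gap with sklearn alone; a minimal sketch with toy data (nothing from the pipeline above):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.9]                # continuous scores
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]   # thresholded to 0/1

print(roc_auc_score(y_true, y_prob))   # 1.0  -- the ranking is perfect
print(roc_auc_score(y_true, y_hard))   # 0.75 -- only one threshold survives

The two AUCs disagree even though both inputs come from the same model, which is exactly what happened between the evaluator and roc_auc_score above.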
So I should have used the probability column, or just left rawPredictionCol at its default value, I guess!! That way I get the same value as roc_auc_score. Shouldn't I have gotten an error or at least a warning for passing the wrong column?
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="probability",
    metricName="areaUnderROC")
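For the record, leaving rawPredictionCol at its default works too: it defaults to "rawPrediction", which transform() adds automatically, and for a random forest that vector is, as far as I can tell, just the probability vector scaled by a constant, so the ranking (and hence the AUC) comes out the same. A sketch, reusing the prediction DataFrame from above:

default_evaluator = BinaryClassificationEvaluator(labelCol="label")  # rawPredictionCol defaults to "rawPrediction"
print(default_evaluator.evaluate(prediction))  # same value as with "probability"

As for why there was no error: as far as I can tell, the evaluator only checks that the column is a double or a vector, and prediction is a plain double, so it silently computed an AUC from that degenerate single-threshold curve.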
I hope this helps, because there isn't much out there about BinaryClassificationEvaluator;
even the official documentation mostly uses MulticlassClassificationEvaluator
in its examples.
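PS: if you are on Spark 3.0+, the pandas loop above can be replaced with vector_to_array; a sketch of how I'd extract the positive-class probability now:

from pyspark.ml.functions import vector_to_array  # Spark 3.0+
from sklearn.metrics import roc_auc_score

pdf = (prediction
       .withColumn('p1', vector_to_array('probability')[1])  # P(label=1)
       .select('label', 'p1')
       .toPandas())
print(roc_auc_score(pdf['label'], pdf['p1']))  # same 0.9962 as the loop version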