Tags: pyspark, scikit-learn, apache-spark-mllib, apache-spark-ml

Why is pyspark's areaUnderROC different from sklearn's roc_auc_score?


I'm running a few binary classifications with pyspark, and I'm using BinaryClassificationEvaluator to evaluate the predictions made on the test set. Why on earth do I get different results if I use sklearn's roc_auc_score? For example:

from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import RandomForestClassifier

trainDF, testDF = df.randomSplit([.8, .2], seed=42)
evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
rfModel = rf.fit(trainDF)
prediction = rfModel.transform(testDF)

# Now the DataFrame `prediction` has these columns: 'label', 'prediction', 'probability'
#  +-----+----------+---------------------+
#  |label|prediction|probability          |
#  +-----+----------+---------------------+
#  |0    |0.0       |[1.0,0.0]            |
#  |0    |0.0       |[0.9765625,0.0234375]|
#  |0    |0.0       |[0.9765625,0.0234375]|
#  +-----+----------+---------------------+

areaUnderROC = evaluator.evaluate(prediction)  # returns 0.954459

# Now with pandas and scikit-learn
from sklearn.metrics import roc_auc_score

RF_pred = prediction.select('label', 'prediction', 'probability').toPandas()
probRF = []
for i in range(len(RF_pred)):
    probRF.append(RF_pred['probability'][i][1])  # take only the probability of label 1

auc = roc_auc_score(RF_pred['label'], probRF)  # returns 0.9962

How is this possible?
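As an aside, the per-row loop above can be replaced with a single vectorized pass once the data is in pandas; Spark's probability vectors support indexing, so the same lambda works on the real column too. A minimal sketch with hypothetical toy rows standing in for the toPandas() result:

```python
import pandas as pd

# Hypothetical toy stand-in for the toPandas() result: each 'probability'
# entry is an indexable vector of [P(class 0), P(class 1)]
RF_pred = pd.DataFrame({
    "label": [0, 0, 1],
    "probability": [[0.97, 0.03], [0.60, 0.40], [0.20, 0.80]],
})

# Vectorized equivalent of the per-row loop: pick P(class 1) from each row
probRF = RF_pred["probability"].apply(lambda v: float(v[1])).tolist()
print(probRF)  # [0.03, 0.4, 0.8]
```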


Solution

  • I should have used the probability column, or left rawPredictionCol at its default value. That way I get the same value as roc_auc_score. Shouldn't I get an error or a warning when passing the wrong column?

    evaluator = BinaryClassificationEvaluator(
                                             labelCol="label",
                                             rawPredictionCol="probability", 
                                             metricName="areaUnderROC")
    

    I hope this helps, because there isn't much about BinaryClassificationEvaluator even in the official documentation; the examples there only use MulticlassClassificationEvaluator.
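The gap itself makes sense once you remember that areaUnderROC is a ranking metric: the prediction column only holds thresholded 0/1 labels, so the evaluator can no longer order examples by confidence, while the probability column preserves that ordering. A minimal sketch of the effect with made-up toy numbers (the labels and probabilities below are hypothetical, not from the model above):

```python
from sklearn.metrics import roc_auc_score

# Hypothetical toy labels and P(class 1) scores
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90, 0.55, 0.45]

# Thresholding at 0.5 collapses the scores to hard 0/1 predictions,
# which is roughly what rawPredictionCol="prediction" fed the evaluator
y_hard = [1 if p >= 0.5 else 0 for p in y_prob]

print(roc_auc_score(y_true, y_prob))  # 0.9375 — full ranking information
print(roc_auc_score(y_true, y_hard))  # 0.75   — ranking collapsed to two ties
```

The same mechanism explains why the pyspark evaluator returned a lower number with rawPredictionCol="prediction" than sklearn did with the class-1 probabilities.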