Tags: apache-spark, pyspark, apache-spark-ml

Spark/Pyspark: SVM - How to get Area-under-curve?


I have been dealing with random forest and naive Bayes lately. Now I want to use a support vector machine (SVM).

After fitting the model, I wanted to use the output columns "probability" and "label" to compute the AUC value. But now I have noticed that there is no "probability" column for SVM?!

Here you can see what I have done so far:

from pyspark.ml.classification import LinearSVC
from pyspark.mllib.evaluation import BinaryClassificationMetrics

svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)

scores = model.transform(train)
results = scores.select('probability', 'label')

# Create score-label set for 'BinaryClassificationMetrics'
results_collect = results.collect()
results_list = [(float(i[0][0]), 1.0 - float(i[1])) for i in results_collect]
scoreAndLabels = sc.parallelize(results_list)

metrics = BinaryClassificationMetrics(scoreAndLabels)
print("AUC-value: " + str(round(metrics.areaUnderROC, 4)))

That is the approach I have used in the past for random forest and naive Bayes, and I thought I could do the same with SVM. But it does not work, because there is no output column "probability".

Does anyone know why the "probability" column does not exist, and how I can compute the AUC value instead?


Solution

  • Using the most recent Spark/PySpark at the time of this answer:

    If you use the pyspark.ml module (as opposed to pyspark.mllib), you can work with a DataFrame as the interface:

    from pyspark.ml.classification import LinearSVC

    svm = LinearSVC(maxIter=5, regParam=0.01)
    model = svm.fit(train)
    test_prediction = model.transform(test)
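
    Unlike random forest or naive Bayes, LinearSVC produces no "probability" column; its transform output carries the raw decision margin in "rawPrediction" (plus the "prediction" column). A quick way to inspect this, as a sketch assuming the usual column names:

    # Show the columns LinearSVC actually produces; note there is no "probability".
    test_prediction.printSchema()
    test_prediction.select("rawPrediction", "prediction", "label").show(5)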
    

    Create the evaluator (see its source code for settings):

    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    evaluator = BinaryClassificationEvaluator()
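
    This is what makes it work for LinearSVC: by default the evaluator reads the "rawPrediction" column rather than "probability". Spelling out the defaults explicitly (a minimal sketch; parameter names as in BinaryClassificationEvaluator):

    # Equivalent to the no-argument constructor above.
    evaluator = BinaryClassificationEvaluator(
        rawPredictionCol="rawPrediction",  # LinearSVC writes its margins here
        labelCol="label",
        metricName="areaUnderROC",
    )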
    

    Apply the evaluator to the data (again, the source code shows more options):

    evaluation = evaluator.evaluate(test_prediction)
    

    The result of evaluate is, by default, the area under the ROC curve:

    print("evaluation (area under ROC): %f" % evaluation)