Search code examples
pysparkapache-spark-ml

pyspark extract ROC curve?


Is there a way to get the points on an ROC curve from Spark ML in pyspark? In the documentation I see an example for Scala but not python: https://spark.apache.org/docs/2.1.0/mllib-evaluation-metrics.html

Is that right? I can certainly think of ways to implement it but I have to imagine it’s faster if there’s a pre-built function. I’m working with 3 million scores and a few dozen models so speed matters.


Solution

  • As long as the ROC curve is a plot of FPR against TPR, you can extract the needed values as following:

    your_model.summary.roc.select('FPR').collect()
    your_model.summary.roc.select('TPR').collect())
    

    Where your_model could be for example a model you got from something like this:

    from pyspark.ml.classification import LogisticRegression
    log_reg = LogisticRegression()
    your_model = log_reg.fit(df)
    

    Now you should just plot FPR against TPR, using for example matplotlib.

    P.S.

    Here is a complete example for plotting ROC curve using a model named your_model (and anything else!). I've also plot a reference "random guess" line inside the ROC plot.

    import matplotlib.pyplot as plt
    plt.figure(figsize=(5,5))
    plt.plot([0, 1], [0, 1], 'r--')
    plt.plot(your_model.summary.roc.select('FPR').collect(),
             your_model.summary.roc.select('TPR').collect())
    plt.xlabel('FPR')
    plt.ylabel('TPR')
    plt.show()