
Scala Spark : Calculate grouped-by AUC


I'm trying to compute the AUC (area under the ROC curve) grouped by a key field using the Scala API, similar to the following question: PySpark: Calculate grouped-by AUC.

Unfortunately, I can't use sklearn. How can I proceed?


Solution

  • We will use the same method as sklearn and MLlib: the trapezoidal rule, a technique for approximating a definite integral.
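    For two consecutive curve points (x1, y1) and (x2, y2), the trapezoid between them has area (x2 - x1) * (y1 + y2) / 2, and the AUC is the sum of these areas over all consecutive pairs of points.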

    It's quite straightforward; you can find essentially the same code in Spark's source code (MLlib's AreaUnderCurve helper).

    // Area of the trapezoid defined by two consecutive points of the curve
    def trapezoid(points: Seq[(Double, Double)]): Double = {
        require(points.length == 2)
        val x = points.head
        val y = points.last
        (y._1 - x._1) * (y._2 + x._2) / 2.0
    }
    
    // Sum the trapezoid areas over a sliding window of two consecutive points
    def areaUnderCurve(curve: Iterable[(Double, Double)]): Double = {
        curve.toIterator.sliding(2).withPartial(false).aggregate(0.0)(
          seqop = (auc: Double, points: Seq[(Double, Double)]) => auc + trapezoid(points),
          combop = _ + _
        )
    }
    
    val seq = Seq((0.0, 0.0), (1.0, 1.0), (2.0, 3.0), (3.0, 0.0))
    areaUnderCurve(seq)
    // res77: Double = 4.0 
    

    The result is 4.0 as expected.
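    As a hand check, the three trapezoids contribute (1.0 - 0.0) * (0.0 + 1.0) / 2 = 0.5, (2.0 - 1.0) * (1.0 + 3.0) / 2 = 2.0 and (3.0 - 2.0) * (3.0 + 0.0) / 2 = 1.5, which indeed sum to 4.0.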

    Now let's apply it to a dataset. Here the data is already grouped by key:

    val data = Seq(
      ("id1", Array((0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.0))),
      ("id2", Array((0.5, 1.0), (0.6, 0.0), (0.7, 1.0), (0.8, 0.3)))
    ).toDF("key", "values")
    
    case class Record(key: String, values: Seq[(Double, Double)])
    
    data.as[Record].map(r => (r.key, r.values, areaUnderCurve(r.values))).show
    // +---+--------------------+-------------------+
    // | _1|                  _2|                 _3|
    // +---+--------------------+-------------------+
    // |id1|[[0.5, 1.0], [0.6...|0.15000000000000002|
    // |id2|[[0.5, 1.0], [0.6...|0.16500000000000004|
    // +---+--------------------+-------------------+
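
    The snippets above assume a spark-shell session, where `spark` and its implicits are already in scope; in a standalone application you have to set that up yourself. You can also name the output columns instead of keeping the default _1/_2/_3. A minimal sketch (the app name and the "auc" column label are just illustrative choices):

    import org.apache.spark.sql.SparkSession
    
    // Standalone setup; in spark-shell, `spark` and the implicits are already available
    val spark = SparkSession.builder().appName("grouped-auc").master("local[*]").getOrCreate()
    import spark.implicits._
    
    // Same computation as above, keeping only the key and its AUC, with named columns
    data.as[Record]
      .map(r => (r.key, areaUnderCurve(r.values)))
      .toDF("key", "auc")
      .show
    // id1 -> 0.15000000000000002, id2 -> 0.16500000000000004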
    

    I hope this helps.