Search code examples
scalaapache-spark-sqlgraphframes

How to find sum/avg of sparkVector element of a DataFrame in Spark/Scala?


I have pageranks result from ParallelPersonalizedPageRank in Graphframes, which is a DataFrame with each element as sparseVector as following:

+---------------------------------------+
|           pageranks                   |
+---------------------------------------+
|(1887,[0, 1, 2,...][0.1, 0.2, 0.3, ...]|
|(1887,[0, 1, 2,...][0.2, 0.3, 0.4, ...]|
|(1887,[0, 1, 2,...][0.3, 0.4, 0.5, ...]|
|(1887,[0, 1, 2,...][0.4, 0.5, 0.6, ...]|
|(1887,[0, 1, 2,...][0.5, 0.6, 0.7, ...]|

What is the best way to add all the element of the sparseVector and generatre a sum or average? I suppose we can converter each sparseVector to denseVector with toArray and traverse each array to get the result with two nested loop, and get some thing like this:

+-----------+
|pageranks  |
+-----------+
|avg1|
|avg2|
|avg3|
|avg4|
|avg5|
|... |

I am sure there should be better way, but I could not find much on the API docs about sparseVector operation. Thanks!


Solution

  • I think I found a solution without collect (materialize) the results and do nested loop in Scala. Just post here in case it is helpful for others.

    // convert Dataset element from SparseVector to Array
    val ranksNursingArray = ranksNursing.vertices
      .orderBy("id")
      .select("pageranks")
      .map(r => 
      r(0).asInstanceOf[org.apache.spark.ml.linalg.SparseVector].toArray)
    // Find average value of pageranks and add a column to DataFrame
    val ranksNursingAvg = ranksNursingArray
      .map{case (value) => (value, value.sum/value.length)}
      .toDF("pageranks", "pr-avg")
    

    The end results look like this:

    +--------------------+--------------------+                                     
    |           pageranks|              pr-avg|
    +--------------------+--------------------+
    |[1.52034575371428...|2.970332668789975E-5|
    |[0.0, 0.0, 0.0, 0...|5.160299770346173E-6|
    |[0.0, 0.0, 0.0, 0...|4.400537827779479E-6|
    |[0.0, 0.0, 0.0, 0...|3.010621958524792...|
    |[0.0, 0.0, 4.8987...|2.342424435412115E-5|
    |[0.0, 0.0, 1.6895...|6.955151139681538E-6|
    |[0.0, 0.0, 1.5669...| 5.47016001804886E-6|
    |[0.0, 0.0, 0.0, 2...|2.303811469709906E-5|
    |[0.0, 0.0, 0.0, 3...|1.985155979369427E-5|
    |[0.0, 0.0, 0.0, 0...|1.411993797780601...|
    +--------------------+--------------------+