java apache-spark cassandra datastax datastax-java-driver

How to do aggragation like avg, max sum in cassandra table using spark java api

I am having huge data in cassandra db, i want to do aggregation like avg, max and sum for some column name name using spark java api

I tried like below

cassandraRowsRDD
  .select("name", "age", "ann_salaray", "dept","bucket", "resourceid", "salaray")
  .where("timestamp = ?", "2018-01-09 00:00:00")
  .withAscOrder()

I saw this method - .aggregate(zeroValue, seqOp, combOp), but don't know how to use it

Expected :

max(salary column name)
avg(salary column name)

I have tried with CQL, getting failed because of huge data

Can any one give me an example for aggregation in cassandra tables using spark java api

Solution

the first parameter provides so-called "zero value" that is used to initialize "accumulator", 2nd parameter - function that takes accumulator & single value from your RDD, and 3rd parameter - function that takes 2 accumulators and combine them.

For your task you may use something like this (pseudo-code)

res = rdd.aggregate((0,0,0),
   (acc, value) => (acc._1 + 1,
                    acc._2 + value.salary,
                    if (acc._3 > value.salary) then acc._3 else value.salary),
   (acc1, acc2) => (acc1._1 + acc2._1,
                    acc1._2 + acc2._2,
                    if (acc1._3 > acc2._3) then acc1._3 else acc2._3))
 val avg = res._2/res._1
 val max = res._3

In this case we have:

(0,0,0) - tuple of 3 elements representing, correspondingly: number of elements in RDD, sum of all salaries, and max salary
function that generate a new tuple from accumulator & value
function that combines 2 tuples

and then having number of entries, full sum of salaries, and max, we can find all necessary data.