I have a huge amount of data in a Cassandra DB, and I want to do aggregations like avg, max, and sum for some columns using the Spark Java API.
I tried the following:
cassandraRowsRDD
.select("name", "age", "ann_salaray", "dept","bucket", "resourceid", "salaray")
.where("timestamp = ?", "2018-01-09 00:00:00")
.withAscOrder()
I saw the method .aggregate(zeroValue, seqOp, combOp), but I don't know how to use it.
Expected:
max(salary column name)
avg(salary column name)
I have tried with CQL, but it fails because of the huge amount of data.
Can anyone give me an example of aggregation on Cassandra tables using the Spark Java API?
The first parameter provides the so-called "zero value" that is used to initialize the accumulator, the 2nd parameter is a function that takes the accumulator and a single value from your RDD, and the 3rd parameter is a function that takes two accumulators and combines them.
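For example, here is a minimal, self-contained Java sketch (the numbers, class name, and app name are made up purely for illustration) showing how the three parameters fit together when computing a count and a sum in one pass:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class AggregateDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("aggregate-demo").setMaster("local[*]"));

        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(10, 20, 30));

        // accumulator layout: double[]{count, sum}
        double[] res = numbers.aggregate(
                new double[]{0, 0},                                 // zero value
                (acc, n) -> new double[]{acc[0] + 1, acc[1] + n},   // seqOp: fold one element into the accumulator
                (a, b) -> new double[]{a[0] + b[0], a[1] + b[1]});  // combOp: merge accumulators from two partitions

        System.out.println("avg = " + res[1] / res[0]);             // prints avg = 20.0
        sc.stop();
    }
}

A primitive double[] is used as the accumulator here just to keep the example short; any serializable type works.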
For your task you could use something like this (Scala-style; a Java version is at the end of the answer):
// rdd is your cassandraRowsRDD after the select(...)/where(...) calls
// accumulator: (count, sum of salaries, max salary)
val res = rdd.aggregate((0, 0.0, 0.0))(
  (acc, row) => (acc._1 + 1,                          // seqOp: fold one row into the accumulator
                 acc._2 + row.getDouble("salary"),
                 math.max(acc._3, row.getDouble("salary"))),
  (acc1, acc2) => (acc1._1 + acc2._1,                 // combOp: merge accumulators from two partitions
                   acc1._2 + acc2._2,
                   math.max(acc1._3, acc2._3)))
val avg = res._2 / res._1
val max = res._3
In this case we have:
(0, 0.0, 0.0)
- a tuple of 3 elements representing, respectively: the number of elements in the RDD, the sum of all salaries, and the max salary. Having the number of entries, the full sum of salaries, and the max, we can compute everything that's needed.
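Since the question asks for the Spark Java API, here is a rough Java equivalent of the same aggregation. It's only a sketch: the keyspace and table names are placeholders, and it assumes the spark-cassandra-connector Java API (CassandraJavaUtil.javaFunctions) with a salary column that can be read as a double:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.datastax.spark.connector.japi.CassandraRow;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

public class SalaryStats {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("salary-stats"));

        // "my_keyspace" / "my_table" are placeholders; selecting only the needed column
        // keeps the amount of data pulled from Cassandra small
        JavaRDD<CassandraRow> rows = javaFunctions(sc)
                .cassandraTable("my_keyspace", "my_table")
                .select("salary")
                .where("timestamp = ?", "2018-01-09 00:00:00");

        // accumulator layout: double[]{count, sum, max}
        double[] res = rows.aggregate(
                new double[]{0, 0, 0},      // zero value (0 as the initial max is fine if salaries are non-negative)
                (acc, row) -> {
                    double salary = row.getDouble("salary");
                    return new double[]{acc[0] + 1, acc[1] + salary, Math.max(acc[2], salary)};
                },
                (a, b) -> new double[]{a[0] + b[0], a[1] + b[1], Math.max(a[2], b[2])});

        System.out.println("sum = " + res[1]);
        System.out.println("avg = " + res[1] / res[0]);
        System.out.println("max = " + res[2]);
        sc.stop();
    }
}

Everything is computed in a single pass over the table, which is the point of doing this in Spark instead of running separate CQL aggregates over the huge dataset.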