java, apache-spark, cassandra, datastax, datastax-java-driver

How to do aggregations like avg, max, and sum on a Cassandra table using the Spark Java API


I have a large amount of data in a Cassandra database, and I want to compute aggregations such as avg, max, and sum over a column using the Spark Java API.

This is what I have tried so far:

cassandraRowsRDD
  .select("name", "age", "ann_salaray", "dept","bucket", "resourceid", "salaray")
  .where("timestamp = ?", "2018-01-09 00:00:00")
  .withAscOrder()

I found the method .aggregate(zeroValue, seqOp, combOp), but I don't know how to use it.

Expected result:

max(salary column name)
avg(salary column name)

I also tried doing this in CQL, but the query fails because of the amount of data.

Can anyone give me an example of aggregation over Cassandra tables using the Spark Java API?


Solution

  • In aggregate(zeroValue, seqOp, combOp), the first parameter is the so-called "zero value" used to initialize the accumulator; the second is a function that takes the accumulator and a single value from your RDD and returns the updated accumulator; the third is a function that takes two accumulators and combines them.

    For your task, you could use something like this (Scala-like pseudo-code):

    val res = rdd.aggregate((0, 0, 0),
       // fold one row into the accumulator: (count, sum, max)
       (acc, value) => (acc._1 + 1,
                        acc._2 + value.salary,
                        if (acc._3 > value.salary) acc._3 else value.salary),
       // merge the accumulators of two partitions
       (acc1, acc2) => (acc1._1 + acc2._1,
                        acc1._2 + acc2._2,
                        if (acc1._3 > acc2._3) acc1._3 else acc2._3))
    val avg = res._2.toDouble / res._1  // convert first to avoid integer division
    val max = res._3
    

    In this case we have:

    1. (0,0,0) - a tuple of 3 elements representing, respectively, the number of elements in the RDD, the sum of all salaries, and the max salary
    2. a function that generates a new tuple from the accumulator and a value
    3. a function that combines 2 tuples

    Once we have the number of entries, the total sum of salaries, and the max, we can derive everything we need: the sum and max directly, and the average as sum divided by count.
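Since the question asks for the Java API, the same idea can be sketched in Java with a small accumulator class instead of a tuple. This is only a sketch: the class name SalaryStats and its methods are hypothetical, the salary column is assumed to be readable as a double, and the Spark call shown in the trailing comment assumes a JavaRDD<CassandraRow> named rows obtained from the Cassandra connector.

```java
// Hypothetical accumulator holding count, sum, and max of one numeric column.
// Serializable so that Spark could ship it to the executors.
class SalaryStats implements java.io.Serializable {
    long count = 0L;
    double sum = 0.0;
    double max = Double.NEGATIVE_INFINITY; // any real salary replaces this

    // seqOp: fold a single value into this accumulator.
    SalaryStats add(double salary) {
        count += 1;
        sum += salary;
        max = Math.max(max, salary);
        return this;
    }

    // combOp: merge another partition's accumulator into this one.
    SalaryStats merge(SalaryStats other) {
        count += other.count;
        sum += other.sum;
        max = Math.max(max, other.max);
        return this;
    }

    // Average is derived at the end; NaN when no rows were seen.
    double average() {
        return count == 0 ? Double.NaN : sum / count;
    }
}

// Assumed usage with Spark (not compiled here; "rows" and the column
// name "salary" are assumptions):
//
//   SalaryStats stats = rows.aggregate(
//       new SalaryStats(),                               // zeroValue
//       (acc, row) -> acc.add(row.getDouble("salary")),  // seqOp
//       (a, b) -> a.merge(b));                           // combOp
//   double avg = stats.average();
//   double max = stats.max;
```

Mutating and returning the first argument in add and merge avoids allocating a new object per row; the zero value uses negative infinity for max rather than 0 so the result stays correct even if a column can hold negative values.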