
How to add three integer columns in a Spark SQL aggregation


I have come across an issue with Spark SQL aggregation. I have one DataFrame into which I'm loading records from Apache Phoenix.

val df = sqlContext.phoenixTableAsDataFrame(
  Metadata.tables(A.Test), Seq("ID", "date", "col1", "col2","col3"),
  predicate = Some("\"date\" = " + date), zkUrl = Some(zkURL))

In another DataFrame, I need to group by ID and date and then sum col1, col2, and col3, i.e.

val df1 = df.groupBy($"ID", $"date").agg(
  sum($"col1" + $"col2" + $"col3").alias("col4"))

But I'm getting an incorrect result from the sum. How can I sum all the columns (col1, col2, col3) and assign the result to col4?

Example:

Suppose the data looks like this:

ID,date,col1,col2,col3
1,2017-01-01,5,10,12
2,2017-01-01,6,9,17
3,2017-01-01,2,3,7
4,2017-01-01,5,11,13

Expected output:

ID,date,col4
1,2017-01-01,27
2,2017-01-01,32
3,2017-01-01,12
4,2017-01-01,29

Solution

  • I get a correct result with this code:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, Row}
    import org.apache.spark.sql.functions.{col, sum}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // `spark` is the SparkSession (predefined in spark-shell).
    // Sample rows: id, date, col1, col2, col3 (date is simplified to an integer here).
    val rowsRdd: RDD[Row] = spark.sparkContext.parallelize(
      Seq(
        Row(1, 1, 5, 10, 12),
        Row(2, 1, 6, 9, 17),
        Row(3, 1, 2, 3, 7),
        Row(4, 1, 5, 11, 13)
      )
    )

    // All five columns are non-nullable integers.
    val schema: StructType = new StructType()
      .add(StructField("id", IntegerType, false))
      .add(StructField("date", IntegerType, false))
      .add(StructField("col1", IntegerType, false))
      .add(StructField("col2", IntegerType, false))
      .add(StructField("col3", IntegerType, false))
    val df0: DataFrame = spark.createDataFrame(rowsRdd, schema)

    // Group by id and date, then sum the per-row total col1 + col2 + col3.
    val df = df0
      .groupBy(col("id"), col("date"))
      .agg(sum(col("col1") + col("col2") + col("col3")).alias("col4"))
      .sort("id")

    df.show()
    

    Result is:

    +---+----+----+
    | id|date|col4|
    +---+----+----+
    |  1|   1|  27|
    |  2|   1|  32|
    |  3|   1|  12|
    |  4|   1|  29|
    +---+----+----+
    

    Is this what you need?
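
    If the same expression still gives a wrong result on your real data, one possible cause is null values in col1, col2 or col3. That is only an assumption, since the question does not say whether the Phoenix columns are nullable. In Spark SQL, adding a null to a number yields null, so a row with a single missing column contributes nothing to the sum. A minimal sketch of a null-safe variant follows; the object name NullSafeSum, the helper sumColumns and the local SparkSession are made up for illustration, and the sample rows are not from the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, sum}

object NullSafeSum {
  // Hypothetical helper: sums col1 + col2 + col3 per (id, date),
  // treating a null in any of the three columns as 0.
  def sumColumns(df: DataFrame): DataFrame = {
    val rowTotal = Seq("col1", "col2", "col3")
      .map(name => coalesce(col(name), lit(0))) // replace null with 0
      .reduce(_ + _)                            // col1 + col2 + col3
    df.groupBy(col("id"), col("date"))
      .agg(sum(rowTotal).alias("col4"))
      .sort("id")
  }

  def main(args: Array[String]): Unit = {
    // Local session for the sketch only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-safe-sum")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data: the second row has a null in col2.
    val df = Seq[(Int, String, Option[Int], Option[Int], Option[Int])](
      (1, "2017-01-01", Some(5), Some(10), Some(12)),
      (2, "2017-01-01", Some(6), None, Some(17))
    ).toDF("id", "date", "col1", "col2", "col3")

    sumColumns(df).show()
    spark.stop()
  }
}
```

    With the plain col1 + col2 + col3 expression, the second sample row would produce a null col4; with coalesce it produces 23. If nulls are not the problem, it would help to see the actual rows and the output you are getting.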