Search code examples
scalaapache-sparkwindowpartition

Spark Scala Aggregate Function to Find number of occurrence of a column value in a group


I have following data:

group_id    id  name
----        --  ----
G1          1   apple
G1          2   orange
G1          3   apple
G1          4   banana
G1          5   apple
G2          6   orange
G2          7   apple
G2          8   apple

I want to find uniqe occurance count in each group. So far I have done this

val group = Window.partitionBy("group_id")
newdf.withColumn("name_appeared_count", approx_count_distinct($"name").over(group))

I want the result like this:

group_id    id  name   name_appeared_count
----        --  ----   -------------------
G1          1   apple       3
G1          2   orange      1
G1          3   apple       3
G1          4   banana      1
G1          5   apple       3
G2          6   orange      1
G2          7   apple       2
G2          8   apple       2

Thanks in advance!


Solution

  • Method approx_count_distinct($"name").over(group) counts distinct name per group, hence isn't what you want based on your expected output. Using count("name") over partition("group_id", "name") will produce the wanted counts:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.expressions.Window
    
    val df = Seq(
      ("G1", 1, "apple"),
      ("G1", 2, "orange"),
      ("G1", 3, "apple"),
      ("G1", 4, "banana"),
      ("G1", 5, "apple"),
      ("G2", 6, "orange"),
      ("G2", 7, "apple"),
      ("G2", 8, "apple")
    ).toDF("group_id", "id", "name")
    
    val group = Window.partitionBy("group_id", "name")
    
    df.
      withColumn("name_appeared_count", count("name").over(group)).
      orderBy("id").
      show
    // +--------+---+------+-------------------+
    // |group_id| id|  name|name_appeared_count|
    // +--------+---+------+-------------------+
    // |      G1|  1| apple|                  3|
    // |      G1|  2|orange|                  1|
    // |      G1|  3| apple|                  3|
    // |      G1|  4|banana|                  1|
    // |      G1|  5| apple|                  3|
    // |      G2|  6|orange|                  1|
    // |      G2|  7| apple|                  2|
    // |      G2|  8| apple|                  2|
    // +--------+---+------+-------------------+