Tags: scala, apache-spark, bigdata, rdd, data-cleaning

Count the number of keys whose set contains each value


I have a pair RDD like this:

id   value
id1  Set(1232, 3, 1, 93, 35)
id2  Set(321, 42, 5, 13)
id3  Set(1233, 3, 5)
id4  Set(1232, 56, 3, 35, 5)

Now I want to get, for each value contained in the sets, the total count of ids whose set contains it. So the output for the above table should look like this:

value  count
    1232   2
    1      1
    93     1
    35     2
    3      3
    5      3
    321    1
    42     1
    13     1
    1233   1
    56     1

Is there a way to achieve this?


Solution

  • yourrdd.toDF().withColumn("_2", explode(col("_2"))).groupBy("_2").count.show
    (this requires import org.apache.spark.sql.functions.{col, explode} and import spark.implicits._ to be in scope)
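
For context, here is a self-contained sketch of that approach, assuming a local SparkSession; the session, object, and RDD names are illustrative, and the sets are converted to Seqs since encoder support for Set varies across Spark versions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

object CountIdsPerValue {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CountIdsPerValue")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Pair RDD matching the example data: (id, set of values).
    // The Set is converted to a Seq because encoder support
    // for Set differs between Spark versions.
    val yourrdd = spark.sparkContext.parallelize(Seq(
      ("id1", Set(1232, 3, 1, 93, 35)),
      ("id2", Set(321, 42, 5, 13)),
      ("id3", Set(1233, 3, 5)),
      ("id4", Set(1232, 56, 3, 35, 5))
    )).map { case (id, s) => (id, s.toSeq) }

    // toDF() names the tuple columns _1 (the id) and _2 (the values);
    // explode emits one row per element of _2, so groupBy("_2").count
    // tallies how many ids contain each value.
    yourrdd.toDF()
      .withColumn("_2", explode(col("_2")))
      .groupBy("_2")
      .count()
      .show()

    spark.stop()
  }
}

If you would rather stay at the RDD level, the equivalent is a flatMap over each set followed by reduceByKey, e.g. yourrdd.flatMap { case (_, s) => s.map((_, 1)) }.reduceByKey(_ + _).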