I have the following data where I need to group based on key and count the number based on key to monitor the metrics. I can use groupBy and do the count for that group but this involves some shuffle. Can we do without doing the shuffle ?
ID,TempID,PermanantID
----------
xxx, abcd, 12345
xxx, efg, 1345
xxx, ijk, 1534
xxx, lmn, 13455
xxx, null, 12345
xxx, axg, null
yyy, abcd, 12345
yyy, efg, 1345
yyy, ijk, 1534
zzz, lmn, 13455
zzz, abc, null
output should be
ID Count1 Count2
----------
XXX 5 5
YYY 3 3
ZZZ 2 1
I can do this with groupBy and count
dataframe.groupby("ID").agg(col("TempID").as("Count1"),count(col("PermanantID").as("Count2"))
can we do this using mapPartition ?
The question, whilst understandable, is flawed.
mapPartitions cannot be used directly on a dataframe, but on an RDD and Dataset.
Moreover, what about the partitioning and shuffling required prior to invoking the mapPartitions? Otherwise, the results will be incorrect. There is no mention of the guarantee of the order of the data initially in the question.
Hence, I would rely on the groupBy approach as postulated. It's an illusion to think that no shuffling is required in an App, rather we can reduce shuffling, that is the goal.