Tags: scala, dataframe, apache-spark, streaming, spark-structured-streaming

Using mapPartitions to avoid the shuffle with groupBy and count


I have the following data, which I need to group by key, counting the non-null values in each of the other columns to monitor some metrics. I can use groupBy and count per group, but that involves a shuffle. Can we do it without the shuffle?

ID, TempID, PermanantID
----------
xxx, abcd, 12345
xxx, efg, 1345
xxx, ijk, 1534
xxx, lmn, 13455
xxx, null, 12345
xxx, axg, null
yyy, abcd, 12345
yyy, efg, 1345
yyy, ijk, 1534
zzz, lmn, 13455
zzz, abc, null

The output should be:

ID Count1 Count2
----------
XXX 5 5
YYY 3 3
ZZZ 2 1

I can do this with groupBy and count:

import org.apache.spark.sql.functions.{col, count}

dataframe.groupBy("ID").agg(count(col("TempID")).as("Count1"), count(col("PermanantID")).as("Count2"))
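For reference, a minimal self-contained version of the above (the local SparkSession, app name, and inlined sample rows are illustrative assumptions; count(col) skips nulls, which is what makes Count2 = 1 for zzz):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count}

// Local session purely for the demo.
val spark = SparkSession.builder().master("local[*]").appName("groupByCount").getOrCreate()
import spark.implicits._

// Sample data from the question; null marks missing values.
val dataframe = Seq(
  ("xxx", "abcd", "12345"), ("xxx", "efg", "1345"),
  ("xxx", "ijk", "1534"), ("xxx", "lmn", "13455"),
  ("xxx", null, "12345"), ("xxx", "axg", null),
  ("yyy", "abcd", "12345"), ("yyy", "efg", "1345"),
  ("yyy", "ijk", "1534"),
  ("zzz", "lmn", "13455"), ("zzz", "abc", null)
).toDF("ID", "TempID", "PermanantID")

// count ignores nulls, yielding xxx -> (5, 5), yyy -> (3, 3), zzz -> (2, 1).
dataframe
  .groupBy("ID")
  .agg(count(col("TempID")).as("Count1"),
       count(col("PermanantID")).as("Count2"))
  .show()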

Can we do this using mapPartitions instead?


Solution

  • The question, whilst understandable, is flawed.

    mapPartitions cannot be invoked directly on a DataFrame; it is available on an RDD or a Dataset.

    Moreover, for per-key counts computed inside mapPartitions to be globally correct, every row for a given ID would first have to sit in the same partition, and getting the data into that layout requires a repartition by key, which is itself a shuffle. The question gives no guarantee about how the data is initially distributed across partitions.

    Hence, I would rely on the groupBy approach as posted. It is an illusion to think that an application can avoid shuffling altogether; the realistic goal is to reduce what gets shuffled. groupBy/agg already does exactly that through partial aggregation: each partition pre-computes partial counts, so only one small record per key and partition crosses the network. If you want to express the same idea by hand, see the sketch below.
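For illustration only, here is a minimal sketch of that hand-rolled alternative: partial counts per partition via mapPartitions on a Dataset, followed by a merge of the partials. The Record case class, the local SparkSession, and the inlined sample rows are assumptions made to keep the example self-contained; note that the merge step still shuffles, just far less data.

import org.apache.spark.sql.SparkSession
import scala.collection.mutable

// Hypothetical case class mirroring the question's columns.
case class Record(ID: String, TempID: String, PermanantID: String)

val spark = SparkSession.builder().master("local[*]").appName("partialCounts").getOrCreate()
import spark.implicits._

val ds = Seq(
  Record("xxx", "abcd", "12345"), Record("xxx", "efg", "1345"),
  Record("xxx", "ijk", "1534"), Record("xxx", "lmn", "13455"),
  Record("xxx", null, "12345"), Record("xxx", "axg", null),
  Record("yyy", "abcd", "12345"), Record("yyy", "efg", "1345"),
  Record("yyy", "ijk", "1534"),
  Record("zzz", "lmn", "13455"), Record("zzz", "abc", null)
).toDS()

// Step 1: per-partition partial counts, no shuffle. Each partition emits
// at most one (ID, count1, count2) triple per distinct ID it holds.
val partials = ds.mapPartitions { rows =>
  val acc = mutable.Map.empty[String, (Long, Long)]
  rows.foreach { r =>
    val (c1, c2) = acc.getOrElse(r.ID, (0L, 0L))
    acc(r.ID) = (c1 + (if (r.TempID != null) 1L else 0L),
                 c2 + (if (r.PermanantID != null) 1L else 0L))
  }
  acc.iterator.map { case (id, (c1, c2)) => (id, c1, c2) }
}

// Step 2: merge the partials. This still shuffles, but only the small
// partial results rather than every input row.
val result = partials
  .groupByKey(_._1)
  .reduceGroups((a, b) => (a._1, a._2 + b._2, a._3 + b._3))
  .map { case (_, (id, c1, c2)) => (id, c1, c2) }
  .toDF("ID", "Count1", "Count2")

result.show()

For comparison, running explain() on the groupBy/agg query from the question shows a HashAggregate computing partial_count before the Exchange, i.e. Spark's built-in aggregation already ships only the partial results across the network.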