
The usage of aggregate(0, lambda, lambda) in PySpark


Here is a PySpark code segment:

seqOp = (lambda x,y: x+y)
sum_temp = df.rdd.map(lambda x: len(x.timestamp)).aggregate(0, seqOp, seqOp)

The output of sum_temp is a numerical value, but I am not clear on how aggregate(0, seqOp, seqOp) works. It seems to me that, normally, aggregate just takes a single function such as "avg".

Moreover, df.rdd.map(lambda x: len(x.timestamp)) is of type pyspark.rdd.PipelinedRDD. How can we get its contents?


Solution

  • According to the docs, the aggregation process:

    1. Starts from the first argument as the zero-value (0),
    2. Then each partition of the RDD is aggregated using the second argument, and
    3. Finally, the aggregated partitions are combined into the final result using the third argument. Here, you sum up each partition, and then you sum the per-partition sums into the final result, as sketched right after this list.
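
    A minimal sketch of those three steps (assuming a running SparkContext named sc):

    rdd = sc.parallelize([1, 2, 3, 4], 2)      # two partitions: [1, 2] and [3, 4]
    seqOp = (lambda acc, value: acc + value)   # step 2: fold within each partition
    combOp = (lambda acc1, acc2: acc1 + acc2)  # step 3: merge the partition results
    total = rdd.aggregate(0, seqOp, combOp)    # step 1: 0 is the zero-value
    # partition one: 0 + 1 + 2 = 3; partition two: 0 + 3 + 4 = 7; combined: 3 + 7 = 10
    print(total)  # 10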

    You might have confused this aggregate with the aggregate method of dataframes. RDDs are lower-level objects and you cannot use dataframe aggregation methods here, such as avg/mean/etc.
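
    For comparison, a DataFrame-level aggregation would look something like this (a sketch, assuming df has a numeric column named value, which is hypothetical here):

    from pyspark.sql import functions as F

    df.agg(F.avg("value")).show()  # DataFrame API: avg/mean/etc. are available here
    # df.rdd, by contrast, exposes only the lower-level RDD API (aggregate, reduce, fold, ...)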

    To get the contents of the RDD, you can call rdd.take(1) to peek at the first element, or rdd.collect() to retrieve the whole RDD (mind that collect() pulls all the data onto the driver and can cause out-of-memory errors if the RDD is large).
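
    For example, with the RDD from the question:

    mapped = df.rdd.map(lambda x: len(x.timestamp))
    print(mapped.take(1))    # a list containing the first element
    print(mapped.collect())  # every element, pulled onto the driver as a Python list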