There is a PySpark code segment:

```python
seqOp = (lambda x, y: x + y)
sum_temp = df.rdd.map(lambda x: len(x.timestamp)).aggregate(0, seqOp, seqOp)
```

The output of sum_temp is a numerical value, but I am not clear how aggregate(0, seqOp, seqOp) works. It seems to me that, normally, aggregate just takes a single function like "avg".
Moreover, df.rdd.map(lambda x: len(x.timestamp)) is of type pyspark.rdd.PipelinedRDD. How can we get its contents?
According to the docs, RDD.aggregate(zeroValue, seqOp, combOp) works in two steps: it first folds the elements within each partition using seqOp, starting from zeroValue, and then merges the per-partition results using combOp. In your snippet the same summing lambda is passed for both roles, so the result is simply the total sum of all the mapped lengths.
You might have confused this aggregate with the aggregate method of DataFrames. RDDs are lower-level objects, and you cannot use DataFrame aggregation methods such as avg/mean on them.
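The two-step seqOp/combOp process can be sketched in plain Python without a Spark session. This is a hypothetical emulation for illustration (the partition contents here are made-up timestamp lengths, not your actual data):

```python
from functools import reduce

def emulate_aggregate(partitions, zero, seq_op, comb_op):
    # Step 1: fold each partition with seq_op, starting from zero.
    partials = [reduce(seq_op, part, zero) for part in partitions]
    # Step 2: merge the per-partition results with comb_op.
    return reduce(comb_op, partials, zero)

# Suppose the mapped RDD holds these lengths, split across 2 partitions.
partitions = [[3, 5], [2, 4]]
seqOp = lambda x, y: x + y

# Passing the same lambda for both roles, as in the question.
total = emulate_aggregate(partitions, 0, seqOp, seqOp)
print(total)  # 14
```

Because addition is associative and 0 is its identity, using the same function for seqOp and combOp gives the grand total regardless of how the data is partitioned.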
To get the contents of the RDD, you can use rdd.take(1) to inspect the first element, or rdd.collect() to retrieve the whole RDD (mind that this collects all data onto the driver and can cause memory errors if the RDD is huge).